Open rosemaryjoconnor opened 2 months ago
10/09/2024
11/09/2024
Databox Load
Production Load
13/09/2024
Check data resource after SOLR Index
Record counts
13/09/2024
The encoding issue seems to be caused by pandas. Niels has said that DuckDB reads the same data without problems, so this may be something we need to look into.
Counts for new records are not correct. I made a mistake and ran ingest_large_dataset; have rerun with load_dataset.
14/09/2024
dr10574 - successfully loaded via load_dataset
dr376 - successfully loaded via load_dataset
[ ] Check index tomorrow
dr10574 - Tasmania Herbarium - TMAG uploads directory
dr376 - Melbourne Herbarium - IPT
Both are failing with encoding errors. For TMAG it is likely a non-UTF-8 line break character; for dr376 it is just a non-UTF-8 character at a specific location. The option to clean up the data in pre-ingestion prior to load is not implemented yet.
Solution: run load_dataset for the Herbarium/IPT datasets, as per NZ herbarium, and don't use pre-ingestion.
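
To narrow down where the encoding errors come from, it may help to locate the offending bytes before loading. The following is a minimal sketch using only the Python standard library, not part of the existing pipeline: the function name `find_invalid_utf8` and the file name in the usage note are hypothetical, and it assumes the source file is expected to be UTF-8 and is small enough to read into memory.

```python
from pathlib import Path


def find_invalid_utf8(path, context=20):
    """List every spot in a file that cannot be decoded as UTF-8.

    Hypothetical helper (not part of load_dataset or pre-ingestion).
    Returns a list of (byte_offset, line_number, surrounding_bytes).
    """
    data = Path(path).read_bytes()
    problems = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode("utf-8")
            break  # everything from `start` onwards is valid UTF-8
        except UnicodeDecodeError as err:
            bad = start + err.start  # absolute offset of the first bad byte
            # Line numbers are 1-based and counted on "\n" only.
            line_no = data.count(b"\n", 0, bad) + 1
            snippet = data[max(0, bad - context):bad + context]
            problems.append((bad, line_no, snippet))
            start += err.end  # resume just after the invalid sequence
    return problems
```

Usage would look like `for offset, line, snippet in find_invalid_utf8("occurrence.csv"): print(offset, line, snippet)`, where `occurrence.csv` stands in for the actual data file. Printing the raw bytes around each offset should show whether the problem is a stray line break character or an ordinary character saved in another encoding.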