AtlasOfLivingAustralia / data-management

Data management issue tracking

Herbarium encoding issues - dr10574, dr376 #1105

Open rosemaryjoconnor opened 2 months ago

rosemaryjoconnor commented 2 months ago

dr10574 - Tasmania Herbarium - TMAG uploads directory

dr376 - Melbourne Herbarium - IPT

Both are failing with encoding errors. For TMAG the cause is likely a non-UTF-8 line break character; for dr376 it is just a non-UTF-8 character at a specific location. The option to clean up the data in pre-ingestion prior to load is not implemented yet.
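To pin down where the offending bytes sit, something like the following could help. It is a minimal sketch using only the Python standard library and is not part of the existing load tooling; the function name `find_invalid_utf8` and the `max_hits` parameter are made up for illustration. It reads the file in binary mode, so only `\n` splits lines and any unusual line break byte stays inside the line where it can be reported.

```python
def find_invalid_utf8(path, max_hits=20):
    """Return (line_number, byte_offset_in_line, offending_bytes) for each
    byte sequence that cannot be decoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    hits = []
    with open(path, "rb") as fh:
        # Binary iteration splits on b"\n" only.
        for line_no, raw in enumerate(fh, start=1):
            start = 0
            while start < len(raw):
                try:
                    raw[start:].decode("utf-8")
                    break  # rest of the line is valid UTF-8
                except UnicodeDecodeError as err:
                    # err.start / err.end are relative to the slice we decoded
                    bad_from = start + err.start
                    bad_to = start + err.end
                    hits.append((line_no, bad_from, raw[bad_from:bad_to]))
                    if len(hits) >= max_hits:
                        return hits
                    start = bad_to  # resume after the invalid sequence
    return hits
```

Calling `find_invalid_utf8("occurrence.txt")` (file name assumed) returns an empty list for a clean file, otherwise the line and byte offset of each problem, which can then be checked against the source record.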

Solution: run load_dataset with the Herbarium/IPT datasets, as was done for the NZ herbarium; don't use pre-ingestion.

rosemaryjoconnor commented 2 months ago

10/09/2024

rosemaryjoconnor commented 2 months ago

11/09/2024

Databox Load

Production Load

rosemaryjoconnor commented 2 months ago

13/09/2024

Check data resource after SOLR Index

Record counts

rosemaryjoconnor commented 2 months ago

13/09/2024

The encoding issue seems to be due to pandas. Niels has said that there is no problem reading the data using DuckDB. This may be something we need to look into.
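If pandas stays in the load path, one possible workaround is to transcode the file to clean UTF-8 before it is read. The sketch below is an assumption about how that could look, not how the current tools work: `to_clean_utf8` and the `cp1252` fallback are illustrative choices, and the real source encoding would need to be confirmed with the data provider.

```python
def to_clean_utf8(src, dst, fallback="cp1252"):
    """Rewrite src as valid UTF-8 in dst and return the number of lines
    that needed the fallback encoding.

    Hypothetical helper; the fallback encoding is an assumption.
    """
    repaired = 0
    # newline="" stops Python translating line endings on write.
    with open(src, "rb") as fin, open(dst, "w", encoding="utf-8", newline="") as fout:
        for raw in fin:
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Bytes undefined in the fallback become U+FFFD instead of raising.
                text = raw.decode(fallback, errors="replace")
                repaired += 1
            fout.write(text)
    return repaired
```

The return value gives a quick count of affected lines to compare against the record counts. pandas' `read_csv` also accepts an `encoding_errors` argument (pandas 1.3 or later) that could be tried instead, though it would replace the bad bytes rather than reinterpret them.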

Counts for new records are not correct. I made a mistake and ran ingest_large_dataset, so have rerun with load_dataset.

rosemaryjoconnor commented 2 months ago

14/09/2024