Herbarium encoding issues - dr10574, dr376

AtlasOfLivingAustralia / data-management

Data management issue tracking


Herbarium encoding issues - dr10574, dr376 #1105

Open rosemaryjoconnor opened 2 months ago

rosemaryjoconnor commented 2 months ago

dr10574 - Tasmanian Herbarium - TMAG uploads directory
dr376 - Melbourne Herbarium - IPT

Both are failing with encoding errors. For TMAG (dr10574) the cause is likely a non-UTF-8 line break character; for dr376 it is a single non-UTF-8 character at a specific location. The option to clean up the data in pre-ingestion prior to load is not implemented yet.
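To pin down where the bad bytes are, something like the following could be used. This is only a minimal sketch and not part of the existing load tooling; the function name `find_invalid_utf8` is made up for illustration. It reports the byte offset and the offending bytes of every sequence that does not decode as UTF-8.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (byte_offset, offending_bytes) for each undecodable sequence.

    Hypothetical helper for illustration only.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of the remainder; succeeds once no bad bytes are left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice, so shift by pos.
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning just past the bad sequence
    return problems
```

For a file on disk, the bytes could be passed in with `Path(...).read_bytes()`; the reported offsets can then be checked in a hex viewer to see whether the culprit is a line break character or something else.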

Solution: run load_dataset for the Herbarium/IPT datasets, as was done for the NZ herbarium, and don't use pre-ingestion.

rosemaryjoconnor commented 2 months ago

10/09/2024

Both loaded in databox successfully
NK to check before load to Production

rosemaryjoconnor commented 2 months ago

11/09/2024

Databox Load

[x] dr10574: Tasmanian Herbarium
[x] dr376 - Melbourne Herbarium

Production Load

[x] dr10574: Tasmanian Herbarium
[x] dr376 - Melbourne Herbarium

rosemaryjoconnor commented 2 months ago

13/09/2024

Check data resource after SOLR Index

[x] dr10574: Tasmanian Herbarium
[x] dr376 - Melbourne Herbarium

Record counts

[x] dr10574: Tasmanian Herbarium Old: 277,493 New: 277,493
[x] dr376 - Melbourne Herbarium Old: 1,070,469 New: 1,068,409
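For reference, the change in record count per data resource can be worked out from the figures above. A small sketch; the helper name `count_change` is made up for illustration.

```python
# Old and new record counts as listed in the checklist above.
counts = {
    "dr10574": (277_493, 277_493),
    "dr376": (1_070_469, 1_068_409),
}


def count_change(old: int, new: int) -> int:
    """Return new minus old; a negative value means records were lost."""
    return new - old


for uid, (old, new) in counts.items():
    print(f"{uid}: old={old:,} new={new:,} change={count_change(old, new):+,}")
```

dr10574 is unchanged, while dr376 comes out 2,060 records lower than before.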

rosemaryjoconnor commented 2 months ago

13/09/2024

The encoding issue seems to be due to pandas. Niels has said that DuckDB reads the data without any problem. This may be something we need to look into.
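One thing that could be tried while looking into this is a tolerant read that replaces undecodable bytes instead of failing. The sketch below uses only the Python standard library, not pandas or DuckDB; the function name `read_rows_tolerant` and the tab delimiter are assumptions for illustration, not what the loaders actually do.

```python
import csv
import io
from typing import List


def read_rows_tolerant(raw: bytes, delimiter: str = "\t") -> List[List[str]]:
    """Parse delimited bytes, replacing undecodable bytes with U+FFFD.

    Hypothetical helper; the tab delimiter is an assumption.
    """
    # errors="replace" swaps each bad sequence for U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # newline="" leaves line endings untouched so the csv module handles them.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)
```

Rows that contain U+FFFD after such a read show which records carry the bad characters. pandas `read_csv` has an `encoding_errors` parameter (pandas 1.3 and later) that gives similar behaviour, but that has not been tested against these datasets.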

The new record counts are not correct: I made a mistake and ran ingest_large_dataset. Have rerun with load_dataset.

dr376 loaded successfully, just need to wait for the index run on Monday night
dr10574 is still having an issue

rosemaryjoconnor commented 2 months ago

14/09/2024

dr10574 - successfully loaded via load_dataset
dr376 - successfully loaded via load_dataset
[ ] Check index tomorrow