AtlasOfLivingAustralia / data-management

Data management issue tracking

Merge Data Load: dr348 (https://collections.ala.org.au/public/show/dr348) #1047

Closed cha801p closed 2 months ago

cha801p commented 2 months ago

Ticket Update: April 5, 2024 (3 PM)

Issue: Data Refresh

Solution: Successfully loaded the new dataset on test and prod

Actions Taken:

[x] Data review

[x] Columns renamed separately for each collection

[x] Columns dropped

[x] Columns rearranged

[x] Duplicates deleted

[x] Date format fix

[x] location column fixed - weird characters coming in

[x] Collection data concatenated

[x] DwcA created locally

[x] Loaded the data on collectory-test

[x] Triggered preingest on databox - FAILED

Problems encountered: ValueError: Duplicate names are not allowed.

[x] Ran SOLR_Dataset_Indexing on databox

[x] Logs checked to confirm everything looks alright

[x] Loaded the data on collectory

[x] Triggered preingest on prod
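
The `ValueError: Duplicate names are not allowed.` reported for the databox preingest matches the message pandas raises when a list of column names contains repeats. A plausible cause (an assumption, since the ticket does not show the traceback) is that two columns ended up with the same name after the per-collection renames and concatenation. Below is a minimal pre-flight check using only the Python standard library, assuming the occurrence file in the archive is delimited text with a header row. The function name `find_duplicate_columns` and the sample header are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_columns(csv_text, delimiter=","):
    """Return header names that occur more than once, in first-seen order."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])  # empty input yields an empty header
    counts = Counter(header)
    duplicates = []
    for name in header:
        # Exact-match comparison, mirroring how a duplicate-name check behaves.
        if counts[name] > 1 and name not in duplicates:
            duplicates.append(name)
    return duplicates


# Hypothetical header in which "locality" appears twice after renaming.
sample = "catalogNumber,locality,eventDate,locality\nWAM-1,Perth,2001-02-03,Perth\n"
print(find_duplicate_columns(sample))
```

Running a check like this on the locally built file before triggering preingest would surface repeated column names early, without waiting for the pipeline to fail.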

Loaded data for review: Metadata (https://collections.ala.org.au/public/show/dr348), data (https://biocache.ala.org.au/occurrences/search?q=data_resource_uid:dr348)

PROD Logs:

24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: newUuids: 26971.0, preservedUuids: 760390.0, orphanedUniqueKeys: 8392.0
24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: Percentage UUID change: 3, allowed percentage: 50, override percentage check: false
24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: Backing up existing UUIDs to hdfs:///pipelines-data/dr348/1/identifiers/ala_uuid_backup_1712283937721
24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: Pipeline complete.
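
As a quick sanity check on the UUID summary, the counts can be pulled out of the log line and compared. The log does not state how the pipeline derives its "Percentage UUID change" value. The sketch below only shows that the ratio of new to preserved UUIDs, truncated to an integer, is one reading consistent with the logged value of 3. That formula is an assumption, not the pipeline's confirmed calculation, and `parse_uuid_counts` is a hypothetical helper.

```python
import re

LOG_LINE = (
    "24/04/05 02:25:37 INFO ALAUUIDMintingPipeline: newUuids: 26971.0, "
    "preservedUuids: 760390.0, orphanedUniqueKeys: 8392.0"
)


def parse_uuid_counts(line):
    """Extract the three UUID counters from the summary log line."""
    pairs = re.findall(
        r"(newUuids|preservedUuids|orphanedUniqueKeys): (\d+(?:\.\d+)?)", line
    )
    return {name: float(value) for name, value in pairs}


counts = parse_uuid_counts(LOG_LINE)

# Assumed formula: new UUIDs as a percentage of preserved UUIDs.
new_vs_preserved = 100 * counts["newUuids"] / counts["preservedUuids"]
# Orphaned keys as a percentage of preserved UUIDs, for comparison.
orphaned_vs_preserved = 100 * counts["orphanedUniqueKeys"] / counts["preservedUuids"]

print(f"new vs preserved: {new_vs_preserved:.2f}%")
print(f"orphaned vs preserved: {orphaned_vs_preserved:.2f}%")
```

Whichever formula the pipeline uses, the logged change of 3 is well under the allowed 50, which is why the run proceeded with the override check set to false.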

Useful stats from prod before indexing (for testing purposes): 519,454 records returned of 715,410 for Data resource: Western Australian Museum provider for OZCAM

Exclude spatially suspect records (8,837 records excluded)

Exclude records based on scientific name quality (35,323 records excluded)

Exclude records with additional spatial quality issues (1,573 records excluded)

Exclude duplicate records (2,843 records excluded)

Exclude records based on location uncertainty (151,685 records excluded)

Exclude records with unresolved user annotations (103 records excluded)

Exclude records that are environmental outliers (2,445 records excluded)

Exclude records based on record type (0 records excluded)

Exclude absence records (0 records excluded)

Exclude records pre-1700 (0 records excluded)
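
The exclusion counts can be cross-checked against the totals quoted above. This sketch assumes that 519,454 is the count remaining after all listed filters are applied together, and that each per-filter figure is counted independently against the full 715,410 records. Neither assumption is confirmed in the ticket. Under those assumptions, the gap between the summed per-filter counts and the net reduction is the overlap contributed by records caught by more than one filter.

```python
total_records = 715_410
returned_records = 519_454

# Per-filter exclusion counts as quoted in the ticket.
excluded_by_filter = {
    "spatially suspect": 8_837,
    "scientific name quality": 35_323,
    "additional spatial quality issues": 1_573,
    "duplicate records": 2_843,
    "location uncertainty": 151_685,
    "unresolved user annotations": 103,
    "environmental outliers": 2_445,
    "record type": 0,
    "absence records": 0,
    "pre-1700": 0,
}

net_excluded = total_records - returned_records
summed_exclusions = sum(excluded_by_filter.values())
# Extra exclusions counted because some records match several filters.
overlap = summed_exclusions - net_excluded

print(f"net excluded: {net_excluded:,}")
print(f"sum of per-filter counts: {summed_exclusions:,}")
print(f"counted more than once: {overlap:,}")
```

Location uncertainty alone accounts for 151,685 of the exclusions, so it is the filter to look at first if the returned count seems low after a reload.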

**Status:**

peggynewman commented 2 months ago

Brilliant, thanks for documenting all of this Raj.

peggynewman commented 2 months ago

Let's raise a new ticket if there is an issue on our side. Closing.