NSW AVH feed - Data Refresh

cha801p commented 3 months ago

Issue: Data Refresh - NSW AVH

Solution: Successfully load the dataset on biocache-test

Actions Taken:

[x] Data review
[x] Date format fixed
[x] DwCA created locally
[x] File uploaded to collectory-test
[x] Triggered pre-ingestion

Issues Encountered:

The data was oddly formatted, so reading it with UTF-8 encoding failed
Attempted reading the data using multiple encodings
Experimented with converting the TXT file to CSV to read the data - Unsuccessful
SOLR_dataset_indexing FAILED with the following error:

24/07/04 02:21:12 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.

_Error from server at http://aws-solr-test-2.ala:8983/solr/biocache-2024-06-04-06-20_shard2_replica_n3: ERROR: [doc=0909f47f-11d8-4bf0-ba32-e6157f790a8f] Error adding field 'elevationSource'='Collector' msg=For input string: "Collector"_

Troubleshooting: Column elevationSource was NaN in every row except the last, which held the value Collector. The column was deleted to eliminate the error, and the issue was reported to Systems.
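A check like the one below could catch this kind of value before indexing. It is only a sketch: the wording of the Solr message ("For input string") suggests the index was treating elevationSource as numeric, but the thread does not confirm that. The helper name `non_numeric_values`, the tab delimiter, and the sample rows are all made up for illustration.

```python
import csv
import io


def non_numeric_values(rows, column):
    """Return (row_number, value) pairs where `column` is filled but not a number."""
    bad = []
    for row_number, row in enumerate(rows, start=1):
        value = (row.get(column) or "").strip()
        # Empty cells and literal NaN markers are treated as missing, not as errors.
        if not value or value.lower() == "nan":
            continue
        try:
            float(value)
        except ValueError:
            bad.append((row_number, value))
    return bad


# Illustrative sample only: three rows, the last one carrying the stray value.
sample = "catalogNumber\televationSource\nA1\t\nA2\t\nA3\tCollector\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

print(non_numeric_values(rows, "elevationSource"))
```

Running the same check over each column expected to be numeric would flag the row before the file reaches the indexing step.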

Logs on Test:

24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: newUuids: 13613.0, preservedUuids: 772750.0, orphanedUniqueKeys: 1886.0
24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: Percentage UUID change: 1, allowed percentage: 50, override percentage check: false
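For anyone reading these numbers later, here is a rough sketch of how the logged percentage can be reproduced from the counts. The formula the pipeline actually uses is not shown in this thread; the assumption here is new UUIDs divided by preserved UUIDs, truncated to a whole percent, which happens to match the logged value of 1. The function names are hypothetical.

```python
import re

LOG_LINE = (
    "24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: newUuids: 13613.0, "
    "preservedUuids: 772750.0, orphanedUniqueKeys: 1886.0"
)


def parse_counts(line):
    """Extract `name: number` pairs from a log line into a dict of floats."""
    return {name: float(value) for name, value in re.findall(r"(\w+): (\d+(?:\.\d+)?)", line)}


def percentage_change(counts):
    """Assumed formula (not confirmed by the thread): new / preserved, as a whole percent."""
    return int(counts["newUuids"] / counts["preservedUuids"] * 100)


counts = parse_counts(LOG_LINE)
change = percentage_change(counts)
allowed = 50  # "allowed percentage" from the log line above

print(counts)
print(change, change <= allowed)
```

Under that assumption the refresh sits well below the 50 percent threshold, so no override of the percentage check would have been needed.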

cha801p commented 3 months ago

Issue: Data Refresh - NSW AVH

Solution: Successfully load the dataset on biocache

Actions Taken:

Data review
DwCA created locally
File uploaded to collectory
Triggered pre-ingestion

Links: Metadata: data

Stats:

Old occurrence count - 772,750 records
Current occurrence count - 786,363 records
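As a quick sanity check, these counts line up with the UUID figures logged on test (preservedUuids 772750, newUuids 13613). The sketch below only does the arithmetic; that the test and production runs used the same file is an assumption, and the variable names are made up.

```python
old_count = 772_750       # old occurrence count
current_count = 786_363   # current occurrence count

# Figures taken from the ALAUUIDMintingPipeline log on test.
preserved_uuids = 772_750
new_uuids = 13_613

added = current_count - old_count
growth_percent = round(added / old_count * 100, 2)

print(added, growth_percent)
print(preserved_uuids + new_uuids == current_count)
```

The net increase equals the number of newly minted UUIDs, and preserved plus new UUIDs equals the current count.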

Status:

Waiting for images to be uploaded to images.ala
Re-ingest the data once all the images are uploaded to link images to occurrences

cha801p commented 2 months ago

Ticket Update: July 17, 2024 (10:30 AM)

Issue: NSW AVH - Data encoding issue

Solution: Successfully load the dataset with UTF-8 encoding

Actions Taken:

Multiple attempts were made to load the data, but it failed every time
Wrote a script to read the data in the appropriate encoding
The data was read as ISO-8859-1 and loaded on prod
The data provider reported that characters were displaying inconsistently
To address the issue, the data was ingested after removing special characters with a bash command: iconv -c -f utf-8 -t ascii occurrance.txt > occurrance.csv
The data provider reported - "The parenthetic authors on the example I’ve supplied should read Sessé & DC. ex Moç., so it still isn’t displaying correctly - it just looks like you’ve removed the problematic characters."
The data provider later identified a UTF-8 encoding issue on their side, introduced when submitting the data
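The sketch below shows why neither workaround could display the name correctly, using the author string from the provider's message. It is illustrative only: the real file's bytes are not shown in this thread, `decode_with_fallback` is a hypothetical helper, and the ASCII step only approximates what iconv -c does when it drops characters it cannot convert.

```python
# Author string from the provider's message, written with escapes for clarity.
original = "Sess\u00e9 & DC. ex Mo\u00e7."

# Roughly what converting to ASCII and discarding unconvertible characters does:
# the accented letters are dropped rather than fixed.
stripped = original.encode("ascii", errors="ignore").decode("ascii")

# UTF-8 bytes read as ISO-8859-1 always decode, but into mojibake.
mojibake = original.encode("utf-8").decode("iso-8859-1")

# That particular mistake is reversible, because ISO-8859-1 maps every byte.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")


def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in order; return the text and the encoding that worked."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(stripped)
print(mojibake)
print(repaired == original)
print(decode_with_fallback(b"Sess\xe9"))
```

Because ISO-8859-1 accepts any byte sequence, a successful read in that encoding says nothing about whether the text is right, which is consistent with the load succeeding while the names still looked wrong.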

Links: Metadata: data

Status: The data provider has identified the issue with UTF-8 encoding on their end and is working on reformatting the data.

Follow-up email sent on 17th July

cha801p commented 1 month ago

Data has been reloaded and the data provider has been informed.

Here is a brief conversation: https://support.ehelp.edu.au/a/tickets/205967

It has another issue linked to it, which is mentioned below: https://github.com/AtlasOfLivingAustralia/data-management/issues/1072

_Note how on the following example the identifiedByIDs are linking correctly with no duplication, but the recordedByIDs are duplicated and the links for the secondary collector are also malformed:

https://avh.ala.org.au/occurrences/3d41575f-30ab-433e-a00c-eb45c95ebaf1

The problem occurs with other datasets as can be seen in this example from MEL:

https://avh.ala.org.au/occurrences/9b3081fc-cb48-44e6-a9d6-00409e6e1188_

AtlasOfLivingAustralia / data-management

NSW AVH feed - Data Refresh #1088