Closed cha801p closed 1 month ago
Issue: Data Refresh - NSW AVH
Solution: Successfully loaded the dataset on biocache
Actions Taken:
- Data review
- DwCA created locally
- File uploaded to collectory
- Triggered pre-ingestion
Stats:
- Old occurrence count: 772,750 records
- Current occurrence count: 786,363 records
Status:
- Waiting for images to be uploaded to images.ala
- Re-ingest the data once all the images are uploaded, to link images to occurrences
Ticket Update: July 17, 2024 (10:30 AM)
Issue: NSW AVH - Data encoding issue
Solution: Successfully loaded the dataset with UTF-8 encoding
Actions Taken:
- Multiple attempts were made to load the data, but each one failed
- Wrote a script to read the data in the appropriate encoding
- The data was read using ISO-8859-1 encoding and loaded to prod
- The data provider reported the inconsistency
- To address the issue, the data was ingested after removing special characters with the bash command `iconv -c -f utf-8 -t ascii occurrance.txt > occurrance.csv`
- Data provider reported - "The parenthetic authors on the example I’ve supplied should read Sessé & DC. ex Moç., so it still isn’t displaying correctly - it just looks like you’ve removed the problematic characters."
- Later, the data provider identified a UTF-8 encoding issue on their side that occurred when submitting the data
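The two approaches described above can be sketched in Python. This is a minimal illustration, not the script that was actually used: the sample string comes from the data provider's message, and the helper names `strip_to_ascii` and `read_with_fallback` are hypothetical. It shows why `iconv -c -f utf-8 -t ascii` only drops the accented characters, and how trying encodings in order allows a file to be read when strict UTF-8 decoding fails.

```python
def strip_to_ascii(text: str) -> str:
    """Mimic the effect of `iconv -c ... -t ascii`: drop characters with no ASCII form."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def read_with_fallback(raw: bytes, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


author = "Sessé & DC. ex Moç."

# Stripping removes the accented letters rather than preserving them.
stripped = strip_to_ascii(author)

# Bytes written as ISO-8859-1 are not valid UTF-8 here, so the fallback is used.
latin1_bytes = author.encode("iso-8859-1")
text, used = read_with_fallback(latin1_bytes)
```

Note that ISO-8859-1 can decode any byte sequence, so a successful fallback read does not by itself prove that the file was really written in that encoding; the result still needs checking against known values such as author names.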
Status: - The data provider has identified the issue with UTF-8 encoding on their end and is working on the data reformatting.
Data has been reloaded and the data provider has been informed.
Here is a brief conversation: https://support.ehelp.edu.au/a/tickets/205967
It has another issue linked to it, which is mentioned below: https://github.com/AtlasOfLivingAustralia/data-management/issues/1072
_Note how on the following example the identifiedByIDs are linking correctly with no duplication, but the recordedByIDs are duplicated and the links for the secondary collector are also malformed:
https://avh.ala.org.au/occurrences/3d41575f-30ab-433e-a00c-eb45c95ebaf1
The problem occurs with other datasets as can be seen in this example from MEL:
https://avh.ala.org.au/occurrences/9b3081fc-cb48-44e6-a9d6-00409e6e1188_
Issue: Data Refresh - NSW AVH
Solution: Successfully loaded the dataset on biocache-test
Actions Taken:
Links: Metadata: data
Issues Encountered:
- The data was formatted unusually, so reading it with UTF-8 encoding failed
- Attempted reading the data using multiple encodings
- Experimented with converting the TXT file to CSV to read the data - unsuccessful
- SOLR_dataset_indexing FAILED with the following error:
24/07/04 02:21:12 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
_Error from server at http://aws-solr-test-2.ala:8983/solr/biocache-2024-06-04-06-20_shard2_replica_n3: ERROR: [doc=0909f47f-11d8-4bf0-ba32-e6157f790a8f] Error adding field 'elevationSource'='Collector' msg=For input string: "Collector"_
Troubleshooting: The elevationSource column was NaN in every row except the last, which held the value Collector. The column was deleted to eliminate the error, and the issue was reported to the Systems
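A check like the following could surface this kind of value before indexing. It is only a sketch: it assumes the indexer treated elevationSource as a numeric field (which the `For input string` message suggests), the helper name `non_numeric_values` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def non_numeric_values(rows, column):
    """Return (row_number, value) pairs whose value cannot be parsed as a number."""
    bad = []
    for row_number, row in enumerate(rows, start=1):
        value = (row.get(column) or "").strip()
        if not value:
            continue  # empty cells are treated as missing, not as errors
        try:
            float(value)  # "NaN" parses as a float, so it is not flagged
        except ValueError:
            bad.append((row_number, value))
    return bad


# Made-up tab-delimited sample: blank, NaN, then a text value in the last row.
sample = "occurrenceID\televationSource\na\t\nb\tNaN\nc\tCollector\n"
rows = csv.DictReader(io.StringIO(sample), delimiter="\t")
problems = non_numeric_values(rows, "elevationSource")
```

Listing the offending rows first makes it possible to decide whether to drop the column, as was done here, or to correct the individual values.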
Logs on Test:
24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: newUuids: 13613.0, preservedUuids: 772750.0, orphanedUniqueKeys: 1886.0
24/07/04 01:45:22 INFO ALAUUIDMintingPipeline: Percentage UUID change: 1, allowed percentage: 50, override percentage check: false
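The numbers in this log can be cross-checked against the occurrence counts reported for the refresh. The sketch below uses only values from the log; the percentage formula (new UUIDs relative to preserved UUIDs) is an assumption about how the pipeline computes the figure, not taken from its source, though it is consistent with the logged value of 1 if the result is truncated.

```python
new_uuids = 13_613
preserved_uuids = 772_750
orphaned_unique_keys = 1_886  # logged, but not used in the assumed formula below
allowed_percentage = 50

# Preserved plus newly minted UUIDs gives the record count after the refresh.
total_after_refresh = preserved_uuids + new_uuids

# Assumed formula: share of new UUIDs relative to preserved ones, in percent.
percentage_change = new_uuids / preserved_uuids * 100

# The load proceeds only if the change stays within the allowed percentage.
check_passes = percentage_change <= allowed_percentage
```

The total comes to 786,363, which matches the current occurrence count reported for the dataset, and the change is well under the allowed 50 percent.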