AtlasOfLivingAustralia / data-management

Data management issue tracking

Merge Data Load : QVMAG dr345 #983

Closed cha801p closed 3 months ago

cha801p commented 1 year ago

Metadata

Data Prep

Data Load

cha801p commented 1 year ago

Ticket Update: October 3, 2023 (4:30 PM)

Issue: Data Refresh for QVMAG dr345 - Full Data Load

Solution: Successfully loaded the new dataset into biocache.

Actions Taken:

  1. Reviewed the data and obtained the multimedia.csv and occurrence.csv files.
  2. Loaded the data to databox without images.
  3. Verified the UUID count.
  4. Ran SOLR_dataset_indexing to test the data on databox.
  5. Data loaded successfully on databox; the record count changed from 112,885 to 116,027.

Next Step

Questions: With this load, the data provider supplied a multimedia file containing links that point directly to the images. In the previous load, we received the raw images from the data provider and uploaded them to images.ala to generate the links that associate occurrences with images. Should we ask the data provider whether future loads will also come with direct links, and then decide on one of the following approaches:

  1. Delete the current images on images.ala and then reload the multimedia.csv.
  2. Simply load the data without uploading multimedia.csv since there are no new images in this data load.

cha801p commented 1 year ago

Answer: Regarding the question above, I spoke with Peegy and Mahmoud this morning about the images. We agreed that the solution is to load the data together with the new links in the multimedia.csv file. Reason: the system generates a hash value for images, which means the new images will be downloaded and the links for these images should be updated.

Ticket Update: October 5, 2023 (5:30 PM)

Issue: Data Refresh for QVMAG dr345 - Full Data Load

Solution: Successfully loaded the new dataset into biocache.

Actions Taken:

  1. The data provider sent occurrence.csv (116,027 records) and multimedia.csv (3,544 records).
  2. Manually created a DwC-A and uploaded it to the collectory.
  3. Ran preingestion: FAILED because the DwC-A was not well formed. (xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 56, column 1 Command exiting with ret '1')
  4. To create a DwC-A locally, added institutionCode and collectionCode to multimedia.csv (csv_to_dwca.py fails otherwise).
  5. Created the DwC-A locally and uploaded it to the collectory.
  6. Successfully ran preingestion with the following parameters: { "datasetIds": "dr345", "load_images": "true", "instanceType": "m6g.xlarge", "extra_args": "{}", "override_uuid_percentage_check": "false" }

Logs: UUID-
23/10/04 05:48:08 INFO ALAUUIDMintingPipeline: Checking the percentage change in new UUIDs:
23/10/04 05:48:08 INFO ALAUUIDMintingPipeline: newUuids: 3142.0, preservedUuids: 112885.0, orphanedUniqueKeys: 55.0
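
The UUID counts in the log above can be reconciled with the record counts reported earlier in this ticket. A rough sketch that parses the log line and sums the counts; the function names are hypothetical, and the percentage shown (new UUIDs as a share of all records after the load) is an assumption, since the formula the pipeline uses for its check is not stated here:

```python
import re

LOG_LINE = (
    "23/10/04 05:48:08 INFO ALAUUIDMintingPipeline: "
    "newUuids: 3142.0, preservedUuids: 112885.0, orphanedUniqueKeys: 55.0"
)


def parse_uuid_counts(line):
    """Extract the 'name: number' pairs from a UUID minting log line."""
    pairs = re.findall(r"(\w+): (\d+(?:\.\d+)?)", line)
    return {name: float(value) for name, value in pairs}


def summarise(counts):
    """Total records after the load and the share of newly minted UUIDs.

    Assumption: percentage = new / (new + preserved) * 100. The real
    pipeline may use a different denominator.
    """
    new = counts["newUuids"]
    preserved = counts["preservedUuids"]
    total = new + preserved
    return {"total": int(total), "new_pct": new / total * 100}


if __name__ == "__main__":
    counts = parse_uuid_counts(LOG_LINE)
    summary = summarise(counts)
    print(counts)
    print(f"total={summary['total']}, new={summary['new_pct']:.2f}%")
```

The preserved count (112,885) matches the previous record count, and preserved plus new gives 116,027, the record count of the new occurrence.csv.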

images/batchUpload- https://images.ala.org.au/admin/batchUpload/397069363 total: 48, completed: 48, loading: 0, queued: 0, stopped: 0

Validation: