Open peggynewman opened 3 years ago
Server inaccessible, AM working on it.
IMu driver is implemented. Now reading the other related tables needed to make up the final DwCA.
Update 29/02/2024
Databox load:
Waiting on:
Mtg: 01/03/2024
Outstanding for AM
Outstanding for ALA
27/03/2024
New IP address whitelisted by AM.
To do:
28/03/2024 - Meeting
ALA
AM
In the AM data in test, there are roughly 5k records that don't have AM as an institution:
In this sample record it's clear that there is no provider map for the collectionCode value "Malacology, Evolutionary Biology Unit".
I don't think this is actually a collection, but Malacology is, so we probably need to update either the collectionCode in the data or the collection name in the collectory (collections list here), and possibly the provider map if there is a new collection.
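A rough way to list the collectionCode values that have no provider map is sketched below. It assumes the occurrence data is available as CSV rows with a `collectionCode` column and that the mapped codes can be supplied as a plain list; the function name and the sample rows are hypothetical, not taken from the actual pipeline.

```python
import csv
import io
from collections import Counter


def unmapped_collection_codes(rows, mapped_codes):
    """Count records whose collectionCode has no provider map entry."""
    mapped = {code.strip().lower() for code in mapped_codes}
    counts = Counter()
    for row in rows:
        code = (row.get("collectionCode") or "").strip()
        if code.lower() not in mapped:
            counts[code] += 1
    return counts


# Hypothetical sample data standing in for the real occurrence file.
sample = io.StringIO(
    "occurrenceID,collectionCode\n"
    "1,Malacology\n"
    '2,"Malacology, Evolutionary Biology Unit"\n'
    '3,"Malacology, Evolutionary Biology Unit"\n'
)
rows = list(csv.DictReader(sample))
missing = unmapped_collection_codes(rows, mapped_codes=["Malacology"])
for code, n in missing.most_common():
    print(f"{n:>6}  {code!r}")
```

Sorting by count puts the values that affect the most records first, which helps decide whether to fix the data or add a provider map.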
12/04/2024
Status Update
Image files associated with multiple occurrences
Image file URLs are treated as existing even when the image is not actually on the server, so image-service thinks the result is a text file. I tested this with a Python script on the 200 URLs I had: 122 of them resolve as valid URLs (which they are) but return a file type of 'javascript' and a filename of 'request.php'. I checked these on AM: the images do not display, but the record has the multimedia flag set to yes. There are likely thousands of these, probably image files still to be updated on their side.
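The check described above could look something like the following. This is a sketch rather than the script that was actually run: it assumes the server answers HEAD requests and reports the file type through the `Content-Type` and `Content-Disposition` headers, and both function names are made up for illustration.

```python
import urllib.request


def looks_like_image(content_type, filename=""):
    """Return True only when the response headers describe an image."""
    main_type = (content_type or "").split(";")[0].strip().lower()
    return main_type.startswith("image/") and not filename.lower().endswith(".php")


def check_url(url, timeout=10):
    """Fetch the headers for `url` and report whether it appears to serve an image."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type", "")
        filename = response.headers.get_filename() or ""
    return looks_like_image(content_type, filename)
```

Running `check_url` over the URL list and keeping only those that return `True` would give a set that is safe to pass to image-service; the rest can be reported back to AM.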
AMFetcher can't run on databox or prod due to the image problem. I ran it locally and produced the CSV file, then updated it to keep only associatedMedia. I couldn't load the whole file to the collectory, so I loaded 500K records. The records themselves loaded fine.
I tested running the fetcher fully locally. When it gets to returning the occurrence file and goes on to create the DwCA, I get an error: File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 43 fields in line 1211046, saw 75.
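One way to find the offending rows before handing the file to pandas is to scan it with the standard `csv` module and report every row whose field count differs from the header. The sketch below assumes a comma-delimited file with a header row; the function name and sample data are illustrative only, and line numbers may not match the pandas message exactly if quoted fields contain embedded newlines.

```python
import csv
import io


def find_bad_rows(lines, delimiter=","):
    """Return (expected_field_count, [(line_number, field_count), ...])."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        # Skip blank lines; flag any row that is wider or narrower than the header.
        if row and len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return expected, bad


# Hypothetical sample: the third line has two extra fields.
sample = io.StringIO("a,b,c\n1,2,3\n4,5,6,7,8\n9,10,11\n")
expected, bad = find_bad_rows(sample)
print(f"expected {expected} fields; bad rows: {bad}")
```

For the real file, pass an open file handle (opened with `newline=""`) instead of the `StringIO` sample.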
esites ecatalogue record:
Total number of unique images: 226676
Number of unique duplicated images: 4259 (images on multiple occurrences)
Number of unique occurrences associated with duplicate images: 24683
Image with min dup count: 1000003, Count = 2
Image with max dup count: 1404075, Count = 1063
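Figures like these can be derived from a list of (occurrence, image) pairs. The following is a minimal sketch, assuming the pairs have already been extracted from the multimedia data; the function name, dictionary keys and sample identifiers are hypothetical.

```python
from collections import defaultdict


def duplicate_image_stats(pairs):
    """Summarise images that are attached to more than one occurrence."""
    occurrences_by_image = defaultdict(set)
    for occurrence_id, image_id in pairs:
        occurrences_by_image[image_id].add(occurrence_id)

    # An image counts as duplicated when it appears on two or more occurrences.
    duplicated = {
        image: occs for image, occs in occurrences_by_image.items() if len(occs) > 1
    }
    affected = set().union(*duplicated.values())
    by_count = sorted(duplicated.items(), key=lambda item: len(item[1]))
    return {
        "unique_images": len(occurrences_by_image),
        "duplicated_images": len(duplicated),
        "occurrences_with_duplicated_images": len(affected),
        "min_dup": (by_count[0][0], len(by_count[0][1])) if by_count else None,
        "max_dup": (by_count[-1][0], len(by_count[-1][1])) if by_count else None,
    }


# Hypothetical sample pairs of (occurrence ID, image ID).
pairs = [
    ("occ1", "imgA"), ("occ2", "imgA"), ("occ3", "imgA"),
    ("occ4", "imgB"),
    ("occ5", "imgC"), ("occ6", "imgC"),
]
stats = duplicate_image_stats(pairs)
print(stats)
```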
17/04/2024
Images
Image URLs can be valid even when no image of the specified width exists. This is because the URL points to PHP code that takes the record key and retrieves the image if it exists. This will be an ongoing problem, as an empty 'text' image file is created in image service each time. We need to be able to delete these.
Databox
To Do
18/06/2024
Ingested all the records (without images) on databox and notified the data provider.
AM has migrated to EMu version 6.3. Previously, a job that we wrote ran at their end; it created extracts and SFTP'd them to the upload server.
A new job needs to be written. Investigate the current Jenkins job and files (e.g. look at the raw files on the upload server), and we will work with AM to build a new extract and delivery/ETL process.