Develop script to transfer AU data to Specify

NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team

Apache License 2.0


Develop script to transfer AU data to Specify #516

Open PipBrewer opened 4 months ago

PipBrewer commented 4 months ago

Aarhus University Herbarium currently has data that is equivalent to NHMD's digi app exports (with some differences). A GREL script needs to be written to transform this data so that it can be imported into Specify.

beckerah commented 1 month ago

We've now decided to put together a Python script for this instead of creating a new GREL script for OpenRefine. We just got our first data dump from AU yesterday. The next step will be to go through the data and decide which fields need to be imported into Specify.

beckerah commented 1 month ago

Steps for script will be:

Loop through every csv file in a folder whose filename ends in _checked or _checked_corrected
Store the filename as a variable
Modify filename to end in _processed.tsv
Add in appropriate columns, some adding values for all rows, others adding values based on values in other columns (see GREL steps)
Rename columns
Reorder columns
Spit out tsv with correct name
Add filename to log
Move processed tsv to correct Ready for Specify folder
Move original csv to correct Archive
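
A minimal Python sketch of the steps above, using only the standard library. The folder arguments, the column names in CONSTANTS, RENAMES and COLUMN_ORDER, and the choice to append _processed to the original filename are all placeholder assumptions; the real values would come from the GREL steps.

```python
import csv
import shutil
from pathlib import Path

# All names below are illustrative assumptions, not the project's real ones.
SUFFIXES = ("_checked", "_checked_corrected")
CONSTANTS = {"preptypename": "Sheet"}             # value added to every row
RENAMES = {"barcode": "catalognumber"}            # old name -> new name
COLUMN_ORDER = ["catalognumber", "preptypename"]  # listed first, rest follow


def transform(row):
    """Rename, add and reorder the columns of one row (a dict)."""
    row = {RENAMES.get(key, key): value for key, value in row.items()}
    row.update(CONSTANTS)
    ordered = {key: row[key] for key in COLUMN_ORDER if key in row}
    ordered.update(row)  # remaining columns keep their original order
    return ordered


def process_folder(inbox, ready_dir, archive_dir, log_path):
    """Convert each matching csv in `inbox` to tsv, log it, then file both away."""
    inbox, ready_dir, archive_dir = Path(inbox), Path(ready_dir), Path(archive_dir)
    processed = []
    for csv_path in sorted(inbox.glob("*.csv")):
        if not csv_path.stem.endswith(SUFFIXES):
            continue  # not a checked file, leave it alone
        out_name = csv_path.stem + "_processed.tsv"
        with csv_path.open(newline="", encoding="utf-8") as src:
            rows = [transform(row) for row in csv.DictReader(src)]
        fieldnames = list(rows[0]) if rows else []
        out_path = inbox / out_name
        with out_path.open("w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(out_name + "\n")
        shutil.move(str(out_path), str(ready_dir / out_name))
        shutil.move(str(csv_path), str(archive_dir / csv_path.name))
        processed.append(out_name)
    return processed
```

Columns that depend on values in other columns would go into transform() as extra logic; that part is left out here because it depends on the GREL steps.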

beckerah commented 3 days ago

The following data needs to be pulled from the species-web db:

From table Specimen:

id
barcode
guid
digitiser
date_asset_taken (catalogued date)
folder id (to match with Folder Versions table)

From table Folder Versions:

id (to match with Specimen table)
folder id
area (broad geographic region)
family
genus
species
variety
subspecies
highest classification (actually lowest classification)
gbif match json (includes full taxonomy and author info; will be null if nothing was found and there should be a row with duplicate folder id containing correct info)

I've written a SQL query to pull this data and join it (currently a left join where folder versions is left). I need to add barcodes to this when we update the db.
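
For reference, the shape of that join can be sketched in Python with an in-memory SQLite database. This is not the actual query: the table names (specimen, folder_versions), the underscore column names, the demo rows, and the join key (specimen.folder_id matched to folder_versions.id) are assumptions based on the field lists above, and the real species-web db may not be SQLite at all.

```python
import sqlite3

# Left join with folder_versions on the left, so folder versions that have
# no specimen yet still appear (their specimen columns come back as NULL).
QUERY = """
SELECT fv.id AS folder_version_id, fv.folder_id, fv.area, fv.gbif_match_json,
       s.id AS specimen_id, s.barcode, s.guid, s.digitiser, s.date_asset_taken
FROM folder_versions AS fv
LEFT JOIN specimen AS s ON s.folder_id = fv.id  -- join key is an assumption
ORDER BY fv.id, s.id
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE folder_versions (
            id INTEGER PRIMARY KEY, folder_id INTEGER,
            area TEXT, gbif_match_json TEXT);
        CREATE TABLE specimen (
            id INTEGER PRIMARY KEY, barcode TEXT, guid TEXT,
            digitiser TEXT, date_asset_taken TEXT, folder_id INTEGER);
    """)
    conn.executemany(
        "INSERT INTO folder_versions VALUES (?, ?, ?, ?)",
        [(1, 10, "Europe", '{"taxonID": "gbif:2970468"}'),
         (2, 11, "Asia", None)],
    )
    conn.execute(
        "INSERT INTO specimen VALUES (?, ?, ?, ?, ?, ?)",
        (100, "AU0001", "guid-1", "Birgitte", "2024-01-01", 1),
    )
    return conn


def fetch_joined(conn):
    """Run the join and return each result row as a plain dict."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(QUERY)]
```

In the demo data the second folder version has no specimen, so its barcode and other specimen columns come back as None.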

beckerah commented 3 days ago

Additional information that needs to be added:

projectnumber: DaSSCo
publish: True
storedunder: True
preptypename: Sheet
count: 1
collection: ???
datafile_remark: [name of db export]? Possibly by date? Is this useful?
datafile_source: DaSSCo data file
datafile_date: [date of digitization]
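
As a rough sketch, the fixed values above could be applied to each row like this. The function name and its parameters are hypothetical, the values are stored as strings on the assumption that they end up in a tsv, and collection is deliberately left out because it has not been decided yet. Using the export name for datafile_remark is only one of the options raised above.

```python
# Values that are the same for every row (taken from the list above).
CONSTANT_FIELDS = {
    "projectnumber": "DaSSCo",
    "publish": "True",
    "storedunder": "True",
    "preptypename": "Sheet",
    "count": "1",
    "datafile_source": "DaSSCo data file",
}


def add_fixed_fields(row, export_name, digitization_date):
    """Return a copy of `row` with the constant and per-export fields added."""
    enriched = dict(row)
    enriched.update(CONSTANT_FIELDS)
    enriched["datafile_remark"] = export_name       # tentative, see question above
    enriched["datafile_date"] = digitization_date   # date of digitization
    return enriched
```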

❓ Questions:

At this point, there doesn't seem to be a need for fields like qualifier/addendum or remarks. (Accurate statement?)
What is the collection?
Will there be hybrids? (These would need to be handled slightly differently because GBIF does not keep hybrids in its backbone, so the gbif_match_json will always be null for them.)
Do we need flags for new taxonomy, or would that be redundant since Birgitte is already checking everything in species-web?
The digitiser always shows as Birgitte right now. Should it always be Charlotte instead?

beckerah commented 3 days ago

After some additional data exploration, it looks like the family, genus, species, etc. fields do not always reflect the correct taxonomy. It is therefore safer to pull the taxonomy from the gbif_match_json field instead. The relevant info we can pull from this field:

taxonID (Do we want to retain taxonID (example: "gbif:2970468"), or separate taxon source ("gbif") and id ("2970468")?)
kingdom (needed?)
phylum (needed?)
order (needed?)
family
genus
species
variety
subspecies
scientificName (the full name, including author)
canonicalName (the full name, excluding author)
authorship (just the author by itself)
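
A possible way to pull these values out in Python, keeping both the full taxonID and its split parts so either option stays open. This assumes gbif_match_json holds a flat JSON object keyed by the field names listed above; the real structure has not been confirmed here, and the output keys taxon_source and taxon_number are hypothetical names.

```python
import json

# Keys copied from the list above; whether kingdom/phylum/order are needed
# is still an open question.
NAME_KEYS = (
    "kingdom", "phylum", "order", "family", "genus", "species",
    "variety", "subspecies", "scientificName", "canonicalName", "authorship",
)


def parse_gbif_match(raw):
    """Extract taxonomy fields from a gbif_match_json value.

    Returns None when the value is empty or null (no GBIF match was found).
    """
    if not raw:
        return None
    match = json.loads(raw)
    if not isinstance(match, dict):
        return None  # JSON null or an unexpected structure
    result = {key: match.get(key) for key in NAME_KEYS}
    taxon_id = match.get("taxonID")
    source, separator, number = (taxon_id or "").partition(":")
    result["taxonID"] = taxon_id
    result["taxon_source"] = source if separator else None
    result["taxon_number"] = number if separator else None
    return result
```

For example, a taxonID of "gbif:2970468" would give taxon_source "gbif" and taxon_number "2970468". Rows where the function returns None would need the fallback handling discussed for hybrids.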

beckerah commented 2 days ago

The highest classification field is also unreliable and therefore unusable. I'll have to pull the name from the scientificName field instead. (If there are hybrids, they may need to be handled differently, as they might not have a scientificName field.)