NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Develop script to transfer AU data to Specify #516

Open PipBrewer opened 4 months ago

PipBrewer commented 4 months ago

Aarhus University (AU) Herbarium currently has data that is equivalent to NHMD's digi app exports (with some differences). A GREL script needs to be written to transform this data so it can be imported into Specify.

beckerah commented 1 month ago

We've now decided to put together a Python script for this instead of creating a new GREL script for OpenRefine. We just got our first data dump from AU yesterday. The next step will be to go through the data and decide which fields need to be imported into Specify.

beckerah commented 1 month ago

Steps for script will be:

  1. Loop through every csv file in a folder whose name ends in _checked or _checked_corrected
  2. Store the filename as a variable
  3. Modify filename to end in _processed.tsv
  4. Add the appropriate columns: some take the same value for all rows, others take values derived from other columns (see GREL steps)
  5. Rename columns
  6. Reorder columns
  7. Spit out tsv with correct name
  8. Add filename to log
  9. Move processed tsv to correct Ready for Specify folder
  10. Move original csv to correct Archive
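
The steps above can be sketched roughly as follows. This is a minimal sketch, not the final script: the function names, the folder arguments, and the `rename`, `constants` and `order` parameters are all assumptions for illustration, and the derived columns from the GREL steps are not covered.

```python
import csv
import shutil
from pathlib import Path

# Filename endings (before the .csv extension) that mark a file as ready.
SUFFIXES = ("_checked", "_checked_corrected")


def find_input_files(folder):
    """Step 1: csv files whose name ends in _checked or _checked_corrected."""
    return sorted(p for p in Path(folder).glob("*.csv") if p.stem.endswith(SUFFIXES))


def processed_name(csv_path):
    """Steps 2-3: keep the original name and make it end in _processed.tsv."""
    return f"{Path(csv_path).stem}_processed.tsv"


def transform_row(row, rename, constants, order):
    """Steps 4-6: rename columns, add constant columns, reorder columns."""
    out = {rename.get(key, key): value for key, value in row.items()}
    out.update(constants)
    return {col: out.get(col, "") for col in order}


def process_folder(inbox, ready_dir, archive_dir, log_path, rename, constants, order):
    """Run steps 1-10 for every matching file; return the tsv names written."""
    ready_dir, archive_dir = Path(ready_dir), Path(archive_dir)
    ready_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for csv_path in find_input_files(inbox):
        tsv_path = csv_path.with_name(processed_name(csv_path))
        # Step 7: write the tab-separated output next to the input file.
        with open(csv_path, newline="", encoding="utf-8") as src, \
                open(tsv_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.DictWriter(dst, fieldnames=order, delimiter="\t")
            writer.writeheader()
            for row in csv.DictReader(src):
                writer.writerow(transform_row(row, rename, constants, order))
        # Step 8: append the original filename to the log.
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(csv_path.name + "\n")
        # Steps 9-10: move the tsv to "ready" and the original csv to the archive.
        shutil.move(str(tsv_path), str(ready_dir / tsv_path.name))
        shutil.move(str(csv_path), str(archive_dir / csv_path.name))
        done.append(tsv_path.name)
    return done
```

Writing the tsv next to the input and moving it afterwards mirrors steps 7 and 9 as listed; writing straight into the Ready for Specify folder would work just as well.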

beckerah commented 3 days ago

The following data needs to be pulled from the species-web db:

From table Specimen:

From table Folder Versions:

I've written a SQL query to pull this data and join the two tables (currently a left join with Folder Versions as the left table). I need to add barcodes to the query once we update the db.
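
To illustrate the shape of that join, here is a small self-contained sketch using an in-memory SQLite database. It is not the actual query: the table names `folder_versions` and `specimen`, every column name, the join key, and the sample rows are hypothetical, and the species-web db may well run on a different database engine.

```python
import sqlite3

# Hypothetical query: folder_versions is the left table, so every folder
# version is kept even when no specimen row points at it.
QUERY = """
SELECT fv.id AS folder_version_id, fv.scientific_name, s.id AS specimen_id
FROM folder_versions AS fv
LEFT JOIN specimen AS s ON s.folder_version_id = fv.id
ORDER BY fv.id, s.id
"""


def demo_rows():
    """Build two tiny hypothetical tables in memory and run the left join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE folder_versions (id INTEGER PRIMARY KEY, scientific_name TEXT);
        CREATE TABLE specimen (id INTEGER PRIMARY KEY, folder_version_id INTEGER);
        INSERT INTO folder_versions VALUES (1, 'Poa annua'), (2, 'Poa alpina');
        INSERT INTO specimen VALUES (10, 1), (11, 1);
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

In this toy data, folder version 2 has no specimens, so it still appears in the result with a NULL (`None`) specimen id. That is the practical consequence of keeping Folder Versions on the left.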

beckerah commented 3 days ago

Additional information that needs to be added:

projectnumber: DaSSCo
publish: True
storedunder: True
preptypename: Sheet
count: 1
collection: ???
datafile_remark: [name of db export]? Possibly by date? Is this useful?
datafile_source: DaSSCo data file
datafile_date: [date of digitization]
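
One way to apply the fixed values is a small defaults mapping merged into each row. This is only a sketch: `DEFAULTS` and `add_defaults` are hypothetical names, `collection` and `datafile_remark` are left out because they are still undecided, and the digitization date is passed in since its source is not settled here.

```python
# Values that are the same for every row, taken from the list above.
# collection and datafile_remark are omitted on purpose (still undecided).
DEFAULTS = {
    "projectnumber": "DaSSCo",
    "publish": True,
    "storedunder": True,
    "preptypename": "Sheet",
    "count": 1,
    "datafile_source": "DaSSCo data file",
}


def add_defaults(row, digitization_date):
    """Return a copy of row with the fixed columns and datafile_date added."""
    out = dict(row)
    out.update(DEFAULTS)
    out["datafile_date"] = digitization_date
    return out
```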

❓ Questions:

  1. At this point, there doesn't seem to be a need for fields like qualifier/addendum or remarks. (Accurate statement?)
  2. What is the collection?
  3. Will there be hybrids? (These would need to be handled slightly differently because GBIF does not keep hybrids in its backbone, so the gbif_match_json will always be null for them.)
  4. Do we need flags for new taxonomy, or would that be redundant since Birgitte is already checking everything in species-web?
  5. The digitiser always shows as Birgitte right now. Should it always be Charlotte instead?

beckerah commented 3 days ago

After some additional data exploration, it looks like the family, genus, species, etc. fields do not always reflect the correct taxonomy. It is therefore safer to pull the taxonomy from the gbif_match_json field instead. From this field, relevant info we can pull:

beckerah commented 2 days ago

The highest classification field is also unreliable and therefore not usable. I'll just have to pull the data from the scientificName field. (If there are hybrids, they will need to be handled differently, as there may not be a scientificName field for them.)
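
A minimal sketch of pulling scientificName out of gbif_match_json while tolerating the null case expected for hybrids. It assumes the field holds a JSON object stored as text with a top-level `scientificName` key; the function name is hypothetical, and what to do with the `None` results is still an open question.

```python
import json


def scientific_name_from_match(gbif_match_json):
    """Return scientificName from the stored match, or None if unavailable.

    None comes back for a missing value, an empty string, a JSON null,
    unparseable text, or a match without a scientificName key. Those rows
    (possibly hybrids) would need separate handling.
    """
    if gbif_match_json is None:
        return None
    text = gbif_match_json.strip()
    if not text:
        return None
    try:
        match = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(match, dict):
        return None
    return match.get("scientificName") or None
```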