NHMDenmark / DanSpecify

Important files regarding the Danish instance of the Specify database system for collections digitisation and management, plus placeholder for issue tracking. Guidelines, manuals and other kinds of documentations will be gathered on the wiki.

Import Herpetology Full DB #107

Open FedorSteeman opened 2 years ago

FedorSteeman commented 2 years ago

Currently, Herpetology data is kept in a FileMaker "database". In conjunction with the NHMD portal project, a preliminary import of Herpetology data was done last year, limited to those occurrences with images. This dataset was then published to GBIF here:

https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a

However, a static version of the Herpetology data had already been published via DanBIF here:

https://www.gbif.org/dataset/cb643105-2e6b-403d-a23b-2c8128d1f97c

The latter dataset already has 112 citations (!).

The aim is now to do a full import of all data and then switch the endpoint of the latter dataset on GBIF while retiring the former. Record-level GBIF occurrence IDs will need to be associated with the records already imported into Specify.

FedorSteeman commented 2 years ago

Found a workaround for adding the NHMD-prefix to the catalog number (see #101) and now everything's coming out fine on: https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a

@markscherz I've revived this issue and will proceed with data import soon, after which I can do the GBIF switch.

FedorSteeman commented 2 years ago

Before I can attempt the import, I need to get the taxonomic backbone straightened out.

As per a tip from @markscherz, the ITIS.gov system has received dumps from the Amphibian Species of the World database, including synonyms. This would take care of Amphibia, but we still need Reptilia.

markscherz commented 2 years ago

@FedorSteeman I have just reached out to Peter Uetz, who runs reptile-database.org. Their website has full database dumps up to 2014 (http://www.reptile-database.org/data/), but they should be able to give us a more recent one.

markscherz commented 2 years ago

@FedorSteeman Peter Uetz confirmed the CoL has the full current taxonomy, and should also include synonyms.

FedorSteeman commented 2 years ago

Thank you @markscherz ! I've downloaded and started working with the full ITIS.gov taxonomy data set that I installed in a local database. We could also pull Reptilia from there if not CoL.

FedorSteeman commented 2 years ago

I've managed to extract an importable taxonomy from the itis.gov data set. Restricted to valid names only, I arrive at 8015 species. I still need to verify whether I got everything here, but in the meantime, @markscherz could you glance through this and tell me whether it looks all right so far?

AMPH-Taxonomy v1.csv
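
A minimal sketch of the kind of "valid names only" extraction described above, run against a local database. The table and column names (`taxonomic_units`, `name_usage`, `rank_id`) and the rank codes are assumptions modelled loosely on the ITIS download and should be checked against the actual schema; the demo rows use placeholder names, not real taxa, and this is not necessarily the query that produced the attached file.

```python
import sqlite3

SPECIES_RANK_ID = 220  # assumed code for the species rank; check the real data


def valid_species(conn):
    """Return (tsn, complete_name) for valid names at species rank, sorted by name."""
    cursor = conn.execute(
        "SELECT tsn, complete_name FROM taxonomic_units "
        "WHERE name_usage = 'valid' AND rank_id = ? "
        "ORDER BY complete_name",
        (SPECIES_RANK_ID,),
    )
    return cursor.fetchall()


def demo_connection():
    """Build a tiny in-memory table with placeholder names (not real taxa)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxonomic_units ("
        "tsn INTEGER PRIMARY KEY, complete_name TEXT, "
        "name_usage TEXT, rank_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO taxonomic_units VALUES (?, ?, ?, ?)",
        [
            (1, "Exemplum", "valid", 180),  # 180: assumed genus rank code
            (2, "Exemplum alpha", "valid", 220),
            (3, "Exemplum obsoletum", "invalid", 220),
            (4, "Exemplum beta", "valid", 220),
        ],
    )
    return conn


if __name__ == "__main__":
    for tsn, name in valid_species(demo_connection()):
        print(tsn, name)
```

Counting the rows returned by such a query is one way to compare against the species totals reported by other sources.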

markscherz commented 2 years ago

@FedorSteeman Yes, it looks correct. Seems to be missing about 400 species, but this is probably because the ITIS.gov taxonomy is updated only irregularly. What will your strategy be for synonyms? For Reptilia, CoL will be better, as CoL gets a dump from reptiledatabase.

FedorSteeman commented 2 years ago

@markscherz

> Seems to be missing about 400 species, but this is probably because the ITIS.gov taxonomy is updated only irregularly.

If you could mention a few of the missing species I can try to verify whether they're truly missing, or my extraction failed.

> For Reptilia, CoL will be better, as CoL gets a dump from reptiledatabase.

I already imported the CoL reptilia last year (except synonyms), but only to family or genus level at most, I think.

> What will your strategy be for synonyms?

I will just have to do some hacking.

markscherz commented 2 years ago

@FedorSteeman I am not sure which species are missing; I can only go by the number of taxa in the two systems. Given how often the ITIS taxonomy is updated and the number of species described per year, I think the difference corresponds to ITIS being about two years out of date. So I expect species like Brookesia nana (2021) and maybe Calumma ratnasariae (2020) are missing?

FedorSteeman commented 2 years ago

@markscherz Thanks! Neither of these is to be found in the ITIS dataset I downloaded, so it looks like my extraction was complete.

I will try to import this taxonomic backbone for Amphibia, take a closer look at the Reptilia taxonomy already there and make a plan for adding synonyms. All will be done in the test db for now.

FedorSteeman commented 2 years ago

I've extracted 3809 synonyms for Amphibia. I plan to do a separate "Taxon only" import with these and then a quick manual database update to mark them as synonyms. See attached file: AMPH-Synonyms v1.csv

@markscherz Interestingly, all subspecies level synonyms have been synonymized with accepted taxa on species level only. Could you glance through this and see if this makes sense?
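
The "mark these as synonyms" update mentioned above could look roughly like the following sketch. It assumes a simplified `taxon` table with `TaxonID`, `FullName`, `IsAccepted` and `AcceptedID` columns; the real Specify schema has many more columns and constraints, so this is only an illustration to try on a test database, and the names in the demo are placeholders.

```python
import sqlite3


def mark_synonyms(conn, pairs):
    """Point each synonym at its accepted taxon.

    pairs: iterable of (synonym_full_name, accepted_full_name).
    Returns the pairs that could not be resolved, for manual review.
    Note: if a full name occurs more than once, only the first match is used.
    """
    unresolved = []
    for synonym_name, accepted_name in pairs:
        syn = conn.execute(
            "SELECT TaxonID FROM taxon WHERE FullName = ?", (synonym_name,)
        ).fetchone()
        acc = conn.execute(
            "SELECT TaxonID FROM taxon WHERE FullName = ? AND IsAccepted = 1",
            (accepted_name,),
        ).fetchone()
        if syn is None or acc is None or syn[0] == acc[0]:
            unresolved.append((synonym_name, accepted_name))
            continue
        conn.execute(
            "UPDATE taxon SET IsAccepted = 0, AcceptedID = ? WHERE TaxonID = ?",
            (acc[0], syn[0]),
        )
    conn.commit()
    return unresolved


def demo_taxon_table():
    """In-memory stand-in for the taxon table, with placeholder names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (TaxonID INTEGER PRIMARY KEY, FullName TEXT, "
        "IsAccepted INTEGER, AcceptedID INTEGER)"
    )
    conn.executemany(
        "INSERT INTO taxon VALUES (?, ?, ?, ?)",
        [(1, "Exemplum alpha", 1, None), (2, "Exemplum vetus", 1, None)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_taxon_table()
    print(mark_synonyms(conn, [("Exemplum vetus", "Exemplum alpha")]))
```

Returning the unresolved pairs instead of failing makes it easier to spot names that did not survive the "Taxon only" import.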

markscherz commented 2 years ago

@FedorSteeman Yeah, the ASW database does not list subspecies separately from synonyms, so this is to be expected. I think this is fine; we should be able to override the cases where we want the subspecies recognised. This does raise the question: what is the long-term plan for retaining an up-to-date taxonomic backbone? Will there be regular updates? How will we make sure that manual changes implemented are not completely overridden every time?

FedorSteeman commented 2 years ago

There isn't really a plan for regular updates. I'm just putting a decent taxonomic backbone in place for you to work with. We could update by repeating the process. Manually added taxa will not disappear in this process, but may require some manual cleanup in case duplicates appear after taxon import. Any other changes would need some way to be protected from being overridden; I'm not sure how yet, although it would certainly be possible.

The question remains how important it really is to keep Specify's taxonomic backbone precisely up-to-date with external sources. The purpose of the taxonomic backbone in Specify should be to ease data entry. It eases data entry if thousands of taxa are already present to pick from, rather than having to be added manually every time.

Once the data is published to GBIF, GBIF will overlay the correct taxonomy for every entry, based on its sources (mainly CoL). For every determination entry it will show the original taxonomy as well as the interpreted one. This way, collection objects will eventually be matched with the current standards in taxonomy.

markscherz commented 2 years ago

@FedorSteeman okay, thanks for the clarification. That sounds fine then. :)

FedorSteeman commented 2 years ago

I have updated the occurrence IDs in Specify to match those already published on GBIF. With the taxonomic backbone (excluding synonyms) in place, we are ready to import the full data set as currently recorded in FileMaker.

In the meantime, we will continue preparing Specify based on the input from the last workshop:
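
As an illustration of that matching step, here is a small sketch that assigns already published occurrence IDs to records. It assumes records can be matched on catalog number and that both sides use the same catalog number format; the field names and ID values are hypothetical.

```python
def assign_occurrence_ids(specify_records, published_ids):
    """Attach published occurrence IDs to records, matched on catalog number.

    specify_records: list of dicts with a 'catalog_number' key (assumed name).
    published_ids: dict mapping catalog number -> published occurrence ID.
    Returns (updated_records, unmatched_catalog_numbers).
    """
    updated, unmatched = [], []
    for record in specify_records:
        key = record["catalog_number"].strip()
        if key in published_ids:
            updated.append({**record, "occurrence_id": published_ids[key]})
        else:
            unmatched.append(key)
    return updated, unmatched


if __name__ == "__main__":
    records = [{"catalog_number": "R-1"}, {"catalog_number": "R-2"}]
    published = {"R-1": "urn:example:occ-1"}  # placeholder ID
    print(assign_occurrence_ids(records, published))
```

Keeping the unmatched catalog numbers in a separate list makes it clear which records would receive a new occurrence ID rather than a previously cited one.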

  1. Solve prepType aggregation issues
  2. Investigate the possibility of certain locality fields (name, lat/long) being upfront on the data entry form:
    • it may be an option to associate some fields directly with collection objects
    • Enable auto-assignment of cataloger (like for Mammalogy)
  3. Modify Data Entry Form for Herpetology:
    • Import data entry form from Mammalogy and adjust further
    • Clarify journal date vs cataloged date by adjusting captions if need be
    • Field "assign Date1" should be journal date
    • Field now called "journal date" should be re-captioned "cataloged date" being the date of "digitization" (Specify or not)
    • Add "Other identifiers" to coll. object form
    • Other Identifier "Institution" to become pick list
    • Remove "Collector Number" field
    • Add "Field Number" field
    • Collecting Info field "collectingTrip" to be captioned as "Expedition"
    • Add CollObjAttr "Size"

@AstridBVW You may want to start making adjustments to the data entry form (points under 3) in cooperation with Daniel, since @markscherz will probably be on parental leave soon.

I will be looking into the first two main points.

markscherz commented 2 years ago

@FedorSteeman and @AstridBVW Re: Modifying Data Entry Form for Herpetology, please await input from Daniel and me. We discussed today which fields we need, and he will soon forward you a list of changes. These need to be made to the mammal input form first; the form can then be migrated from mammals to herps and modified further for herps.

markscherz commented 2 years ago

@FedorSteeman @AstridBVW Feature request: Daniel and I would like it if there was some way for the data entry form to flag if an Alt. Catalog Number has been entered before, because apparently hundreds of entries have accidentally been duplicated before. Is there any way to implement this?

FedorSteeman commented 2 years ago

@markscherz I can check for duplicate alt catalog numbers before import, but an automatic check during data entry for already existing ones needs a change to the source code. It's doable, but will require a separate ticket to be created here just for that.
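
A pre-import check for duplicate alt catalog numbers could be sketched as follows. The column name `Alt Cat Number` and the sample values are hypothetical; the reported line numbers assume one record per line (no multi-line quoted fields).

```python
import csv
import io
from collections import defaultdict


def find_duplicates(rows, column):
    """Map each repeated non-blank value in `column` to the file lines it occurs on."""
    seen = defaultdict(list)
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(rows, start=2):
        value = (row.get(column) or "").strip()
        if value:  # blank values are not treated as duplicates of each other
            seen[value].append(line_no)
    return {value: lines for value, lines in seen.items() if len(lines) > 1}


SAMPLE = """Catalog Number,Alt Cat Number
1,A-100
2,A-200
3,A-100
4,
"""

if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE))
    print(find_duplicates(reader, "Alt Cat Number"))
```

This only covers duplicates within the file being imported; comparing against numbers already in the database would need an additional lookup.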

markscherz commented 2 years ago

@FedorSteeman we should do both; I will make a separate ticket

AstridBVW commented 2 years ago

The collection object entry forms for Mammalogy and Herpetology have been modified and imported into the NHMD database.

markscherz commented 1 year ago

Current status: We are cleaning some major number issues that arose during export of data from FileMaker (leading zeros were lost). Next steps:

  1. Mark to finish correcting reptile taxonomy
  2. Mark to run OpenRefine
  3. Submission to Specify Team
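
Regarding the lost leading zeros mentioned above, the following sketch shows one way to keep numbers as text when reading a CSV and to re-pad purely numeric values. The target width is an assumption that would have to come from the original numbering scheme; the column name is hypothetical.

```python
import csv
import io


def pad_number(value, width):
    """Left-pad a purely numeric string with zeros; leave anything else untouched."""
    value = value.strip()
    return value.zfill(width) if value.isdigit() else value


def read_as_text(csv_text):
    """Read CSV rows as dicts of strings; csv never converts '00042' to a number."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    rows = read_as_text("Catalog Number\n42\nR42\n")
    for row in rows:
        print(pad_number(row["Catalog Number"], 5))
```

Values that are already longer than the chosen width are left as they are, since `str.zfill` never truncates.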

FedorSteeman commented 2 weeks ago

Submission received and prepared for import leading to the current state here: SNM-HerpDatabase-WithFedor.csv

After discussion with @markscherz there are a couple more things that need to be tweaked (from #293):

For the 4162 records where Determiner Last Name 1 == Scherz and Determination Remarks 1 == (blank):

  1. Move the current "Taxa"1-relevant columns to a new isAccepted (= is preferred?) column set, thereby essentially noting the currently valid taxonomic names.
  2. Move the current "Taxa"2-relevant columns to replace the information in the "Taxa"1-relevant columns, so that these properly become the 'current determination'.

This is because in these cases there is not really a re-determination but rather synonymy. So these taxon updates should be imported as synonyms and not secondary determinations.
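
The column shuffle described above can be sketched as below. Only `Determiner Last Name 1`, `Determination Remarks 1`, `Full Name 1` and `Full Name 2` are taken from this thread; the `Accepted ...` column name is hypothetical, and the real data set likely has more "Taxa" columns that would need to be listed in `taxon_columns`.

```python
def is_synonymy_case(row):
    """True when the determiner is Scherz and the determination remarks are blank."""
    determiner = (row.get("Determiner Last Name 1") or "").strip()
    remarks = (row.get("Determination Remarks 1") or "").strip()
    return determiner == "Scherz" and not remarks


def restructure(row, taxon_columns=("Full Name",)):
    """Move the '... 1' values to 'Accepted ...' and the '... 2' values to '... 1'."""
    out = dict(row)
    if not is_synonymy_case(row):
        return out
    for base in taxon_columns:
        out[f"Accepted {base}"] = row.get(f"{base} 1") or ""  # hypothetical column
        out[f"{base} 1"] = row.get(f"{base} 2") or ""
        out[f"{base} 2"] = ""
    return out


if __name__ == "__main__":
    example = {
        "Determiner Last Name 1": "Scherz",
        "Determination Remarks 1": "",
        "Full Name 1": "Name A",  # placeholder values
        "Full Name 2": "Name B",
    }
    print(restructure(example))
```

Whether a remarks field containing only whitespace counts as blank is a choice made here; a stricter or looser rule changes how many records match the filter.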

FedorSteeman commented 1 week ago

Project handed to @AstridBVW for further handling and import.

FedorSteeman commented 1 week ago

Also note that issue #247 is relevant here and should be addressed once the import has finished.

FedorSteeman commented 1 week ago

The import is experiencing a bit of a setback since as of the current version of Specify7, Workbench does not yet support import of synonyms. See: https://github.com/specify/specify7/issues/3534

We will have to attempt to add the preferred names directly, i.e. using SQL, or at a later stage.

AstridBVW commented 1 week ago

@markscherz We have not heard anything back from Kansas about importing synonyms. Instead of waiting for them to figure this out, we will move forward with another solution that will hopefully work and spare you from doing this manually in the taxon tree. I have looked through the data and tried to map it to fields in Specify as preparation for the import (just in Excel, not in Specify). I have questions about some of the columns, and I think it would be easiest to meet and go through it, maybe sometime next week?

AstridBVW commented 1 week ago

@markscherz I am trying to filter out the synonym/Preferred taxon determinations in your dataset. I applied the same filters as in the instruction above (Determiner Last Name 1 == Scherz and Determination Remarks 1 == (blank)), but my result is 4218 records, not 4162. I also looked up the two species in the first row of the resulting records, and Full Name 2 is not a synonym of Full Name 1 (Leptopelis barbouri and Leptopelis grandiceps respectively). Did you apply a third filter to get the correct 4162 records?