Import Herpetology Full DB

FedorSteeman commented 2 years ago

Currently, Herpetology data is kept in a FileMaker "database". In conjunction with the NHMD portal project, a preliminary import of Herpetology data was done last year, limited to those occurrences with images. This dataset was then published to GBIF here:

https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a

However, a static publishing of Herpetology data had been done before via DanBIF here:

https://www.gbif.org/dataset/cb643105-2e6b-403d-a23b-2c8128d1f97c

The latter data set has already got 112 citations (!).

The aim is now to do a full import of all data and then switch the endpoints of the latter dataset on GBIF while retiring the former. GBIF occurrence IDs on record level will need to be associated with those records already imported into Specify.

FedorSteeman commented 2 years ago

Found a workaround for adding the NHMD-prefix to the catalog number (see #101) and now everything's coming out fine on: https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a

@markscherz I've revived this issue and will proceed with data import soon, after which I can do the GBIF switch.

FedorSteeman commented 2 years ago

Before I can attempt the import, I need to get the taxonomic backbone straightened out.

As per a tip from @markscherz, the ITIS.gov system received dumps from the Amphibian Species of the World database, including synonyms. This would take care of Amphibia, but we still need Reptilia.

markscherz commented 2 years ago

@FedorSteeman I have just reached out to Peter Uetz, who runs reptile-database.org. Their website has full database dumps until 2014 (http://www.reptile-database.org/data/), but they should be able to give us a more recent database.

markscherz commented 2 years ago

@FedorSteeman Peter Uetz confirmed the CoL has the full current taxonomy, and should also include synonyms.

FedorSteeman commented 2 years ago

Thank you @markscherz ! I've downloaded and started working with the full ITIS.gov taxonomy data set that I installed in a local database. We could also pull Reptilia from there if not CoL.

FedorSteeman commented 2 years ago

I've managed to extract an importable taxonomy from the itis.gov data set. Restricted to only valid names I come up to 8015 species. I still need to verify whether I got everything here, but in the meanwhile, @markscherz could you glance through this and tell me whether it looks all right so far?

AMPH-Taxonomy v1.csv

markscherz commented 2 years ago

@FedorSteeman Yes, it looks correct. Seems to be missing about 400 species, but this is probably because the ITIS.gov taxonomy is updated only irregularly. What will your strategy be for synonyms? For Reptilia, CoL will be better, as CoL gets a dump from reptiledatabase.

FedorSteeman commented 2 years ago

@markscherz

Seems to be missing about 400 species, but this is probably because the ITIS.gov taxonomy is updated only irregularly.

If you could mention a few of the missing species I can try to verify whether they're truly missing, or my extraction failed.

For Reptilia, CoL will be better, as CoL gets a dump from reptiledatabase.

I already imported the CoL reptilia last year (except synonyms), but only to family or genus level at most, I think.

What will your strategy be for synonyms?

I will just have to do some hacking.

markscherz commented 2 years ago

@FedorSteeman I am not sure which species are missing, I can only base it on the number of taxa in the two systems. But I think it will add up, according to the frequency of updates to the ITIS taxonomy and the number of species described per year, to be about two years out of date in terms of species number. So I expect species like Brookesia nana (2021) and maybe Calumma ratnasariae (2020) are missing?

FedorSteeman commented 2 years ago

@markscherz Thanks! Neither of these is to be found in the ITIS dataset I downloaded, so it looks like my extraction was complete.

I will try to import this taxonomic backbone for Amphibia, take a closer look at the Reptilia taxonomy already there and make a plan for adding synonyms. All will be done in the test db for now.

FedorSteeman commented 2 years ago

I've extracted 3809 synonyms for Amphibia. I plan to do a separate "Taxon only" import with them and then a quick manual database update to mark them as synonyms. See attached file: AMPH-Synonyms v1.csv

@markscherz Interestingly, all subspecies level synonyms have been synonymized with accepted taxa on species level only. Could you glance through this and see if this makes sense?

markscherz commented 2 years ago

@FedorSteeman Yeah, the ASW database does not list subspecies separately from synonyms, so this is to be expected. I think this is fine; we should be able to override the cases where we want the subspecies recognised. This does raise the question: what is the long-term plan for retaining an up-to-date taxonomic backbone? Will there be regular updates? How will we make sure that manual changes implemented are not completely overridden every time?

FedorSteeman commented 2 years ago

There isn't really a plan for regular updates. I'm just putting a decent taxonomic backbone in place for you to work with. We could update by repeating the process. Manually added taxa will not disappear in this process, but may require some man-handling in case duplicates appear after taxon import. Any other changes may need some way to prevent them from being overridden, and I'm not sure how yet, although it would certainly be possible.

The question remains how important it really is to keep Specify's tax backbone precisely up-to-date with external sources. The purpose of the taxonomic backbone in Specify should be to ease data entry. It eases data entry if thousands of taxa are already present to choose and pick from, rather than having to be manually added every time.

Once published to GBIF, it will overlay the correct taxonomy for every entry, based on its sources (mainly CoL). For every determination entry it will show the original taxonomy as well as the interpreted one. This way, collection objects will eventually be matched with the current standards in taxonomy.

markscherz commented 2 years ago

@FedorSteeman okay, thanks for the clarification. That sounds fine then. :)

FedorSteeman commented 2 years ago

I have updated the occurrence IDs in Specify to match those already published on GBIF. With the taxonomic backbone (ex synonyms) in place, we are ready for import of the full data set as currently recorded in FileMaker.

In the meantime, we will continue preparing Specify as per the input from last workshop, being:

1. Solve prepType aggregation issues
2. Investigate the possibility for certain locality fields (name, lat long) being upfront on the data entry form:
- it may be an option to associate some fields directly with collection objects
- Enable auto-assignment of cataloger (like for Mammalogy)
3. Modify Data Entry Form for Herpetology:
- Import data entry form from Mammalogy and adjust further
- Clarify journal date vs cataloged date by adjusting captions if need be
- Field "assign Date1" should be journal date
- Field now called "journal date" should be re-captioned "cataloged date" being the date of "digitization" (Specify or not)
- Add "Other identifiers" to coll. object form
- Other Identifier "Institution" to become pick list
- Remove "Collector Number" field
- Add "Field Number" field
- Collecting Info field "collectingTrip" to be captioned as "Expedition"
- Add CollObjAttr "Size"

@AstridBVW You may want to start making adjustments to the D.E.F. (points under 3) in cooperation with Daniel, since @markscherz probably will be on parental leave soon.

I will be looking into the first two main points.

markscherz commented 2 years ago

@FedorSteeman and @AstridBVW Re: Modifying Data Entry Form for herpetology, please await input from Daniel and me. We have today discussed fields we need, and he will soon forward you a list of changes that need to be made first to the mammal input form, migrated from mammals to herps, and then modified for herps.

markscherz commented 2 years ago

@FedorSteeman @AstridBVW Feature request: Daniel and I would like it if there was some way for the data entry form to flag if an Alt. Catalog Number has been entered before, because apparently hundreds of entries have accidentally been duplicated before. Is there any way to implement this?

FedorSteeman commented 2 years ago

@markscherz I can check for duplicate alt catalog numbers before import, but an automatic check during data entry for already existing ones needs a change to the source code. It's doable, but will require a separate ticket to be created here just for that.
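A minimal sketch of what the pre-import duplicate check could look like, assuming the export is a CSV with a column named `Alt Cat Number` (a hypothetical header; the real column name may differ). The sample rows are made up for illustration. It only reports values that occur more than once and changes nothing.

```python
import csv
import io
from collections import Counter

# Hypothetical column name; adjust to the actual header in the export.
ALT_CAT_COLUMN = "Alt Cat Number"


def find_duplicate_alt_numbers(rows, column=ALT_CAT_COLUMN):
    """Return {value: count} for non-blank values that occur more than once."""
    counts = Counter((row.get(column) or "").strip() for row in rows)
    return {value: n for value, n in counts.items() if value and n > 1}


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "Catalog Number,Alt Cat Number\n"
    "1,R-100\n"
    "2,R-101\n"
    "3,R-100\n"
    "4,\n"
    "5,\n"
)
duplicates = find_duplicate_alt_numbers(csv.DictReader(sample))
print(duplicates)
```

Blank values are ignored on purpose, so records without an alt catalog number are not reported as duplicates of each other.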

markscherz commented 2 years ago

@FedorSteeman we should do both; I will make a separate ticket

AstridBVW commented 2 years ago

The collection object entry forms for Mammalogy and Herpetology have been modified and imported to NHMD database.

markscherz commented 1 year ago

Current status: We are cleaning some major number issues that arose during export of data from FileMaker (leading zeros were lost). Next steps:

Mark to finish correcting reptile taxonomy
Mark to run OpenRefine
Submission to Specify Team
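The lost leading zeros could be restored along these lines. This is only a sketch, assuming the affected catalog numbers are purely numeric and were originally six digits wide; the width is an assumption to be checked against the FileMaker source, and non-numeric values are left untouched.

```python
def restore_leading_zeros(value, width=6):
    """Left-pad a purely numeric catalog number with zeros to `width` digits.

    `width` is an assumed original length, not a confirmed one.
    """
    text = str(value).strip()
    if text.isascii() and text.isdigit() and len(text) < width:
        return text.zfill(width)
    return text


print(restore_leading_zeros("79670"))
print(restore_leading_zeros(79670))
print(restore_leading_zeros("R12345"))
```

Values that are already wide enough, or that contain letters, pass through unchanged, so the function can be run over a whole column without touching prefixed numbers.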

FedorSteeman commented 3 months ago

Submission received and prepared for import leading to the current state here: SNM-HerpDatabase-WithFedor.csv

After discussion with @markscherz there are a couple more things that need to be tweaked (from #293):

For 4162 records where Determiner Last Name 1 == Scherz and Determination Remarks 1 == (blank):
- move current "Taxa"1-relevant columns to new isAccepted (= is preferred?) column-set (thereby essentially noting the currently valid taxonomic names)
- move current "Taxa"2-relevant columns to replace information in "Taxa"1-relevant columns (thereby properly making them the 'current determination').

This is because in these cases there is not really a re-determination but rather synonymy. So these taxon updates should be imported as synonyms and not secondary determinations.
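The column move described above could be sketched as follows. The filter columns `Determiner Last Name 1` and `Determination Remarks 1` come from the dataset; `Full Name 1` and `Full Name 2` stand in for the full set of "Taxa"-relevant columns, and `Preferred Full Name` is a hypothetical name for the new column. The sample rows use placeholder taxon names, not real data.

```python
def is_synonymy_row(row):
    """True when the row matches the two filter criteria."""
    determiner = (row.get("Determiner Last Name 1") or "").strip()
    remarks = (row.get("Determination Remarks 1") or "").strip()
    return determiner == "Scherz" and remarks == ""


def move_preferred_name(row, preferred_column="Preferred Full Name"):
    """Return a copy of the row; matching rows get their names shifted.

    Taxa 1 (the currently valid name) goes to the preferred column and
    Taxa 2 (the original name) becomes the current determination.
    """
    out = dict(row)
    if is_synonymy_row(row):
        out[preferred_column] = row.get("Full Name 1", "")
        out["Full Name 1"] = row.get("Full Name 2", "")
        out["Full Name 2"] = ""  # assumption: Taxa 2 is emptied after the move
    return out


# Placeholder rows, for illustration only.
rows = [
    {"Determiner Last Name 1": "Scherz", "Determination Remarks 1": "",
     "Full Name 1": "Aus bus", "Full Name 2": "Aus cus"},
    {"Determiner Last Name 1": "Scherz", "Determination Remarks 1": "re-identified",
     "Full Name 1": "Aus dus", "Full Name 2": "Aus eus"},
]
converted = [move_preferred_name(r) for r in rows]
matched = sum(is_synonymy_row(r) for r in rows)
print(matched, converted[0])
```

Whitespace-only remarks are treated as blank here, which is an assumption and can change how many records match the filter.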

FedorSteeman commented 2 months ago

Project handed to @AstridBVW for further handling and import.

FedorSteeman commented 2 months ago

Also note that issue #247 is of relevance and a post-condition here once import has been finished.

FedorSteeman commented 2 months ago

The import is experiencing a bit of a setback since as of the current version of Specify7, Workbench does not yet support import of synonyms. See: https://github.com/specify/specify7/issues/3534

We will have to attempt to add the preferred names directly, i.e. using SQL, or at a later stage.
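As a rough illustration of what adding the preferred names directly could involve, here is a sketch against a stand-in table in an in-memory SQLite database. The table is only loosely modelled on a taxon table: the column names `IsAccepted` and `AcceptedID` are assumptions that would need checking against the actual Specify schema, and the taxon names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table; NOT the real Specify schema.
conn.execute(
    """CREATE TABLE taxon (
           TaxonID INTEGER PRIMARY KEY,
           FullName TEXT,
           IsAccepted INTEGER DEFAULT 1,
           AcceptedID INTEGER
       )"""
)
conn.executemany(
    "INSERT INTO taxon (TaxonID, FullName) VALUES (?, ?)",
    [(1, "Aus bus"), (2, "Aus cus"), (3, "Aus dus")],
)


def link_synonyms(conn, pairs):
    """Point each synonym at its preferred taxon; return rows updated.

    `pairs` is an iterable of (synonym_full_name, preferred_full_name).
    Pairs whose preferred name is not in the table are skipped.
    """
    linked = 0
    for synonym, preferred in pairs:
        row = conn.execute(
            "SELECT TaxonID FROM taxon WHERE FullName = ?", (preferred,)
        ).fetchone()
        if row is None:
            continue
        cur = conn.execute(
            "UPDATE taxon SET IsAccepted = 0, AcceptedID = ? WHERE FullName = ?",
            (row[0], synonym),
        )
        linked += cur.rowcount
    return linked


linked = link_synonyms(conn, [("Aus cus", "Aus bus"), ("Aus eus", "Aus zus")])
print(linked)
```

Matching on full name alone would be ambiguous if the tree contains duplicate names, so anything like this should be tried in the test db first.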

AstridBVW commented 2 months ago

@markscherz We have not heard anything back from Kansas about importing synonyms. Instead of waiting for them to figure this out, we will move forward with another solution that hopefully will work and spare you from doing this manually in the taxon tree. I have looked through the data and tried to map it to fields in Specify as preparation for the import (just in Excel, not in Specify). I have questions about some of the columns, and I think it would be easiest to meet and go through it, maybe sometime next week?

AstridBVW commented 2 months ago

@markscherz I am trying to filter out the synonym/Preferred taxon determinations in your dataset. I apply the same filters as per the instruction above, Determiner Last Name 1 == Scherz and Determination Remarks 1 == (blank), but my result is 4218 records not 4162. And I looked up the two species in the first row of the resulting records, and Full Name 2 is not the synonym for Full Name 1 (Leptopelis barbouri and Leptopelis grandiceps respectively). Did you apply a third filter to get the correct 4162 records?

AstridBVW commented 1 month ago

I had a meeting with Mark 20240909 where we went through the questions I had regarding the data. I got input from Mark and since also some input from Daniel (about Cataloger/Agent), and I am continuing with the preparation of the data for import.

One of the things we talked about was the titles/job titles assigned to some of the agents in the dataset. Tina has been working on standardising this in Specify so I had her take a look at the data. Currently the data does not conform to the standards that she is trying to implement, and it is actually very messy. She (and I) would prefer it to be cleaner but it would take too much work, so it will have to be cleaned after import.

We also talked a lot about the taxonomy. We have looked into how to get the synonym/preferred taxon relationships imported to Specify. What we found is that it is not possible at the moment to import this to Specify through the workbench. Others have had this same problem, and we found this solution on the Specify discourse forum: https://discourse.specifysoftware.org/t/establishing-relationship-between-synonym-and-preferred-accepted-taxon-en-masse/1663

Fedor has tested it, and it works. Because the current taxon tree in Specify is very messy with a lot of duplicates, we have decided that the best solution is to start over from scratch with the Herpetology collection in Specify. This means exporting all the records currently in Specify (including their NHMD numbers and GUIDs to retain the link to the matching occurrences in GBIF) and then purging the collection. We will then reimport the records that were exported and import the data from FileMaker (excluding any duplicates that have already been imported previously, i.e. those with images). The taxon tree will be built based on the taxa associated with the imported records. As a final step, we will add the synonym/preferred taxon relationships to the taxon tree based on the method from the link above.
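The step of excluding records that were already imported could be sketched like this. It assumes both sources share a comparable catalog number (same prefix and zero-padding) under a column called `Catalog Number`, which is a hypothetical header; the sample values are made up.

```python
def split_new_and_existing(filemaker_rows, specify_catalog_numbers,
                           key="Catalog Number"):
    """Split FileMaker rows into (new_rows, already_in_specify)."""
    existing = {str(n).strip() for n in specify_catalog_numbers}
    new_rows, skipped = [], []
    for row in filemaker_rows:
        number = (row.get(key) or "").strip()
        (skipped if number in existing else new_rows).append(row)
    return new_rows, skipped


# Made-up sample data, for illustration only.
filemaker = [
    {"Catalog Number": "000001"},
    {"Catalog Number": "000002"},
    {"Catalog Number": "000003"},
]
exported_from_specify = ["000002"]

new_rows, skipped = split_new_and_existing(filemaker, exported_from_specify)
print(len(new_rows), len(skipped))
```

Keeping the skipped rows in a separate list rather than dropping them makes it possible to compare them against the exported Specify records before anything is purged.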

My hope is to resolve any issues before Mark leaves for fieldwork, and then hopefully go ahead with the export, purge and reimport/import while Mark is away. If any crucial issues pop up after Mark has left, we might decide to push it until he is back in January.

I have made a mock up of the mapping of the data fields to fields in Specify (file below). @markscherz , please take a look at this and see if you approve.

Specify_mapping.xlsx

markscherz commented 1 month ago

Specify_mapping_MDS.xlsx

Comments added here.

AstridBVW commented 1 month ago

@markscherz Thank you for the quick reply! Here are my answers to your questions:

Verbatim Reg Date: As far as I can see, this date has been used to populate the mandatory Cataloged Date field, but changed to the correct format. But if you want to retain the date in its original format, we can find a place for it in Specify.

Digitisation Date: For the Digitisation Date values that are a specific date and time, the Verbatim Reg Date is blank, and the Digitisation Date is instead used to populate the Cataloged Date. The timestamp is lost but we can find a field for it if you want to retain it in its current format. The rest just says "Pre-2015". Is this information important to you? The Verbatim Reg Date for these are all before 2015 (which is retained in the Cataloged Date).

DNAsample: Yes, this information is retained in PrepType 3. The full value is retained in the PrepDescription 3, and if there is a date it is also retained in Prep Date 3.
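The way Cataloged Date is populated, as described in the answers above, can be summarised in a small sketch. It assumes the dates are passed around as plain strings and that the literal text "Pre-2015" marks a missing digitisation timestamp; the function name and the sample values are made up for illustration, and no date parsing is attempted.

```python
def cataloged_date_source(verbatim_reg_date, digitisation_date):
    """Return (source_column, value) used for Cataloged Date.

    Verbatim Reg Date wins when present; otherwise a specific
    Digitisation Date is used; "Pre-2015" counts as no date.
    """
    reg = (verbatim_reg_date or "").strip()
    dig = (digitisation_date or "").strip()
    if reg:
        return ("Verbatim Reg Date", reg)
    if dig and dig.lower() != "pre-2015":
        return ("Digitisation Date", dig)
    return (None, "")


print(cataloged_date_source("1990-05-01", "Pre-2015"))
print(cataloged_date_source("", "2016-03-01 10:15"))
print(cataloged_date_source("", "Pre-2015"))
```

Returning the source column alongside the value keeps track of which records got their Cataloged Date from a digitisation timestamp rather than a registration date.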

markscherz commented 3 weeks ago

@AstridBVW Okay all sounds fine. Can just drop Verbatim Reg Date and Digitisation date, I think. Great about PrepType 3. Green light!

AstridBVW commented 3 weeks ago

Questions/info for @markscherz :

Cataloged Date: Cataloged Date is mandatory for all records. It has been populated with Verbatim Reg Date or Digitisation Date for most records. For 2855 records Cataloged Date is blank; what should we populate it with? For these records there is no info for Verbatim Reg Date or Digitisation Date (only pre-2015).

Cataloger/Agent: There is one record where Agent is blank, so I cannot populate the Cataloger columns, which are mandatory. What should I put instead? I could put you or Daniel.

Re-splitting Elevation: I found a hiccup regarding Elevation. The plan was to re-split the values in the Verbatim Elevation column into Elevation, MinElevation and MaxElevation depending on whether the value is a single integer or an interval. However, unfortunately there is no Elevation field in Specify; what you have now on your form is actually the minElevation field with the caption changed to “Elevation”. What should we do? We can change the caption back to “Min Elevation”, and then you will continue to use that field if you only have an integer for elevation, and if you have an interval, you can use both Min Elevation and Max Elevation (I will add maxElevation to your UI).

Count: The data was missing count columns for every PrepType column (they belong together like a pair). I have added these and set a default value of 1. If the number needs to be higher, you can always manually edit this for specific records after the import to Specify.

Taxonomy: We are going to build the new taxon tree in Specify based on the taxonomy in your dataset, hence it will include all the synonym/preferred relationships that you have added to the data. We will build the new tree first, and then import the specimen data. Since the new tree will have all the synonym/preferred relationships, the imported specimen records will show the correct associations in the Taxon and Preferred Taxon fields in the Determination table when the Taxon is a synonym. So there is no longer a need for the “determinations” you added to the dataset for all the preferred taxa. Thus, I am going to remove them from the dataset, and I would just like to confirm the details before going ahead with this:

As far as I understand, you added in the preferred taxons as determinations, and they are all placed as Determination 1. They can be filtered out by setting the Determiner Det1 as you, and Determination Remarks as blank. For all of these, I am going to overwrite the Determination 1 info (the preferred taxon) with the Determination 2 info (the synonym). Is there any of the determination info you added in with the preferred taxon “determinations” that should be saved? Or was the point of the info only to be able to filter these out and establish synonym/preferred relationships during import?

Also, we would like to import this while you are away. If we come across any issues where a decision has to be made, would it be ok for us to rely on Daniel to make the decision on your behalf? Or would you rather that we contact you?
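The elevation re-split mentioned above could look roughly like this. It is a sketch that assumes Verbatim Elevation holds either a bare integer or two integers separated by a hyphen or en dash, with no unit; anything else is returned unparsed so it can be reviewed by hand.

```python
import re

_SINGLE = re.compile(r"^\s*(\d+)\s*$")
_RANGE = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*$")


def split_elevation(verbatim):
    """Return (min_elevation, max_elevation) from a verbatim string.

    A single value fills only the minimum; an interval fills both.
    Unparseable input gives (None, None).
    """
    text = verbatim or ""
    match = _SINGLE.match(text)
    if match:
        return int(match.group(1)), None
    match = _RANGE.match(text)
    if match:
        low, high = sorted((int(match.group(1)), int(match.group(2))))
        return low, high
    return None, None


print(split_elevation("1200"))
print(split_elevation("800-1000"))
print(split_elevation("ca. 500 m"))
```

A single value goes to the minimum only, which matches using the Min Elevation field alone when there is no interval.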

markscherz commented 3 weeks ago

• Catalogue Date This is a bit problematic because the two things that are being mapped to that field are not the same; digitisation date is often not the same date as the catalogue date, and most specimens have a catalogue date that is different than the digitisation date. The only exceptions should be specimens that were either catalogued on paper and in the database at the same time, or those that were only ever catalogued digitally. From speaking with @FedorSteeman, it sounds like the catalogue date should reflect the date the specimen received its first catalogue number, so this calls into question all of those where there is a digitisation date but no original registration date, and this should be carefully considered. The places where both pieces of information are missing are actually just incomplete records, which we need to be able to find conveniently and directly. So, I would recommend giving them an unrealistic date (is 0000-00-00 possible?) so that we can easily find and fix them later. For example: Afrixalus clarkeorum paratype specimens 079670, 079671, 079672, 079673, 079674 are given in the database with no collection date and no catalogue/registration date. However, consulting the paper catalogue unambiguously gives the registration date as 22 April 1978 and the collection date as 13 April 1972.

So, we need to be able to find these conveniently in any case, so that we can solve the problems one by one. I am very open to suggestions.

• Cataloger/Agent Just put Daniel. I have not put anything into this database, so almost certainly it was done by Daniel.

• Re-splitting Elevation Fine

• Count Fine

• Taxonomy Your solution sounds like it will probably work. Here is a file that contains the 'genuine' re-determinations by me that should be retained with me as the re-determiner, because these are probably all not synonyms but re-identifications of the material: Genuine-Scherz-Determinations.xlsx

Yes, you can rely on Daniel to make decisions on my behalf while I am inaccessible. You can try to text/whatsapp me; I will send my local number as soon as I have one. I might have occasional signal

AstridBVW commented 2 weeks ago

@markscherz You will probably not see this for a while but here are my solutions to the problems you are listing for Cataloged Date:

Cataloged Date: OK, then I think we should include the Verbatim Reg Date and the Digitisation Date in the data that is imported. We will keep the Cataloged Date as it is now (matching either of the two dates if given), but I will find fields in Specify to map them to in case you need them at a later stage after the import (for queries and data cleaning). About the incomplete records, I don’t think it will work to use the Cataloged Date to tag these for fixing. But I will find another way to tag these so you can easily find them later; I have a couple of ideas for this.

markscherz commented 2 weeks ago

Sounds great, thanks Astrid.

NHMDenmark / DanSpecify

Import Herpetology Full DB #107