gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Get ids present in Bionomia but missing from MUSIT #66

Closed by rukayaj 2 years ago

rukayaj commented 2 years ago

Bjørn Petter has now requested this twice. I have sent him the latest results, but in case anyone else ever needs to do it, this is the process I use:

  1. Download the frictionless data people file, e.g. https://bionomia.net/dataset/7aa3a91c-eafe-44f5-adfb-d48fca1a3db5/users.csv.zip, available by hovering over the frictionless data button on https://bionomia.net/dataset/7aa3a91c-eafe-44f5-adfb-d48fca1a3db5
  2. Download the corresponding DwC archive from our IPT
  3. Open the DwC-A occurrence.txt and move recordedByID and identifiedByID into a single IDs column on a new spreadsheet
  4. Remove duplicates on IDs
  5. Replace | in the IDs column with \n so that each ID ends up on its own row (I do this in Notepad++)
  6. In the IDs column, change https:// to http:// (Bionomia has links in http:// for some reason), and /wiki/ to /entity/ (another difference in Bionomia links)
  7. Copy the IDs column into a new sheet in the users.csv from the frictionless data file, and remove duplicates again
  8. Make a new column next to the sameAs column and enter =VLOOKUP(F2,Sheet1!A:A,1,FALSE), where F2 is the sameAs cell and Sheet1 is the IDs column sheet
  9. Filter on #N/A to get all the missing IDs
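For anyone who would rather not do this by hand, the steps above could be sketched in Python roughly as follows. This is a minimal sketch, not a tested tool: the function names are made up for illustration, and it assumes occurrence.txt is tab-delimited with recordedByID and identifiedByID columns and that users.csv has a sameAs column. Unlike the spreadsheet steps, it applies the same normalisation to both sides before comparing, so the result does not depend on which side uses http:// or https://.

```python
import csv


def normalise(uri):
    """Apply the rewrites from step 6: https -> http and /wiki/ -> /entity/."""
    uri = uri.strip()
    uri = uri.replace("https://", "http://")
    uri = uri.replace("/wiki/", "/entity/")
    return uri


def split_ids(cell):
    """Split a pipe-separated cell (step 5) into normalised IDs, skipping blanks."""
    return [normalise(part) for part in cell.split("|") if part.strip()]


def musit_ids(occurrence_file):
    """Collect the unique IDs from both ID columns of occurrence.txt (steps 3-6)."""
    # Assumption: the DwC-A core file is tab-delimited and unquoted.
    reader = csv.DictReader(occurrence_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    ids = set()
    for row in reader:
        for column in ("recordedByID", "identifiedByID"):
            ids.update(split_ids(row.get(column) or ""))
    return ids


def missing_from_musit(users_file, known_ids):
    """Return the sameAs values from users.csv that are absent from known_ids (steps 8-9)."""
    reader = csv.DictReader(users_file)
    return [
        row["sameAs"]
        for row in reader
        if row.get("sameAs") and normalise(row["sameAs"]) not in known_ids
    ]


# Example usage with the two downloaded files (paths are placeholders):
#
#   with open("occurrence.txt", encoding="utf-8", newline="") as occ:
#       known = musit_ids(occ)
#   with open("users.csv", encoding="utf-8", newline="") as users:
#       for uri in missing_from_musit(users, known):
#           print(uri)
```

Using a set for the MUSIT IDs covers both "remove duplicates" steps, and the membership test plays the role of the VLOOKUP and #N/A filter.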

If he asks for it again let's either make a script for it or better yet make a PR to add it to Bionomia's source code as a new download option 👍

Note there is an attributions file download option in Bionomia (frictionless data), but it just lists attributions made in Bionomia; there's no guarantee they haven't already been added to MUSIT and GBIF as well.

dagendresen commented 2 years ago

> IDs column - change https:// to http:// (bionomia has links in http:// for some reason), and /wiki/ to /entity/ (another difference in Bionomia links)

Wikidata QIDs (concept URIs) are in http:// format, which is the format we should be using in recordedByID and identifiedByID. As an example, the full QID for NHMO is http://www.wikidata.org/entity/Q1840963 and for Johannes Lid the full QID is http://www.wikidata.org/entity/Q94522

rukayaj commented 2 years ago

Aha! Ok then I can change it from https to http when we publish the data on the IPT.
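A rough sketch of what that conversion could look like before publishing, assuming the IDs are handled one value at a time. The function name and the regular expression are illustrative only. It rewrites Wikidata item links to the http:// concept URI form shown above and leaves anything else (for example ORCID links) untouched.

```python
import re

# Matches Wikidata item links in either page (/wiki/) or concept (/entity/) form.
_WIKIDATA_ITEM = re.compile(
    r"^https?://(?:www\.)?wikidata\.org/(?:wiki|entity)/(Q[1-9]\d*)$"
)


def to_concept_uri(value):
    """Return the http:// concept URI for a Wikidata item link, else the value unchanged."""
    match = _WIKIDATA_ITEM.match(value.strip())
    if match is None:
        return value
    return "http://www.wikidata.org/entity/" + match.group(1)
```

For example, to_concept_uri("https://www.wikidata.org/wiki/Q94522") gives http://www.wikidata.org/entity/Q94522.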