gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Check the first dataset from NaturRestaurering #7

Closed rukayaj closed 3 years ago

rukayaj commented 3 years ago

https://ipt.gbif.no/manage/resource.do?r=occurence_small_projects

dagendresen commented 3 years ago

occurrenceID: I suggest adding the prefix urn:uuid: in front of the UUID, i.e. urn:uuid:[UUID]. It is also possible to use (rely on) the GBIF node's resolver, https://purl.org/gbifnorway/id/[UUID]. A bare UUID is also a good persistent identifier, but it becomes even more machine-readable when prefixed as a URN or a PURL. Which one you choose is a matter of taste, but the PURL works best right now, while the URN might (!!) work best in a future with a universal model for resolving URNs...
c31fc61e-ef26-11e9-9c88-891c2040a2f0 --> urn:uuid:c31fc61e-ef26-11e9-9c88-891c2040a2f0
c31fc61e-ef26-11e9-9c88-891c2040a2f0 --> https://purl.org/gbifnorway/id/c31fc61e-ef26-11e9-9c88-891c2040a2f0
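The prefixing suggested above could be scripted roughly as follows. This is only a minimal sketch using Python's standard `uuid` module; the function names `to_urn` and `to_purl` are made up for illustration, and the PURL base is the node resolver quoted in this comment.

```python
import uuid

# Resolver base quoted in the comment above.
PURL_BASE = "https://purl.org/gbifnorway/id/"


def to_urn(bare: str) -> str:
    """Return the UUID as a urn:uuid: URN; raises ValueError if it is malformed."""
    return uuid.UUID(bare).urn


def to_purl(bare: str) -> str:
    """Return the UUID as a PURL under the GBIF Norway resolver."""
    return PURL_BASE + str(uuid.UUID(bare))


if __name__ == "__main__":
    example = "c31fc61e-ef26-11e9-9c88-891c2040a2f0"
    print(to_urn(example))
    print(to_purl(example))
```

Parsing through `uuid.UUID` also validates the value, and it accepts input that already carries the `urn:uuid:` prefix, so running the conversion twice does not double the prefix.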

recordedBy + recordedByID: I recommend adding a machine-readable ID for the collector. That makes it easier for machines to give you citation credit in the new model for evaluating researchers (which is coming quickly). (Here, where all the species data were collected by you, I have added a hard-coded reference, but in datasets with several collectors it is wise to add it per record.)
recordedBy = Jørn Olav Løkken --> recordedByID = https://orcid.org/0000-0003-1024-0406

identifiedBy --> identifiedByID: Here, where several people have identified the species records, identifiedByID should preferably be added per record. When there are several people, multiple ORCIDs can be given pipe-separated.
Anders Often = ?
Kåre Arnstein Lye = https://orcid.org/0000-0003-0398-890X (?)
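A per-record identifiedByID could be derived from the names along these lines. This is a rough sketch, not how the IPT does it: the lookup table holds only the ORCIDs mentioned in this thread (one of them still marked "(?)"), the helper name `agent_ids` is hypothetical, and it assumes the names in identifiedBy are themselves pipe-separated. Names without a known ORCID are simply skipped.

```python
# Name -> ORCID lookup; entries are the ones quoted in this thread.
ORCID_BY_NAME = {
    "Jørn Olav Løkken": "https://orcid.org/0000-0003-1024-0406",
    "Kåre Arnstein Lye": "https://orcid.org/0000-0003-0398-890X",  # marked "(?)" above
}


def agent_ids(names: str, sep: str = "|") -> str:
    """Map a pipe-separated list of names to a pipe-separated list of ORCIDs.

    Names with no entry in ORCID_BY_NAME are left out of the result.
    """
    ids = []
    for name in names.split(sep):
        orcid = ORCID_BY_NAME.get(name.strip())
        if orcid:
            ids.append(orcid)
    return " | ".join(ids)


if __name__ == "__main__":
    print(agent_ids("Kåre Arnstein Lye | Jørn Olav Løkken"))
```

Note that skipping unknown names means the positions in identifiedBy and identifiedByID no longer line up, so for mixed cases it may be better to look up the missing ORCIDs first.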

verbatimLocality --> locality + locationID + country: I see you are using verbatimLocality, which is normally used for the locality description exactly as it appears on the label of a collection specimen or in the field notebook. When you enter what you consider to be the correct locality, I think I would recommend simply putting it in the "locality" field directly? For the locality to be machine-readable it is also very useful to give a machine-readable reference, e.g. from GeoNames. If no machine-readable locality description is included, it is wise to state at least the country (and preferably also municipality and county), because many different places share the same name :-) I know this can be derived from the georeferences, but a little more data feeds the data-quality routines, which gives higher confidence in the georeferences (which unfortunately often have some problems).

I did not find all the local places in GeoNames, and have entered the town/city here instead. It is possible to add local places to GeoNames yourself. I have added some of the places, and the machine-readable reference does not work properly yet (maybe in a few hours?).

Ås VGs. = https://sws.geonames.org/3162672/
Bergen rådhus = https://sws.geonames.org/12216987/
Gardermoen alle = https://sws.geonames.org/3150851/ (Jessheim)
Gardermoen næringspark = https://sws.geonames.org/3150851/ (Jessheim)
Hoveodden = https://sws.geonames.org/3151635/
Lysaker Møllefossen = https://sws.geonames.org/12216988/
Politihuset Trondheim = https://sws.geonames.org/12216989/
Verdal VGS = https://sws.geonames.org/12216986/
Vogellund = https://sws.geonames.org/8299525/ (Nesbru)
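The locality list above could be applied to the records with a simple lookup, sketched here in Python. The table is copied from this comment; the helper name `enrich` and the default country "Norway" are assumptions for illustration, and the exact spelling of the verbatimLocality values in the data file may differ from the keys used here.

```python
# verbatimLocality -> GeoNames URI, copied from the list in this comment.
# Some of the newly added GeoNames entries may not resolve yet.
GEONAMES = {
    "Ås VGs.": "https://sws.geonames.org/3162672/",
    "Bergen rådhus": "https://sws.geonames.org/12216987/",
    "Gardermoen alle": "https://sws.geonames.org/3150851/",  # Jessheim
    "Gardermoen næringspark": "https://sws.geonames.org/3150851/",  # Jessheim
    "Hoveodden": "https://sws.geonames.org/3151635/",
    "Lysaker Møllefossen": "https://sws.geonames.org/12216988/",
    "Politihuset Trondheim": "https://sws.geonames.org/12216989/",
    "Verdal VGS": "https://sws.geonames.org/12216986/",
    "Vogellund": "https://sws.geonames.org/8299525/",  # Nesbru
}


def enrich(record: dict, country: str = "Norway") -> dict:
    """Return a copy of the record with locality, locationID and country filled in.

    The country default is an assumption; existing locality/country values are kept.
    """
    out = dict(record)
    verbatim = out.get("verbatimLocality", "").strip()
    out.setdefault("locality", verbatim)
    out["locationID"] = GEONAMES.get(verbatim, "")
    out.setdefault("country", country)
    return out


if __name__ == "__main__":
    print(enrich({"verbatimLocality": "Verdal VGS"}))
```

Localities that are not in the table get an empty locationID, so they are easy to spot and look up afterwards.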

rukayaj commented 3 years ago

1) The habitat strings are in Norwegian. If it's easy for him to change them to English that would be great, but if there are many variations then I think it's more important to just publish the data.
2) It looks like it might be better as an event dataset, but maybe you and Dag already spoke to him about this in the meeting last week?
3) It might be nice if there was a bit more info in the metadata too about the data collection methods, if he knows how collection normally happens.

For example, a fuller description and methods. Feel free to describe the background of the project; that will make it easier for others to understand why the data were collected. The text can also mention where in Norway the data are from. You can also indicate the geographic area on the map in the metadata. I can also do this myself as soon as I see where the data were collected. The more metadata that can be added, the better for the understanding and quality of this dataset.

Should this have been registered as a separate dataset, or does it belong with the one from 2019?

dagendresen commented 3 years ago
  1. Agree that English metadata is much better, but also that we should just get it out and then maybe improve it with English translations later. (Maybe English translations could be a citizen science data annotation topic for the "Digital Specimen" :-))
  2. Jørn preferred to put more effort into the next upcoming datasets. This one was just data "lying around" that he wanted to get out without putting too much effort into it... So he preferred to keep this one as occurrence core and make the next ones event core... But maybe we want to assist in "upgrading" this dataset as well?
  3. Maybe more info on sampling methods could also be added later, when Jørn has more experience with how to write this after adding the datasets he cares more about...
rukayaj commented 3 years ago

Great, I think we can close this as it's now published: https://doi.org/10.15468/cuocad

dagendresen commented 3 years ago

I suggest adding locationID? Will have a look.

dagendresen commented 3 years ago

I added a translation from verbatimLocality to locationID -- BUT now it looks like the IPT also translates the values for the locality/verbatimLocality mapping...???!!

[Screenshots: 2021-02-26 at 10:44:53 and 10:45:57]
dagendresen commented 3 years ago

And publishing fails...? Error on the uniqueness of occurrenceIDs

Publishing version #1.4 of resource occurence_small_projects failed: Archive generation for resource occurence_small_projects failed: Can't validate DwC-A for resource occurence_small_projects. Each line must have a occurrenceID, and each occurrenceID must be unique (please note comparisons are case insensitive)
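The rule in this error message (every record needs an occurrenceID, and the IDs must be unique when compared case-insensitively) can be checked on a source file before publishing. Below is a minimal Python sketch of such a check; the function name `check_occurrence_ids` is made up, and this is not the IPT's own validation code.

```python
from collections import Counter


def check_occurrence_ids(ids):
    """Return (number of missing IDs, sorted list of duplicated IDs).

    IDs are compared case-insensitively after stripping whitespace,
    mirroring the rule stated in the IPT error message.
    """
    counts = Counter(i.strip().lower() for i in ids)
    missing = counts.pop("", 0)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    return missing, duplicates


if __name__ == "__main__":
    # "A1" and "a1" collide because the comparison ignores case.
    print(check_occurrence_ids(["A1", "a1", "b2", ""]))
```

Running the same check over the occurrenceID values of all mapped source files together would also catch the case where one file is mapped twice.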

dagendresen commented 3 years ago

Did we archive a copy of the raw data in Zenodo?

rukayaj commented 3 years ago

I didn't, but maybe Vidar did?

dagendresen commented 3 years ago

I don't have the raw file, can we "extract" it from the IPT?

rukayaj commented 3 years ago

[Uploading od_small_projects_241019.txt…]()

dagendresen commented 3 years ago

We might map the folder of the resources to a URL...? And then use this URL to fetch raw files...? E.g. https://ipt.gbif.no/resources/occurence_small_projects/sources/od_small_projects_241019.txt

[Screenshot: 2021-02-26 at 11:05:52]
dagendresen commented 3 years ago

Looks like vartax_2021 is the NEW file that Jørn added, and od_small_projects_241019 is an old data file. And the vartax_2021 file is mapped identically twice! Hence the duplicate occurrenceIDs....

[Screenshot: 2021-02-26 at 11:13:57]
rukayaj commented 3 years ago

vartax_2021.txt

I am hoping the next version of the IPT will include one of the patches which allows the download of files from the admin interface...

Hmm I wonder how it got published successfully then, if both files were mapped.

dagendresen commented 3 years ago

Leif Ryvarden = https://orcid.org/0000-0002-4670-7306
http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0102-33062019000100163&tlng=en
https://bionomia.net/help-others/0000-0002-4670-7306

dagendresen commented 3 years ago

It looks as if the data records I get when fetching the DwC-A do NOT include recordedByID nor identifiedByID nor locationID when these are only mapped in ONE of the two data files joined vertically...?

I hope it will work decently if I add recordedByID, identifiedByID and locationID to the data records fetched from the DwC-A... :-)
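Adding the missing columns to records fetched from the DwC-A could look roughly like the Python sketch below. It assumes the occurrence file is tab-delimited text with a header row and well-formed rows; the function name `add_constant_columns` is hypothetical, and a single hard-coded value per column only makes sense where it really applies to every record (as with the collector in the first file).

```python
import csv
import io


def add_constant_columns(tsv_text: str, constants: dict) -> str:
    """Add columns with a constant value to tab-delimited text.

    Existing non-empty values are kept; only missing or empty cells are filled.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = list(reader.fieldnames or [])
    fields += [c for c in constants if c not in fields]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col, value in constants.items():
            if not row.get(col):
                row[col] = value
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    sample = "occurrenceID\trecordedBy\nabc\tJørn Olav Løkken\n"
    print(add_constant_columns(sample, {"recordedByID": "https://orcid.org/0000-0003-1024-0406"}))
```

Per-record values such as identifiedByID or locationID would need a lookup per row rather than a constant.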

rukayaj commented 3 years ago

It looks like this DWCA https://ipt.gbif.no/archive.do?r=occurence_small_projects&v=1.5 does contain recordedByID. From what I can see, identifiedByID isn't a mapped column, and I don't see the locationID translation ?

dagendresen commented 3 years ago

Maybe it was just me thinking there would be more there -- and silly me, I believe it was actually me who deleted the locationID mapping just before generating the DwC-A :-) A few too many parallel tasks.

dagendresen commented 3 years ago

I added the "new" source data file at https://zenodo.org/record/4564455#.YDjwU11Kjzc and added the URI to the dataset EML metadata in the IPT.

dagendresen commented 3 years ago

One last thing:

In the "new" source data file I added an alternative "occurrenceID_urn" column with the "urn:uuid:" prefix. However, updating the occurrenceIDs later would likely BREAK the continuity of the occurrenceKey...? So I kept the mapping of the old bare UUIDs as occurrenceID.

Should we keep these or remap to the urn:uuid: prefixed ones?

One last bonus thing:

I located the GeoNames locationID for the localities in the previous (2019) od_small_projects_241019 file, but have not started doing the same for the localities in the new (2021) vartax_2021 file...

rukayaj commented 3 years ago

It was published so recently I wonder if it would matter so much, to add the "urn:uuid" prefix?

dagendresen commented 3 years ago

I guess the "urn:uuid:" prefix makes no difference in practice today... but in a bright future sometime maybe the world will care enough to be able to resolve urn:uuid: prefixed URNs... :-) In many ways I personally like them much more than the HTTP and HTTPS things. But I am fine with closing this issue now, unless you want to add some other things.

rukayaj commented 3 years ago

No I can't think of anything else, let's close it for now.

dagendresen commented 3 years ago

Republished with new data files from Jørn

Updated the datafiles in Zenodo, new link https://zenodo.org/record/4564455#.YECVF11Kjzc

Changes made by Jørn

I also found an error in the Latin name of one species in the VarTax file, which I have corrected: the Latin name of meadowsweet (mjødurt), Filipendula ulmaria, had become Filipendula vulgaris -> dropwort (knollmjødurt), which is a somewhat unfortunate error since the latter is red-listed. I am also attaching the original files so you can upload them to Zenodo.

rukayaj commented 3 years ago

Reopening because Jørn sent another follow up, and it looks like there are still records with Filipendula vulgaris which I think might be wrong?

rukayaj commented 3 years ago

Jørn asked me to publish this again and says everything is now OK. Apparently the remaining records with Filipendula vulgaris were correct.