gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Dataset from geco group (Eva) - Presence-absence of plant habitat specialists in 15 patches #87

Closed rukayaj closed 1 year ago

rukayaj commented 2 years ago

Eva is making a data paper 🥳 She wants to get started on the metadata, so I have made her an account on the IPT and she has started filling it in: https://ipt.gbif.no/resource/preview?r=geco-plant-habitat-specialists-15-patches

Thinking ahead towards publishing the data: one thing I noticed is that it is in crosstab format. We will need to upload it in a normal list format, with the sampling events in a separate sheet. So there needs to be one row for each occurrence, like this:

Occurrence file:

| occurrenceID | eventID | scientificName | occurrenceStatus |
| --- | --- | --- | --- |
| 1 | p1-2012 | Acinos arvensis | present |
| 2 | p1-2012 | Androsace septentrionalis | present |
| ... | ... | ... | ... |
| n - 1 | p15-2020 | Veronica spicata | absent |
| n | p15-2020 | Woodsia alpina | absent |

We might also add individualCount as 0 for the absence records. Actually, we usually just publish the presence occurrences and put a list of species in the metadata so that the absences can be inferred. But I've been wondering lately if that's really the best call now that more people have started publishing absence data on GBIF. I think I saw something about it on one of the GBIF GitHub issues. Thoughts @dagendresen @vidarbakken ?

Anyway, each of these occurrences would need to be related to an event via the eventID. So we would have a separate event file, looking something like this, with each collection (at a certain patch, in a certain year) as a separate event:

Event file:

| eventID | year | decimalLatitude | decimalLongitude | coordinateUncertaintyInMeters |
| --- | --- | --- | --- | --- |
| p1-2012 | 2012 | 1.111 | 2.222 | 100 |
| ... | ... | ... | ... | ... |
| p15-2020 | 2020 | 1.112 | 2.223 | 100 |

@evalieungh: I can do this data conversion for you, but maybe you have it in list format already for the data analysis? We usually use uuids for the ID columns but for this example I've kept it simple so it's easier to see how they relate to each other.
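The crosstab-to-long conversion described above can be sketched in a few lines of pandas (the thread mentions R tidyverse for this; here is an equivalent Python sketch, with all column names and values hypothetical):

```python
import pandas as pd

# Hypothetical crosstab: one row per species, one column per patch-year visit,
# with 1 = present and 0 = absent.
crosstab = pd.DataFrame({
    "scientificName": ["Acinos arvensis", "Woodsia alpina"],
    "p1-2012": [1, 0],
    "p15-2020": [0, 0],
})

# Melt to long format: one row per (species, event) combination.
occurrences = crosstab.melt(
    id_vars="scientificName", var_name="eventID", value_name="presence"
)
occurrences["occurrenceStatus"] = occurrences["presence"].map(
    {1: "present", 0: "absent"}
)
occurrences["occurrenceID"] = range(1, len(occurrences) + 1)
occurrences = occurrences[
    ["occurrenceID", "eventID", "scientificName", "occurrenceStatus"]
]
print(occurrences)
```

In tidyverse the same step would be a `pivot_longer` call; the point is just that each crosstab cell becomes its own occurrence row keyed by eventID.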

Do you know what day + month the observations were recorded as well? And who was doing the fieldwork each year?

evalieungh commented 2 years ago

I'll see if I can change the format using R tidyverse stuff. And I agree it would be nice to somehow mark that these are supposed to be true absences -- I've spent hours looking for the species so the absence record is as good as it gets I think.

It might be difficult to find the dates for each polygon, but months should be OK to add.

btw, @ eva might be confused by this. Maybe remove the tag, hehe.

vidarbakken commented 2 years ago

When I used absence data for some datasets from NMBU in 2019, we ran into some problems. The absence data was visible in Artskart, and at that time the number of individuals was not shown. This has now changed, and the number is visible in Artskart for each record. But it is still problematic if absence data is imported with 0, because few people will take notice of a zero observation. A point on the map will be interpreted as an observation of that species. We must check how absence data is treated in Artskart. The best would be if it were not visible in Artskart at all.

evalieungh commented 2 years ago

It's still missing the month, but I uploaded a new .csv in the list format. I'm not sure how the coordinate data should be added - the easiest may be to use the polygon center coordinates, but I would like to add at least a map showing the polygons (1-15) that were sampled.

dagendresen commented 2 years ago

Luke asked a similar question about crosstab transformation for Nansen Legacy datasets in our GitHub issue for planning the marine data workshop in Tromsø. I believe that already many use the tidyverse R tools for this purpose.

It is actually possible to enumerate, in the EML metadata (taxonomic scope), all the taxon names that would have been recorded if observed, and to write that taxa missing at an event can be inferred as absence points. Similar to how Anders Bryn listed mountain birch for a forest line dataset.

Screenshot 2022-01-26 at 19 23 32

It is also possible to publish a checklist dataset of the taxa looked for - and refer to this from the EML metadata - taxonomic scope - similar to how Kjell Bjørklund did for his Radiolaria datasets.

The actual geometry of the polygon for each site can be reported as well-known text (WKT).

rukayaj commented 2 years ago

> I'm not sure how the coordinate data should be added - the easiest may be to use the polygon center coordinates, but I would like to add at least a map showing the polygons (1-15) that were sampled.

If you have the WKT for the polygons we can use https://dwc.tdwg.org/terms/#dwc:footprintWKT. Edit: I just noticed that Dag mentioned this as well at the bottom of his reply 👍
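For reference, a dwc:footprintWKT value is just the polygon geometry serialised as well-known text. A minimal shapely sketch (coordinates are made up), with a validity check that catches self-intersections and similar topology problems:

```python
from shapely.geometry import Polygon

# Hypothetical patch polygon (made-up coordinates, lon/lat order).
patch = Polygon([(10.70, 59.88), (10.71, 59.88), (10.71, 59.89)])

# Check validity before publishing; invalid geometry is a common
# reason for WKT being rejected by validators.
assert patch.is_valid

# dwc:footprintWKT is simply the WKT serialisation of the geometry.
footprint_wkt = patch.wkt
print(footprint_wkt)
```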

evalieungh commented 2 years ago

OK, then we will have to state clearly that absences can be inferred for the included species, but that's all right. Is it OK to upload the polygon geometry as WKT together with additional information about the polygons (area, centroid coordinates, etc.), or should that be a separate file?

rukayaj commented 2 years ago

I think the WKT should be enough, and we won't need area, centroid etc. There's a list of available DwC location fields here: https://dwc.tdwg.org/terms/#location . We could add it to locationRemarks if you want?

evalieungh commented 2 years ago

I think all the additional variables could be extracted from the polygons themselves if the user has a GIS program and a DTM (terrain model), so it's not strictly necessary. I also plan on using a GitHub repository to store some additional resources for the paper, so I could add it there to be publicly available simultaneously with the main data.

rukayaj commented 2 years ago

@evalieungh is there anything that we can help with for this? :)

evalieungh commented 2 years ago

A WKT polygon file is now up in the IPT, so I just need to finish the text/metadata I think, and then it's ready for publishing. It would be good if you could check that the data format is correct as it is now, before publishing.

rukayaj commented 2 years ago

Ok cool, the WKT looks good! So I just want to double check that polygon id 1 = event p1-2012, p1-2019, p1-2020 etc? And do you have the names somewhere of the people who checked the polygons each year?

evalieungh commented 2 years ago

Yes, polygon 1 should correspond to p1, and so forth. The data collectors are listed in the metadata and described in the text boxes.

rukayaj commented 2 years ago

Ok super! They'll need to go into the dwc fields, but I will add them in quickly.

rukayaj commented 2 years ago

@evalieungh You have 2012 data but don't mention this year in the metadata. Is it supposed to be excluded from this dataset?

evalieungh commented 2 years ago

Oh, my mistake. All the 2012 is supposed to be 2009! In the beginning I wasn't sure which year the data were from, so I just used 2012 until I could check properly... Is it possible to do a "replace all" now, or should I delete that file and upload again with correct year?

rukayaj commented 2 years ago

Not a problem, I can do a replace all!

It's also complaining that the footprint WKT is invalid https://www.gbif.org/tools/data-validator/d84f33c7-42b9-4116-b353-5db1bee5a7d2. I'm not totally sure why. I'll try to fix this as well but might need @MichalTorma's help...

evalieungh commented 2 years ago

Hmm, I could also check in QGIS if there are any anomalies. I didn't make the polygons myself, and haven't checked the topology for overlaps or lines crossing each other etc.

Another thing we have to fix is that some species names have changed. My supervisor suggested the names be updated against the Nomenclature Database, and some of the species' names are outdated in my data file. I'm going through it in the metadata taxonomic coverage section, but the data file needs changing too :(

rukayaj commented 2 years ago

Hmm, I believe it shouldn't matter if we publish older names if the updates are straightforward (i.e. it's just a name change, not a split or whatever); GBIF should map the old names onto the new ones. We can try publishing to GBIF and see if it complains, what do you think? People often publish quite a few updates. The only things you can't change after publication are the record identifiers.

rukayaj commented 2 years ago

> All the 2012 is supposed to be 2009!

So just to confirm, you have then two visits to each patch in 2009? Because there is already 2009 data in there.

evalieungh commented 2 years ago

> All the 2012 is supposed to be 2009!

> So just to confirm, you have then two visits to each patch in 2009? Because there is already 2009 data in there.

Wait a minute, no, there was only one visit per patch in 2009. It may be a duplicate. Wait a bit and I'll double check! There was some field work in 2012 as well, which is why I confused the years, but the main field work which I have the data from should only be one visit per patch in 2009. (Plus my re-surveys in 2019 and 2020, ofc.)

evalieungh commented 2 years ago

OK, so from what I can see in the data_list file, for the first five species 2012 is given as the first year instead of 2009. I could not see any species where both 2009 and 2012 were present. So for some reason I failed to replace all the 2012 with 2009, but there don't seem to be any duplicates. It should be safe to do a search & replace of 2012 -> 2009 (but mind row/occurrence numbers 2009 and 2012!).
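A column-aware replace avoids the row/occurrence-number pitfall mentioned above: instead of a blind file-wide search & replace, only touch the columns where 2012 really is a year. A pandas sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical slice of the occurrence file: note that occurrenceID 2012
# must NOT be changed by the year fix.
df = pd.DataFrame({
    "occurrenceID": [2009, 2012],
    "eventID": ["p1-2012", "p2-2012"],
    "year": [2012, 2012],
})

# Replace the year only where it is actually a year.
df["year"] = df["year"].replace(2012, 2009)
df["eventID"] = df["eventID"].str.replace("-2012", "-2009", regex=False)
print(df)
```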

These are the species names that have changed:

  • Cotoneaster integerrimus --> C. scandinavicus
  • Erysimum strictum --> E. virgatum
  • Lappula myosotis --> L. squarrosa
  • Odontites vernus ssp. litoralis --> Odontites litoralis
  • Poa alpina var. alpina --> Poa alpina alpina
  • Rhamnus catharticus --> Rhamnus cathartica
  • Sorbus aria --> Aria edulis

I guess it's OK to leave them, then. The updated names are in the metadata and if GBIF fixes the rest that works well! Might be useful to have the old names as well since they correspond better to the names in the floras we used in the field.

dagendresen commented 2 years ago

> It's also complaining that the footprint wkt is invalid https://www.gbif.org/tools/data-validator/d84f33c7-42b9-4116-b353-5db1bee5a7d2.

I tested replacing "MULTIPOLYGON ((( ... )))" with "POLYGON (( ... ))" and now it validates fine: https://www.gbif.org/tools/data-validator/d84f33c7-42b9-4116-b353-5db1bee5a7d2
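This fix can also be scripted rather than done by string replacement. A shapely sketch (geometry made up) that unwraps a single-part MULTIPOLYGON into a plain POLYGON:

```python
from shapely import wkt

# A single-part multipolygon, like the one in the event file (made-up coords).
footprint = wkt.loads("MULTIPOLYGON (((0 0, 1 0, 1 1, 0 0)))")

# If the multipolygon contains exactly one polygon, unwrap it to a plain
# POLYGON so validators that choke on MULTIPOLYGON accept it.
if footprint.geom_type == "MultiPolygon" and len(footprint.geoms) == 1:
    footprint = footprint.geoms[0]

print(footprint.wkt)
```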

rukayaj commented 2 years ago

> I tested replacing "MULTIPOLYGON ((( ... )))" with "POLYGON (( ... ))" and now it validates fine: https://www.gbif.org/tools/data-validator/d84f33c7-42b9-4116-b353-5db1bee5a7d2

Yay! Thanks @dagendresen !

rukayaj commented 2 years ago

Ok, it's looking good now. It looks like you've finished with the metadata, so I'm going to go ahead and publish. We can always add some more to the metadata after publication. Then you can grab the EML RTF and use that as the basis for the data paper.

One thing I noticed:

> In 2009 and 2019, each polygon was also described in terms of NiN v.2 variables ([reference to the NiN 2 mapping guide?]). The following variables were recorded:
>
>   • Tree cover
>   • Shrub cover

If you have this data for each polygon/patch we can publish it as well!

evalieungh commented 2 years ago

The metadata text was not 100% ready, but I've edited it now and published a new minor version. The variables you mention are not necessarily very useful, so I think I'll rather just put them on the linked GH repo after I've looked at the numbers. Thanks for helping out so much with this, it took surprisingly long to finish all the details!

rukayaj commented 2 years ago

https://www.gbif.org/dataset/a99cf6c0-4eb2-476b-8414-a513f0925d86 🚀🚀🚀🚀

dagendresen commented 2 years ago

Screenshot 2022-02-23 at 14 00 44

Screenshot 2022-02-23 at 14 04 38
MichalTorma commented 2 years ago
Screenshot 2022-02-23 at 17 13 22

They did keep the roads though :)

dagendresen commented 2 years ago

> They did keep the roads though :)

It is not the "Open Street Map" for nothing ...

evalieungh commented 2 years ago

Looks like Malmøya and a few others are also missing. Also, on Gressholmen there are paths, not roads!

But anyways. I imported the EML metadata into ARPHA and I kind of regret going through this whole IPT thing. Some fields have been mixed up, the keywords and citations were not transferred, and there was no Abstract or Introduction section in the IPT, so it feels like I have to start all over again. It would have been much easier to fill it in straight in ARPHA!

rukayaj commented 2 years ago

Oh no!!! I'm very sorry to hear that. We should definitely give some feedback to ARPHA about their EML import. I just tried it too and you're right, the citations in particular are going to be really annoying to do. The abstract/description is there though? Or am I missing something?

I don't think this has been a waste of time though, for your data paper you would need the data to be published anyway, and data published in GBIF will have much further reach than just a spreadsheet in Zenodo (so you'll get more citations and exposure).

evalieungh commented 2 years ago

The IPT 'Description' was imported in two or three places, but not as an introduction. And apparently the ARPHA 'Abstract' has two sections, and the IPT text was just loaded into the first. So I got a bunch of errors when trying to 'validate' in ARPHA.

About the references, I use zotero so it would have been better to just import the relevant ones in ARPHA and skip them completely in IPT. The IPT thing is too manual and gives no option to rearrange them etc, so the bibliography in our published version is wrong.

rukayaj commented 2 years ago

> About the references, I use zotero so it would have been better to just import the relevant ones in ARPHA and skip them completely in IPT. The IPT thing is too manual and gives no option to rearrange them etc, so the bibliography in our published version is wrong.

I can edit the xml manually and quickly remove the bibliography if you think that would be better than what we have? The IPT should really just have a bibtex import or something, it's annoying to have to copy and paste into blocks...

evalieungh commented 2 years ago

It would help if the IPT at least had some way of sorting the boxes alphabetically, so the list doesn't get broken if a new reference is added. But yeah, importing from other tools would be much better! In our data set I think it's OK to leave it for now, even if the order is a bit random.

dagendresen commented 2 years ago

There is a roadmap, a budget line, and an IT developer employed for further upgrades to the IPT, and the GBIF Secretariat very much welcomes suggestions to the roadmap and feature requests. However, the plan is for new functionality to be ready starting from next year (2023). I am not sure where the roadmap is located... probably here somewhere: https://github.com/gbif/ipt

evalieungh commented 2 years ago

Hi again! Just got the data paper back from technical review and they point out some missing data. Could you guide me on how to add them? I can find the centroid coordinates in existing files, and the uncertainty measured as the longest distance from the centroid to the polygon boundary.

event.txt

"Because the data will be added as occurrence records in GBIF, the datasets should comply with GBIF's requirements and recommendations for both sampling events (https://www.gbif.org/data-quality-requirements-sampling-events) and occurrence records (https://www.gbif.org/data-quality-requirements-occurrences)." So they ask to add these Darwin Core fields:

Reason being that the "footprintWKT and footprintSRS entries are fine, but GBIF also wants a point, a point-radius uncertainty and a datum for the point. The point should be in the centre of each polygon, and the cUIM should be large enough to include the whole of the polygon. Because GBIF currently handles WKT data incorrectly (see https://www.datafix.com.au/BASHing/2021-11-17.html), it's the point, point-radius and datum that are needed for data users."
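The point and point-radius the reviewers ask for can be derived from the footprint geometry: the point is the polygon centroid, and coordinateUncertaintyInMeters is the distance from the centroid to the farthest point on the boundary. A shapely sketch, assuming the polygon is in a projected CRS with metre units (geometry made up):

```python
from shapely.geometry import Polygon

# Hypothetical patch polygon in a projected CRS (coordinates in metres).
patch = Polygon([(0, 0), (100, 0), (100, 60), (0, 60)])

centroid = patch.centroid
# coordinateUncertaintyInMeters: radius from the centroid that covers the
# whole polygon, i.e. the distance to the farthest boundary point.
uncertainty = centroid.hausdorff_distance(patch.exterior)
print(centroid.x, centroid.y, round(uncertainty, 2))
```

Note that `hausdorff_distance` is vertex-based, which is exact here because the farthest boundary points of a convex polygon are its corners.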

occurrence.txt

rukayaj commented 2 years ago

Oof! Ok, I didn't realise samplingProtocol https://dwc.tdwg.org/terms/#dwc:samplingProtocol was required.

samplingProtocol: The names of, references to, or descriptions of the methods or protocols used during an Event. Examples:

  • UV light trap
  • mist net
  • bottom trawl
  • ad hoc observation | point count
  • Penguins from space: faecal stains reveal the location of emperor penguin colonies (https://doi.org/10.1111/j.1466-8238.2009.00467.x)
  • Takats et al. 2001. Guidelines for Nocturnal Owl Monitoring in North America. Beaverhill Bird Observatory and Bird Studies Canada, Edmonton, Alberta. 32 pp. (http://www.bsc-eoc.org/download/Owl.pdf)

So I guess they are looking for something like you suggested. I can't really think what else to put. I see samplingSizeValue & samplingSizeUnit are required fields, I'm surprised they're not asking for those to be included as well.

samplingSizeUnit: The unit of measurement of the size (time duration, length, area, or volume) of a sample in a sampling event. Example: minute, hour, day, metre, square metre, cubic metre

samplingSizeValue: A numeric value for a measurement of the size (time duration, length, area, or volume) of a sample in a sampling event. Example: 5 for sampleSizeValue with metre for sampleSizeUnit.

I suppose that here they want us to put the polygon area and square metre. Do you have the polygon area as well as the centroid coords + uncertainty, by any chance?

"Please add authorship to the names in scientificName." -- does this mean e.g. L. for Linnaeus after 'his' species?

Yes, I think this is what they mean. GBIF does this automatically in its data interpretation, so I think we can cheat and just download the GBIF data and add it back into our occurrence file. Ditto for the kingdom and taxonRank. I'll have a look at doing that today.

evalieungh commented 2 years ago

Hmm, maybe it's best to add samplingSizeValue/Unit too, while we are at it. After some QGIS-fu to calculate the max distance from centroids to polygon edges (= coordinateUncertaintyInMeters), I have the following data, which should be easy to link back to the existing files:

patch,samplingSizeUnit,samplingSizeValue,polygoncode,xcoord,ycoord,coordinateUncertaintyInMeters
1,area_m2,811,35_1,596244.5216,6639701.617,455.1633153
2,area_m2,2301,35_2,596216.6074,6639757.417,446.0672937
3,area_m2,368,35_3,596228.0198,6639800.97,415.0334246
4,area_m2,1680,35_4,596409.1053,6639713.093,333.4101311
5,area_m2,273,35_5,596273.2295,6639675.247,451.0081179
6,area_m2,264,35_6,596337.9393,6639710.572,380.3104151
7,area_m2,477,35_7,596509.7124,6639985.691,501.0013657
8,area_m2,769,35_8,596495.648,6639904.596,439.1375363
9,area_m2,1663,35_9,596467.856,6639859.495,390.8717009
10,area_m2,1287,35_10,596443.5544,6639794.095,337.9232777
11,area_m2,639,35_11,596154.4757,6639655.047,554.8955866
12,area_m2,858,35_12,596159.8683,6639727.066,510.3619957
13,area_m2,1358,35_13,596556.8555,6639957.186,519.5745429
14,area_m2,1042,35_14,596288.0491,6639797.066,364.3970696
15,area_m2,2382,35_15,596401.0496,6639621.971,415.4127321

I'm not sure if samplingSizeUnit = area_m2 is the correct terminology, but the unit is at least area in square metres. The coordinates are in CRS EPSG:32632 (WGS 84 / UTM zone 32N, projected). Maybe it's good enough to just use "EPSG:32632", or should I find the WGS84 geographic coordinates?
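Converting the UTM centroids to the WGS84 decimal degrees that GBIF expects for decimalLatitude/decimalLongitude is a one-liner with pyproj; a sketch using patch 1 from the table above:

```python
from pyproj import Transformer

# The patch coordinates are in EPSG:32632 (WGS 84 / UTM zone 32N);
# GBIF wants decimalLatitude/decimalLongitude in plain WGS 84 (EPSG:4326).
transformer = Transformer.from_crs("EPSG:32632", "EPSG:4326", always_xy=True)

# Centroid of patch 1 (easting, northing) from the table above.
lon, lat = transformer.transform(596244.5216, 6639701.617)
print(round(lat, 5), round(lon, 5))
```

With `always_xy=True` both input and output are in (x, y) = (easting/longitude, northing/latitude) order, which avoids the classic axis-order mix-up.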

I've edited the scientific names in the metadata "Taxonomic coverage" section at least, and then you can fix that +kingdom+rank in the occurrence.txt @rukayaj ?

rukayaj commented 2 years ago

Great! Yes, I've added all of the following:

To the event file:

To the occurrence file:

I also updated the source file that's in there. I would guess it will take a little while to show changes on gbif.org because they have to do the coordinate conversion.

rukayaj commented 2 years ago

Damn it, I forgot samplingProtocol and countryCode! I'll add them in now. Anything else I forgot?

rukayaj commented 2 years ago

Ugh, it looks like we're going to need to convert the coordinates...

rukayaj commented 2 years ago

Ok I converted the coordinates. I think it looks ok now. https://www.gbif.org/occurrence/3496819332

evalieungh commented 2 years ago

Amazing, @rukayaj! Thank you so much for helping. Now I just need to update the manuscript in ARPHA and hopefully it will go through technical review without further comment.

evalieungh commented 2 years ago

Sorry, but one more detail! I failed to round off the numbers, so the coordinateUncertaintyInMeters has way too many decimals! Can you help round it off to the nearest meter (5 decimal places or something?)?

rukayaj commented 2 years ago

@evalieungh Yes no problem, I've done it now! GBIF actually does round coordinate uncertainty in meters to 2 decimal places, but I've changed it so we're publishing to 4 decimal places anyway:

Screenshot 2022-03-22 at 08 47 25
dagendresen commented 2 years ago

:-D maybe Eva means decimalLatitude and decimalLongitude to 5 decimals (which is approx one meter). coordinateUncertaintyInMeters with 4 decimals is a precision of uncertainty less than a millimeter...

MichalTorma commented 2 years ago

Here is a handy guide for coordinate precision (by xkcd of course 😄)

rukayaj commented 2 years ago

Feel free to reopen this one @evalieungh if there's still something we need to do here.

evalieungh commented 1 year ago

Hello again! I finally got the data paper reviewed, and it's accepted with major revisions (yay!), but there are some issues with the data:

  1. The reviewers want the absences added. I have a script for inferring absences and have uploaded a simplified data set with absences on the associated GH repo: https://github.com/evalieungh/gressholmen_data. They might accept this, but I agree it would be even better to publish the absences properly if possible.
  2. Some of the species names do not match between the manuscript table (based on ADB Artsnavnebasen names) and the IPT occurrence.txt data set (mapped to GBIF names, I think?). We already discussed this (see below), but maybe we should update the IPT names to match the ones in Artsnavnebasen. I have to look at all the names again and find out which is the correct one - will get back to this issue tomorrow...

> OK, so from what I can see in the data_list file, for the first five species 2012 is given as the first year instead of 2009. I could not see any species where both 2009 and 2012 were present. So for some reason I failed to replace all the 2012 with 2009, but there don't seem to be any duplicates. It should be safe to do a search & replace of 2012 -> 2009 (but mind row/occurrence numbers 2009 and 2012!).
>
> These are the species names that have changed:
>
>   • Cotoneaster integerrimus --> C. scandinavicus
>   • Erysimum strictum --> E. virgatum
>   • Lappula myosotis --> L. squarrosa
>   • Odontites vernus ssp. litoralis --> Odontites litoralis
>   • Poa alpina var. alpina --> Poa alpina alpina
>   • Rhamnus catharticus --> Rhamnus cathartica
>   • Sorbus aria --> Aria edulis
>
> I guess it's OK to leave them, then. The updated names are in the metadata and if GBIF fixes the rest that works well! Might be useful to have the old names as well since they correspond better to the names in the floras we used in the field.

evalieungh commented 1 year ago

About absence data (point 1), could a solution be to add a new file to the IPT data set with only the absences? We already discussed this at the top of this issue thread (e.g. https://github.com/gbif-norway/helpdesk/issues/87#issuecomment-1022478311), but maybe there are other ways. The status quo is that we just published the presences, and then I have an R script for anyone who wishes to infer the absences. So it's useless for someone who is e.g. downloading all absence data for a species on GBIF...
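The inference itself is mechanical once the species checklist and event list are fixed in the metadata: take the full cross of events × checklist and mark anything not in the presence file as absent. A pandas sketch (all data hypothetical; the actual script in the linked repo is in R):

```python
import pandas as pd

# Hypothetical slice of the published presences, plus the checklist and
# event list as given in the metadata.
presences = pd.DataFrame({
    "eventID": ["p1-2009", "p1-2009", "p2-2009"],
    "scientificName": ["Acinos arvensis", "Veronica spicata", "Acinos arvensis"],
})
checklist = ["Acinos arvensis", "Veronica spicata"]
events = ["p1-2009", "p2-2009"]

# Full cross of events x checklist; any combination missing from the
# presence file is an inferred absence.
full = pd.MultiIndex.from_product(
    [events, checklist], names=["eventID", "scientificName"]
).to_frame(index=False)
merged = full.merge(presences.assign(occurrenceStatus="present"), how="left")
merged["occurrenceStatus"] = merged["occurrenceStatus"].fillna("absent")
print(merged)
```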