gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

ANO data to GBIF #181

Open andersfi opened 1 month ago

andersfi commented 1 month ago

A request relayed from the Norwegian Environment Agency (MDIR) about the possibility of publishing the Area Representative Monitoring data (Arealrepresentativ naturovervåkning, ANO) on GBIF.

This is a very important dataset for both research and management, and we should prioritize helping to get it published.

The data are available for download in .gdb format from MDIR's homepage (they promise a stable URL and a stable data structure). The dataset is updated once a year. I think the mapping is fairly straightforward, but with some small issues (mainly related to the hierarchical sampling design and IDs). A suggestion is to facilitate this and speed up publishing by setting up a pipeline that maps the data from the .gdb database to a DwC-A and publishes it on GBIF.no's IPT.

A document describing the dataset and a tentative mapping is found here: https://docs.google.com/document/d/1ozhrI2xdN5dK0FgiQ-vBE-_NNEXaAGrLhJflKzUK9Dw/edit?usp=sharing (sorry, only in Norwegian; so far it has mainly been used for communication with MDIR).

MichalTorma commented 1 month ago

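So I had a look at the data and they seem to make sense. Inside the GDB file there are 6 tables:

  • ANO_SurveyPoint - point layer - individual points surveyed
  • ANO_Flate - polygon layer - encompassing ANO_SurveyPoint, but some plots have no ANO_SurveyPoint entry
  • ANO_Art - simple table - list of species of vascular plants found on the ANO_SurveyPoint. Connected using ParentGlobalID (ANO_Art) > GlobalID (ANO_SurveyPoint)
  • ANO_FremmedArt - same as ANO_Art but for alien species
  • ANO_ProblemArt - same as ANO_Art but for invasive species
  • ANO_Treslag - same as ANO_Art but for tree species

So if I understand it correctly, there should be a parent event for ANO_Flate, which would have child events for ANO_SurveyPoint, and each ANO_SurveyPoint event would have occurrences of species (species + alien species + invasive species + tree species). NiN mapping and other measurements would go in the EMoF extension.

Does that sound reasonable?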
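
To make the proposed structure concrete, here is a minimal Python sketch with made-up rows. It is only an illustration of the parent/child idea, not the actual pipeline: the `flate_id` column linking a survey point to its ANO_Flate, the `name` column and all sample values are assumptions, and the real column names would have to be checked in the .gdb.

```python
# Hypothetical sample rows; real column names must be checked against the .gdb.
flater = [{"GlobalID": "F1"}, {"GlobalID": "F2"}]  # F2 has no survey points
points = [
    {"GlobalID": "P1", "flate_id": "F1"},  # 'flate_id' is an assumed link column
    {"GlobalID": "P2", "flate_id": "F1"},
]
species = [
    {"GlobalID": "A1", "ParentGlobalID": "P1", "name": "Betula pubescens"},
    {"GlobalID": "A2", "ParentGlobalID": "P2", "name": "Vaccinium myrtillus"},
]


def build_events(flater, points):
    """Return event rows: one parent per ANO_Flate, one child per survey point."""
    events = [{"eventID": f["GlobalID"], "parentEventID": ""} for f in flater]
    events += [
        {"eventID": p["GlobalID"], "parentEventID": p["flate_id"]} for p in points
    ]
    return events


def build_occurrences(rows):
    """Return occurrence rows attached to their survey point event."""
    return [
        {
            "occurrenceID": r["GlobalID"],
            "eventID": r["ParentGlobalID"],
            "scientificName": r["name"],
        }
        for r in rows
    ]


events = build_events(flater, points)
occurrences = build_occurrences(species)
```

An ANO_Flate without any survey point (F2 here) still becomes a parent event with no children.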

MichalTorma commented 1 month ago

I have a question about the art_dekning column in the ANO_Art table. I understand that this should be the coverage on the plot, but some values are in the form 0.1 and sometimes there is a round number like 24. It seems to me that 0.1 should actually mean 10%.

EDIT: So I went a bit deeper, and I'm not sure if this is a mistake or not. There is an obvious bias towards 0.1 in particular, as you can see in the histogram (image: output2). It might mean that if there is only one specimen on the plot, surveyors choose 0.1% coverage.
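
For anyone who wants to check the spike without a plot, counting the values is enough. A minimal Python sketch: the list below is made up and only stands in for the art_dekning column, so the numbers say nothing about the real data.

```python
from collections import Counter

# Made-up cover values standing in for the art_dekning column.
art_dekning = [0.1, 0.1, 0.1, 1, 2, 5, 24, 0.1, 10, 1]

counts = Counter(art_dekning)

# Most common value first; a spike at one value shows up at the top.
top_value, top_count = counts.most_common(1)[0]
share = top_count / len(art_dekning)
print(f"most common value: {top_value} ({share:.0%} of records)")
```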

andersfi commented 1 month ago

I am not sure why there is this bias towards 0.1. It seems odd; I also interpreted this as coverage in %. Maybe it is as simple as those doing the mapping putting in 0.1 as a default value when the coverage is very small and close to 0? I think we need to get in touch with the data owner to clarify.

andersfi commented 1 month ago

> So I had a look at the data and they seem to make sense. Inside the GDB file there are 6 tables:
>
>   • ANO_SurveyPoint - point layer - individual points surveyed
>   • ANO_Flate - polygon layer - encompassing ANO_SurveyPoint, but some plots have no ANO_SurveyPoint entry
>   • ANO_Art - simple table - list of species of vascular plants found on the ANO_SurveyPoint. Connected using ParentGlobalID (ANO_Art) > GlobalID (ANO_SurveyPoint)
>   • ANO_FremmedArt - same as ANO_Art but for alien species
>   • ANO_ProblemArt - same as ANO_Art but for invasive species
>   • ANO_Treslag - same as ANO_Art but for tree species
>
> So if I understand it correctly, there should be a parent event for ANO_Flate, which would have child events for ANO_SurveyPoint, and each ANO_SurveyPoint event would have occurrences of species (species + alien species + invasive species + tree species). NiN mapping and other measurements would go in the EMoF extension.
>
> Does that sound reasonable?

Yes, this sounds reasonable. If I understand it right, there are different sampling methods for "species", "alien species", "tree species" and "NiN mapping". Accordingly, these sound like separate events? It should not be technically difficult to sort out; however, we lack a GUID for identifying these events. Maybe we should use a composite identifier for these events instead of adding a GUID? We need to discuss this with the data owner.
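
A composite identifier could be as simple as the survey point GlobalID plus a label for the sampling method. A minimal Python sketch under that assumption; the separator, the method labels and the function names are all hypothetical and would need to be agreed with the data owner.

```python
# Hypothetical method labels, one per sampling method mentioned above.
METHODS = ["species", "alien-species", "tree-species", "nin-mapping"]


def composite_event_id(point_global_id: str, method: str) -> str:
    """Build an event ID from a survey point GlobalID and a method label."""
    return f"{point_global_id}:{method}"


def method_events(point_global_id: str) -> list[dict]:
    """Return one child event per sampling method under a survey point event."""
    return [
        {
            "eventID": composite_event_id(point_global_id, method),
            "parentEventID": point_global_id,
            "samplingProtocol": method,
        }
        for method in METHODS
    ]


rows = method_events("P1")
```

Because the ID is derived rather than generated, rerunning the mapping after the yearly update would give the same event IDs, provided the GlobalID values themselves stay stable.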

MichalTorma commented 1 month ago

I mean, we don't really have any additional info for the events (except for the info we already have in ANO_SurveyPoint), and we can specify the sampling method at the record level instead of the event level. That would simplify the overall structure (and would look better on the dataset page after publication).

But we can do it like you said as well, of course :) We need to have a meeting with the data owner.

kjetpett commented 1 month ago

I remember asking Ole Einar about the 0.1 value for dekning, because it puzzled me when I imported the survey data: "0,1% stemmer. De setter den verdien når de finner typ ett individ av en liten art. Har fått innspill på at de ønsker å kunne sette 0,1 % kontra 1%. Derfor vil eldre data ha 1 % som laveste dekning." Translated: "0.1% is correct. They [the surveyors] use this value when they find a single specimen of a small species. They would rather use 0.1% than 1%. Because of this, older data will have 1% as the lowest value for dekning."

I discovered this when importing survey data from 2023, so I think this means they used 1% for 2019-2022 and 0.1% in 2023.
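
If that holds, the change in the lowest recordable value could be documented per measurement when mapping to EMoF. A minimal Python sketch, assuming the 2022/2023 boundary guessed above (it is not confirmed by the data owner); the function names, the "cover" measurement type and the sample row are hypothetical.

```python
def minimum_cover_percent(survey_year: int) -> float:
    """Lowest cover value recorded, assuming 1% through 2022 and 0.1% from 2023."""
    return 1.0 if survey_year <= 2022 else 0.1


def cover_emof(occurrence_id: str, survey_year: int, art_dekning: float) -> dict:
    """Build one measurement row for a cover value, noting the recording floor."""
    floor = minimum_cover_percent(survey_year)
    return {
        "occurrenceID": occurrence_id,
        "measurementType": "cover",
        "measurementValue": art_dekning,
        "measurementUnit": "%",
        "measurementRemarks": f"lowest recordable cover in {survey_year}: {floor}%",
    }


row = cover_emof("A1", 2021, 1)
```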

kjetpett commented 1 month ago

Correct, some ANO_Flate have no ANO_SurveyPoint. From what I remember, this is because the Flate polygons are randomly chosen and a few of them are in areas where no point can be surveyed (a lake, a very steep mountainside, etc.).

andersfi commented 1 month ago

Well, I am very happy with compromises and everything that makes life simpler. However, we need to be able to pin a taxonomic scope to the various events. For example, "invasive species" and "tree species" will have different taxonomic scopes, and I can't figure out how to document this except with the Humboldt extension?
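
A rough Python sketch of what one Humboldt extension row per event could look like, using the eco:targetTaxonomicScope term. The term name should be double-checked against the Humboldt Extension term list, and the scope strings, event IDs and helper name are placeholders, since the real scope of each ANO sampling method has to come from the data owner.

```python
# Placeholder scope labels; the real scopes must be confirmed with the data owner.
SCOPE_BY_METHOD = {
    "species": "Tracheophyta",
    "alien-species": "alien species (list to be confirmed)",
    "invasive-species": "invasive species (list to be confirmed)",
    "tree-species": "tree species (list to be confirmed)",
}


def humboldt_row(event_id: str, method: str) -> dict:
    """Return one Humboldt extension row stating the taxonomic scope of an event."""
    return {
        "eventID": event_id,
        "eco:targetTaxonomicScope": SCOPE_BY_METHOD[method],
    }


# Hypothetical event IDs, one event per sampling method under survey point P1.
humboldt_rows = [humboldt_row(f"P1:{m}", m) for m in SCOPE_BY_METHOD]
```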