BiologicalRecordsCentre / ABLE

Assessing ButterfLies in Europe project repository

GBIF exports for eBMS #714

Open DavidRoy opened 3 months ago

DavidRoy commented 3 months ago

Requirement to produce separate datasets per BMS region. Export in Darwin Core event format.

Follows on from #6.

@johnvanbreda can you define what is required here?

johnvanbreda commented 1 month ago

Please can I clarify how I should work out the list of files to generate (i.e. the list of regions)? On EBMS there is a list of schemes which are held in Drupal content. They each point to a location selected from the Countries 2016 list on Indicia. Countries in the European NUTS area can be used to find the child NUTS level 1 regions, then the child NUTS level 2 regions, and it is these level 2 regions that are available for choosing as a region on a user's profile. But for countries outside the NUTS area, there are no regions available to select.

So, for a country inside the NUTS area (e.g. Germany BMS) there are 38 regions, so would you like 38 datasets generated for Germany? Or just one dataset covering the entire scheme?

For a country outside the NUTS area (e.g. New Zealand BMS) would you like a single dataset? Or is there another way this can be broken down?

It obviously makes things simpler if it can be handled the same way globally, rather than having separate logic for NUTS vs non-NUTS areas, though it would be possible.
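
To make the two options concrete, here is a minimal sketch of how the list of datasets could be worked out from the country that a scheme points to. The `Location` class, the `export_regions` function and the placeholder region names are hypothetical and are not part of Indicia or the existing export code; the sketch only assumes the country, NUTS level 1, NUTS level 2 nesting described above.

```python
from dataclasses import dataclass, field


@dataclass
class Location:
    """Hypothetical stand-in for an indexed location (country or NUTS region)."""
    name: str
    children: list["Location"] = field(default_factory=list)


def export_regions(country: Location, per_region: bool) -> list[str]:
    """Return the names of the datasets to generate for one scheme's country."""
    # NUTS level 2 regions are the grandchildren of the country.
    nuts2 = [r.name for nuts1 in country.children for r in nuts1.children]
    if per_region and nuts2:
        return nuts2
    # Outside the NUTS area (or when one dataset per scheme is wanted),
    # fall back to the country itself.
    return [country.name]


# Placeholder hierarchy; the region names are illustrative, not real NUTS codes.
inside_nuts = Location("Germany", [
    Location("N1-A", [Location("N2-A1"), Location("N2-A2")]),
    Location("N1-B", [Location("N2-B1")]),
])
outside_nuts = Location("New Zealand")
```

With this shape the same function serves both cases: a country with no NUTS children always yields a single dataset, so no separate non-NUTS branch is needed in the calling code.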

DavidRoy commented 1 month ago

Countries is the first place to start, but could we work towards something that uses any spatial boundary to filter the dataset? There are some scenarios where we'd want to create a dataset for a NUTS1 region, or even a bespoke boundary (accepting that we'd need to load it and index records against it).

Could there be a filter to set an area from those layers we index?

DavidRoy commented 1 month ago

The first use case is Denmark as defined by the country layer.

johnvanbreda commented 1 month ago

The existing Darwin Core extraction script will allow you to define a filter, so yes, that could take any polygon or (preferably) indexed location ID as a filter. This will work well as it is for setting up individual exports one at a time, but I was wondering if we needed to automatically generate exports for a whole set of regions in one go. It sounds like I should work towards getting single exports set up first then we can consider any batch export after.

I've spent today working on updating the DwC extractor to support event data properly. Still a bit to do but all the principles are in place.

johnvanbreda commented 1 month ago

@DavidRoy the code is ready to extract event and occurrence data in Darwin Core Archive format, so I can set up a Denmark example. Would you like all EBMS data for Denmark, or should I limit it to certain surveys or apply any other filter to the extracted data?
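
As a rough illustration of the kind of filter being discussed (an indexed location plus an optional list of surveys), here is a minimal sketch. The `Record` class, its field names, the ID values and `filter_records` are all hypothetical; they are not the real extraction script's configuration or API, only a way to show how the two filters would combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical extracted record; field names are illustrative only."""
    record_id: int
    survey_id: int
    location_ids: frozenset  # IDs of the indexed locations the record falls in


def filter_records(records, location_id=None, survey_ids=None):
    """Keep records that match every filter that has been supplied."""
    kept = []
    for rec in records:
        # Geographic filter: the record must be indexed against the location.
        if location_id is not None and location_id not in rec.location_ids:
            continue
        # Survey filter: the record must belong to one of the listed surveys.
        if survey_ids is not None and rec.survey_id not in survey_ids:
            continue
        kept.append(rec)
    return kept


# Made-up IDs: 10 stands for a country location, 20 and 21 for other areas.
records = [
    Record(1, survey_id=1, location_ids=frozenset({10, 20})),
    Record(2, survey_id=2, location_ids=frozenset({10})),
    Record(3, survey_id=1, location_ids=frozenset({21})),
]
```

Leaving a filter as `None` means "do not restrict on this", so the same call covers both "all data for the country" and "only certain surveys for the country".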

DavidRoy commented 1 month ago

@johnvanbreda thanks. Four datasets please. 118.562 118.565 118.646 118.681

The latter two might not have any data for Denmark.

Perhaps we need a standard way of naming the species datasets to reflect website, survey, geographic filter?
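
One possible convention, sketched under the assumption that a name is simply the website, the geographic filter and the survey joined in a fixed order. The `dataset_name` function is hypothetical and is not part of the export code; the order of the parts is a suggestion only.

```python
def dataset_name(website: str, area: str, survey: str) -> str:
    """Join the non-empty parts in a fixed order: website, area, survey."""
    parts = [website.strip(), area.strip(), survey.strip()]
    return " ".join(p for p in parts if p)


# Example: a country-level export of the transect survey.
example = dataset_name("EBMS", "Denmark", "transects")
```

Keeping the order fixed means names sort by website first and area second, and an export with no geographic filter can pass an empty string for `area`.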

johnvanbreda commented 3 weeks ago

@DavidRoy the code is now in place to create DwC-archive files. A couple of questions:

  1. Are there any licences we should exclude from the exported dataset?
  2. I need to assign a datasetName to use in the output data. Would the following be OK: "EBMS Denmark transects", "EBMS Denmark 15 minute counts", "EBMS Denmark single-species 15 minute counts" and "EBMS Denmark fixed moth trap"?
  3. Shall I set the rightsHolder to UKCEH?
  4. Shall I add these datasets to the IPT then let you know so you can fill in the metadata?

DavidRoy commented 3 weeks ago

@johnvanbreda thanks for progressing this. Answers below:

  1. No
  2. Yes, these work
  3. I'll check and confirm. Is there a single rightsHolder?
  4. Not yet. I'll confirm if the Denmark group are ready. Is there a convenient tool for writing the metadata?

johnvanbreda commented 3 weeks ago

  3. The rightsHolder can be different for each dataset.
  4. If I add the datasets to the IPT, then you can use the UI of the IPT to fill in the metadata via forms.

johnvanbreda commented 3 weeks ago

@DavidRoy I've added the Denmark transects dataset to the IPT (not published) so you can check the processes and see how the metadata editing works. Another option is to provide an EML file (XML document) - I could provide a template which can be edited in a text editor. I can add the 2 timed datasets to the IPT when you are ready.
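
For the EML template option, here is a minimal sketch of generating a skeleton file that can then be edited by hand. It assumes the EML 2.1.1 namespace and a small set of common dataset elements (`title`, `creator`, `contact`, `abstract`); the `eml_template` function and the `REPLACE-ME` placeholders are hypothetical. The output is not validated against the schema, so it should be checked against an `eml.xml` exported by the IPT before being relied on.

```python
import xml.etree.ElementTree as ET

# Assumed EML version; check against the version the IPT actually writes.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def eml_template(title: str, organisation: str) -> str:
    """Build a skeleton EML document with placeholders to fill in by hand."""
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": "REPLACE-ME", "system": "REPLACE-ME"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    # The same organisation is used for creator and contact in this sketch.
    for role in ("creator", "contact"):
        party = ET.SubElement(dataset, role)
        ET.SubElement(party, "organizationName").text = organisation
    abstract = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract, "para").text = "REPLACE-ME"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


template = eml_template("EBMS Denmark transects", "Example Organisation")
```

Writing `template` to a file gives a document that can be opened in a text editor, with every `REPLACE-ME` marking a value still to be supplied.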