BiologicalRecordsCentre / plantportal

Focused repo for the Plant Portal website

Publishing data to GBIF #23

Open sacrevert opened 3 years ago

sacrevert commented 3 years ago

This is a placeholder for a longer-term issue, but ultimately we would like project organisers to choose whether to publish their project data to GBIF taking advantage of work that @johnvanbreda has already done on formatting indicia exports for the BRC GBIF IPT.

johnvanbreda commented 3 years ago

Since this is a potentially complex requirement, I think it would be a good idea to document the primary driving factors for wanting to do this, so we can make sure we design the simplest solution for the requirement.

sacrevert commented 3 years ago

In my mind, the project-level publishing was to allow flexibility for individual project managers/data collectors. We would then allow individual projects to choose different licences, and it would be easier for downstream users to just get the data they wanted with a DOI, rather than having to download everything from GBIF and then sort through it. I was thinking that this would be particularly important for historic datasets (like the plots behind the National Vegetation Classification), because there is a clear advantage to having sets like this as individual citable items. However, it sounds like this could be difficult to achieve. Not sure if @kitenetter or @DavidRoy would like to comment?

DavidRoy commented 3 years ago

I agree with Oli's logic that this export is best controlled at the 'project' level, which I believe corresponds to the 'group' level within the Plant Portal. @johnvanbreda presumably this is met by defining a query that defines a dataset and then packaging that up in a download suitable for the GBIF IPT. The export also needs to retain the dataset structure, i.e. Darwin Core event format.

sacrevert commented 3 years ago

@andrewvanbreda can confirm which Indicia structure defines the project

andrewvanbreda commented 3 years ago

@sacrevert Plant Portal projects are actually Indicia Recording Groups

johnvanbreda commented 3 years ago

Thanks all. Realistically, I think we can get project organisers to choose a licence for the project (since there is already a field for this), plus we could add a "published" flag. When this is set, we could automate the generation of the DwC files required for the IPT. However, on initial setup there will still be a manual process for setting up the dataset on the IPT and filling in the large number of metadata fields. We can't capture this information from within Indicia as I don't think the IPT has an API that would allow us to automate filling in the metadata, so it will have to be added manually.

sacrevert commented 3 years ago

Darwin Core Archive format is a zipped folder of data files and XML metadata files, so presumably we could set up a page that would essentially create something that could be directly uploaded to the IPT with all the required metadata. However, no doubt this would be somewhat resource-intensive to set up in the first place. Is it worth it? @DavidRoy @johnvanbreda
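To make that concrete, the archive such a page would need to assemble is just three parts zipped together: the data file, a meta.xml mapping its columns to Darwin Core terms, and an eml.xml carrying the dataset metadata. A minimal sketch in Python, with a purely illustrative two-column mapping and a placeholder EML (not the Plant Portal's actual export):

```python
import io
import zipfile

# Minimal meta.xml declaring an Occurrence-core archive; the column
# mapping here is illustrative only.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""

def build_dwca(occurrence_csv, meta_xml, eml_xml):
    """Zip the three parts of a Darwin Core Archive and return the bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("occurrence.csv", occurrence_csv)  # the data itself
        z.writestr("meta.xml", meta_xml)              # column -> DwC term mapping
        z.writestr("eml.xml", eml_xml)                # dataset metadata (title, licence, contacts)
    return buf.getvalue()

archive = build_dwca("id,scientificName\n1,Bellis perennis\n",
                     META_XML,
                     "<eml/>")  # placeholder, not valid EML metadata
```

The resource-intensive part John and Oli discuss is not this packaging step but generating a complete, valid eml.xml from user-supplied metadata.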

DavidRoy commented 3 years ago

Let's do the metadata manually to start with until we can assess demand for more automation?

johnvanbreda commented 3 years ago

Although the DwC archive file can contain metadata, from memory when I set up a DwC dataset on the IPT you have to fill in a certain amount of extra information manually anyway - I might be wrong, but I thought the metadata in the archive was ignored. Either way, it means that initially adding the dataset to the IPT needs to be done manually.

Currently I have a process which can be scheduled periodically to grab datasets from Elasticsearch and create DwC archive files, or just CSV files. These could be dumped in the appropriate location in the IPT to update the dataset (I think we can configure it to update automatically on GBIF). At the moment, this process uses JSON configuration files which define the filter and where the file should go - for example in my demo I grab the sawfly data from iRecord using an Elasticsearch query. I guess we could code this to pick up the data from all NPMS groups where a certain flag has been set by the project organiser and generate the IPT output files.
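The per-dataset configuration John describes could be sketched as below; the field names and the fetch/write hooks are hypothetical, just to show the shape of a scheduled job that picks up a flagged group's records and drops the export file where the IPT expects it:

```python
# Hypothetical shape of one export job: an Elasticsearch filter plus an
# output destination. These field names are illustrative, not the actual
# module's configuration schema.
EXPORT_JOBS = [
    {
        "name": "npms-group-123",
        "es_query": {"term": {"metadata.group.id": 123}},  # one project's records
        "format": "dwca",                                   # or "csv"
        "output_path": "/var/ipt/resources/npms-group-123/source.zip",
    },
]

def run_exports(jobs, fetch, write):
    """Run each configured job: fetch matching records, write the export file.

    `fetch` and `write` are injected so the loop stays testable; in a real
    deployment, fetch would query Elasticsearch and write would place the
    file where the corresponding IPT resource picks it up.
    """
    for job in jobs:
        records = fetch(job["es_query"])
        write(job["output_path"], job["format"], records)
```

Enabling a new project would then amount to appending one entry to the job list, which matches the "small tweak to the configuration of my tool" mentioned below.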

Given that the project organiser will need to tell someone with IPT access rights to set this up, I'm not sure there is any real benefit in adding complexity to Indicia to allow the project organiser to "turn on" publication. Maybe we could just have a Drupal contact form, configured to capture any useful metadata, then someone will have to configure the IPT and make a small tweak to the configuration of my tool in order to enable generation of the dataset.

Perhaps we could have a project example to trial this process on?

johnvanbreda commented 2 years ago

Plan of action is as follows:

sacrevert commented 2 years ago

Thanks @johnvanbreda, that all sounds great. My only question is around the NBN (I thought I had asked this before, but perhaps I didn't) -- do you know whether the NBN can re-ingest data from GBIF, or could we simultaneously publish to both once everything was set up on the IPT? I'm just thinking that it would be good for the data from a project to be on both platforms, as, certainly for the local "NPMS+" category of projects, project admins are going to be more familiar with the NBN than with GBIF.

DavidRoy commented 2 years ago

afaik the NBN does not receive datasets via the IPT, so it would need a manual supply. Also, I believe the NBN does not handle DwC event data, but I might be wrong about that; @kitenetter might be able to advise

sacrevert commented 2 years ago

@DavidRoy the NBN can handle event data, as the NPMS data are presented in this way. It sounds as though the only option might be for the NBN to reingest from GBIF. I will ask them about that now.

sophiathirza commented 2 years ago

We don't currently ingest any datasets from GBIF on the NBN Atlas, but we could. I think that we would use the DwC-A endpoint on GBIF, which I imagine would be your IPT. I will investigate.

By event data - do you mean that the Event file is the core file? We might not be able to accept that until after the upgrade of the Atlas.

johnvanbreda commented 2 years ago

Even if the NBN cannot ingest data either from the IPT or GBIF, we always have the option of passing the DwC extraction from Indicia both to the IPT and to the NBN.

sacrevert commented 2 years ago

From @sophiathirza, explaining to me why the NBN can display occurrence groupings (events) but not ingest files where the events are core:

A DwCA has a 'core' file and optional extensions. Originally GBIF only allowed either Occurrence or Taxon core files in the archive (their archive assistant illustrates it nicely: http://tools.gbif.org/dwca-assistant/), and the Event information would be in an extension. But relatively recently they have started to allow archives where the Event file is the core and the Occurrence can be an extension, which allows multiple occurrences for a single event (sample). There is more information here: https://www.gbif.org/darwin-core under the "What's in an archive?" header.

It's just about how the information is stored in the archive: either a one-to-one relationship between the Event and the Occurrence, where the event information is repeated for each occurrence, or a one-to-many relationship, where there is one file for the events and a second file for the occurrences.

I have the same issue with archives from the MBA's IPT and have a script to produce a single Occurrence csv that contains the event information for each record, but we really need the Atlas to be able to do it for us. The NPMS data from BRC comes in a single occurrence csv file.
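The flattening script Sophia describes amounts to joining the occurrence file onto the event file on eventID, repeating the event columns on every occurrence row (the one-to-one layout the Atlas can currently ingest). A small illustrative version, assuming conventional DwC column names:

```python
import csv
import io

def flatten(event_csv, occurrence_csv):
    """Join occurrences to their events on eventID, repeating the event
    columns on each occurrence row."""
    events = {row["eventID"]: row
              for row in csv.DictReader(io.StringIO(event_csv))}
    occ_reader = csv.DictReader(io.StringIO(occurrence_csv))
    # Event columns to copy onto each occurrence (everything except the key).
    event_cols = [c for c in next(iter(events.values())) if c != "eventID"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=occ_reader.fieldnames + event_cols)
    writer.writeheader()
    for row in occ_reader:
        ev = events[row["eventID"]]
        writer.writerow({**row, **{c: ev[c] for c in event_cols}})
    return out.getvalue()

EVENTS = "eventID,eventDate,locationID\nE1,2020-05-01,L1\n"
OCCS = ("occurrenceID,eventID,scientificName\n"
        "O1,E1,Bellis perennis\n"
        "O2,E1,Plantago major\n")
# Each occurrence row now carries the eventDate and locationID of its event.
flat = flatten(EVENTS, OCCS)
```

The cost of this layout is the one Sophia implies: the event-level structure becomes implicit (rows sharing an eventID), so the one-to-many relationship has to be reconstructed downstream.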

johnvanbreda commented 2 years ago

New Indicia Auto Exports module installed onto test version of IPT and instructions sent to @sacrevert.

johnvanbreda commented 2 years ago

@sacrevert I've now enabled the latest version of the auto exports module on the test version of the Plant Portal. The form is available at https://test-brc-plantportal.pantheonsite.io/webform/published_group_metadata and it now should have a complete set of the metadata for the IPT.

sacrevert commented 2 years ago

Thanks John! Are there now steps for @BirenRathod or @andrewvanbreda in terms of also making this available on live?

johnvanbreda commented 2 years ago

@sacrevert I would suggest an initial test in the test environment, then if it is OK the module just needs enabling on the live environment.