gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Make source files downloadable with datasets typed "other" #1847

Open ckotwn opened 1 year ago

ckotwn commented 1 year ago

The IPT can host datasets that neither use one of the cores nor are metadata-only. When creating a resource, choosing the type Other (though optional) allows publishers to deposit source data that cannot yet be mapped to the standards supported by GBIF. An example is this gazetteer dataset: https://ipt.taibif.tw/resource?r=gazetteerofnechina, although its metadata needs improvement.

Currently, only the metadata is available for download once the resource is published, because it's not possible to generate a DwC-A. The IPT should expose the source data as they are, to allow publishers to truly share miscellaneous types of data and to make the data available for future GBIF indexing.

As more and more people are asking to share non-DwC datasets, this would be our answer.

For implementation, perhaps we want to block files other than plain text, because files such as media are obviously subject to discussion at the strategic level.

mike-podolskiy90 commented 1 year ago

@ckotwn Thank you for the suggestion, we'll consider it.

abubelinha commented 1 year ago

@ckotwn The description of the example you show (gazetteerofnechina) says: "Village-level administrative divisions in Northeast China".

That looks like geospatial data, not a biodiversity-related dataset.

I think general data repositories like Figshare or Zenodo are much more appropriate for this kind of dataset.
If you also need to have them available in the IPT/GBIF, you can just put your files in one of those repositories and link to them from your "Other" IPT dataset. Can't you?

Maybe I am misunderstanding your issue.

ckotwn commented 1 year ago

@abubelinha Thanks for the comment.

Well, it's not really my "other" type of dataset ;) The IPT has this feature, and I think its implementation stops halfway. And now I've found a good reason to complete it.

But a gazetteer dataset may not be the best example. This dataset comes from a BIFA project where part of the project output can't yet be indexed. So in this case, I think it makes sense for data users that this dataset stays within the same IPT installation, and it saves data publishers the burden of maintaining two copies of the metadata online.

Gazetteers may be a grey-zone category. Although similar datasets are available all over, I think it probably doesn't hurt to host them if they help with interpreting biodiversity datasets within GBIF.

Where this feature may be truly useful is for datasets that are biodiversity-related but do not yet conform to any standard.

There could be further discussion about which categories we would consider valid for sharing using this Other type. But as the IPT is a publishing toolkit, whatever is uploaded should, IMO, be accessible once published.

peterdesmet commented 1 year ago

This is indeed a larger issue that touches on what the scope of the IPT should be.

ManonGros commented 1 year ago

I agree that since the IPT allows creating an "other" type of dataset and uploading the files, it is too bad that those files aren't publicly accessible. It would be nice to have something consistent.

abubelinha commented 1 year ago

I see. I admit I had never noticed that "other" option. I had always thought the only target datasets of the IPT were those suitable for being mapped to DwC (+ extensions):

"The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT’s two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format."

... from Robertson T, Döring M, Guralnick R, Bloom D, Wieczorek J, Braak K, et al. (2014) The GBIF Integrated Publishing Toolkit: Facilitating the Efficient Publishing of Biodiversity Data on the Internet. PLoS ONE 9(8): e102623. DOI 10.1371/journal.pone.0102623

@ckotwn : Where this feature may be truly useful is for datasets that are biodiversity-related but do not yet conform to any standard.

@ManonGros : I agree that since the IPT allows creating an "other" type of dataset and uploading the files, it is too bad that those files aren't publicly accessible. It would be nice to have something consistent.

Do you mean that biodiversity data sources which are not (yet) mapped to DwC should be downloadable anyway whenever the dataset is of the "other" type? I think that would go against the IPT purpose described above. Publishers could start using that lazy option more and more and forget about the not-so-hard work of mapping to the standard. That would make the data almost useless (not searchable or downloadable through the portal and API-related tools, which is what really makes GBIF so great). IMHO the IPT was not designed for that, and we should use other external servers instead (just for the data: I don't see a reason to reproduce all the metadata there, as @ckotwn suggests). So I guess that from the IPT perspective this would be a "metadata only" dataset.

So what are those "other" IPT datasets supposed to be? I don't clearly understand the motivation for the IPT having an "other" option, but I don't think the idea behind it was to let users upload data in DwC-unmappable formats. Looking at the IPT manual, under the Basic metadata section, I get the impression that "type" always refers to different types of DwC implementations, and nothing else:

Type - the type of resource. The value of this field depends on the core mapping of the resource and is no longer editable if the Darwin Core mapping has already been made. If a desired type is not found in the list, the field "other" can be selected. Review the information under the "Configure Core Types and Extensions" heading of the "Administration Menu" section.

Perhaps the IPT designers were thinking about other kinds of DwC datasets that may not be well described by the classic "checklist", "occurrence", "sampling event" or "metadata only" denominations, and so decided to create an option for "other" DwC possibilities.

Maybe the IPT authors could clarify this.