gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Idea: Support supplementary files in the DwC-A and FrictionlessData output #2109

Open timrobertson100 opened 11 months ago

timrobertson100 commented 11 months ago

I'd like to share an idea to see if this would be of interest to the IPT community. Feedback sought.

I propose the IPT be enhanced to allow a user to upload additional supplementary files that relate to the dataset, and have them included in the DwC-A or FrictionlessData package output.

I expect the uses for this could be many and would make the IPT an even more useful data repository, but concretely we're moving towards the inclusion of Newick files for phylogenies (example) that would be well suited for this. Over the years I'm aware of a desire for the IPT to host some images for a dataset too, although that may require special attention due to size.

I imagine the output for DwC-A could include:

/ 
  - eml.xml
  - occurrence.txt
  - meta.xml
  - supplementary-data
      - images
        - image1.jpg
        - image2.jpg
      - sequences
        - newick1.nwk
      - other
        - procedures.pdf

When uploading supplementary data, the user would select a category from a drop-down (e.g. images) and the IPT could enforce limits on individual file size, total file count, and total size or so.
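
The limits described above could be checked along these lines. This is only a minimal sketch: the category names are taken from the example layout, while `UploadLimits`, `check_upload` and every numeric value are hypothetical and not part of the IPT.

```python
from dataclasses import dataclass

# Hypothetical categories, mirroring the folders in the example layout.
ALLOWED_CATEGORIES = {"images", "sequences", "other"}


@dataclass(frozen=True)
class UploadLimits:
    # All values are illustrative assumptions, not IPT defaults.
    max_file_bytes: int = 10 * 1024 * 1024    # per-file cap
    max_file_count: int = 100                 # files per resource
    max_total_bytes: int = 500 * 1024 * 1024  # all supplementary files together


def check_upload(existing_sizes, category, new_size, limits=UploadLimits()):
    """Return a list of problems; an empty list means the upload is acceptable.

    existing_sizes: sizes in bytes of files already attached to the resource.
    """
    problems = []
    if category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category}")
    if new_size > limits.max_file_bytes:
        problems.append("file exceeds the per-file size limit")
    if len(existing_sizes) + 1 > limits.max_file_count:
        problems.append("resource would exceed the file count limit")
    if sum(existing_sizes) + new_size > limits.max_total_bytes:
        problems.append("resource would exceed the total size limit")
    return problems
```

Returning a list of problems rather than raising on the first one would let the upload form report everything that is wrong in a single pass.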

I'd expect the IPT to be able to serve the files on a URL such as https://ipt.example.com/resource/2021-dwc-updates/supplementary_data/images/image1.jpg or so.

I haven't given much thought to licensing but imagine we might need a license per file to e.g. allow for restrictive licenses on images. We might also take the opportunity to enforce permissive licenses that allow research (e.g. only allow CC0, CC-BY) for this, or even require that it falls under the same license as the dataset as a whole.
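
To make the two licensing options concrete, here is a rough sketch of a per-file check. The policy names and the `license_allowed` function are hypothetical, and the use of SPDX-style identifiers for CC0 and CC-BY is an assumption for illustration.

```python
# Hypothetical allowlist; the issue only names CC0 and CC-BY as candidates.
PERMISSIVE_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


def license_allowed(file_license, dataset_license, policy="permissive"):
    """Check a per-file license under one of two hypothetical policies.

    "permissive": the file license must be on the allowlist.
    "same-as-dataset": the file license must equal the dataset license.
    """
    if policy == "permissive":
        return file_license in PERMISSIVE_LICENSES
    if policy == "same-as-dataset":
        return file_license == dataset_license
    raise ValueError(f"unknown policy: {policy}")
```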

Thanks

ymgan commented 11 months ago

Yes please, we have been hoping to be able to attach images for the sampling methods section in the metadata. This would be so helpful! Thank you for the idea!

mike-podolskiy90 commented 11 months ago

Thanks Tim!

peterdesmet commented 11 months ago

Neat! For images, it would be useful if the mapping page for the Audubon Media Extension offered some smart solutions for accessURI to automatically create a path to the included images, based on a fixed prefix (https://ipt.example.com/resource/2021-dwc-updates/supplementary_data/images/) and the provided image name (image1.jpg).
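
A minimal sketch of that prefix-plus-name idea, assuming the mapping step simply concatenates the two. The `access_uri` helper is hypothetical and not an existing IPT function.

```python
from urllib.parse import quote


def access_uri(prefix, file_name):
    """Join a fixed URL prefix and a file name into an accessURI value."""
    if "/" in file_name or file_name in ("", ".", ".."):
        raise ValueError(f"not a plain file name: {file_name!r}")
    # Percent-encode characters such as spaces so the result is a valid URL.
    return prefix.rstrip("/") + "/" + quote(file_name)


prefix = "https://ipt.example.com/resource/2021-dwc-updates/supplementary_data/images/"
print(access_uri(prefix, "image1.jpg"))
```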

mdoering commented 11 months ago

That's very useful indeed! I regularly bundle raw data files such as DB dumps, Excel sheets or PDFs with archives if they are not too large.

I also proposed this as part of the ColDP support: https://github.com/gbif/ipt/issues/1979#issuecomment-1513594846. ColDP allows sharing a binary logo image and a reference BibTeX file. If users could upload these, that would definitely be useful. Those would have to be named according to the standard, though, and could not live in a supplementary subfolder.

MattBlissett commented 11 months ago

> Neat! For images, it would be useful if the mapping page for the Audubon Media Extension offered some smart solutions for accessURI to automatically create a path to the included images, based on a fixed prefix (https://ipt.example.com/resource/2021-dwc-updates/supplementary_data/images/) and the provided image name (image1.jpg).

Links within the DWCA could be absolute (if the IPT serves the images) or relative (supplementary-data/images/image1.jpg), pointing to the files within the archive.
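
From a consumer's point of view, telling the two kinds of link apart could look roughly like this. It is a sketch under assumptions: `resolve_media_link` is a hypothetical helper, and only `http` and `https` are treated as absolute.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_media_link(link):
    """Classify a link as ("absolute", url) or ("relative", path-in-archive)."""
    if urlsplit(link).scheme in ("http", "https"):
        return ("absolute", link)
    # Normalise the path and refuse anything that would leave the archive root.
    path = posixpath.normpath(link)
    if path.startswith("/") or path == ".." or path.startswith("../"):
        raise ValueError(f"link points outside the archive: {link!r}")
    return ("relative", path)
```

Relative links keep the archive self-contained, but a reader then has to guard against paths that escape the archive root, which is what the check above is for.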

timrobertson100 commented 11 months ago

Adding for context: a GBIF Node that makes use of our hosted IPT service has asked if GBIF can help with hosting ~300 GB of images.

If the idea in this issue progresses, we might consider the mechanics of how a user would get hundreds of GB of images into the IPT, bearing in mind they may not have access to the file system. For example, an asynchronous "import tarball from URL" function might be appropriate.

mdoering commented 11 months ago

Would you want to add that much extra data to the archive itself? Maybe a second, supplementary archive file would be better in that case, though keeping it all together is one of the great advantages.

timrobertson100 commented 11 months ago

Good point. Media files might also be a special case to consider, compared to typical supplementary CSV files, Newick files etc., which are likely to be far smaller.

On the packaging, there are 3 options we could consider:

  1. all together as one,
  2. as one or more secondary supplementary archives to accompany the DwC-A, or
  3. two archives where one includes and one doesn't include the supplementary data
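
As a rough illustration of option 3, the two archives could be produced from the same set of files by filtering on the supplementary folder. This is only a sketch: `build_archive` and `archive_names` are hypothetical helpers, the files are held in memory for brevity, and the folder name is taken from the layout proposed earlier in this issue.

```python
import io
import zipfile

SUPPLEMENTARY_DIR = "supplementary-data/"  # folder name from the proposed layout


def build_archive(files, include_supplementary):
    """Zip a {path: bytes} mapping, optionally leaving out supplementary files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(files):
            if not include_supplementary and path.startswith(SUPPLEMENTARY_DIR):
                continue
            archive.writestr(path, files[path])
    return buffer.getvalue()


def archive_names(data):
    """List the entry names inside a zip held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()
```

Calling `build_archive(files, True)` and `build_archive(files, False)` would give the full and the slim archive respectively. For volumes in the hundreds of GB a real implementation would need to stream from disk instead.
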

peterdesmet commented 11 months ago

I think the use case for hosting/archiving media files is maybe better served by a separate media service, potentially provided by GBIF, potentially included in the IPT. For such large volumes, it doesn't seem scalable to include 300 GB of media in an archive and republish it for every version.