bioimage-io / spec-bioimage-io

Specification for the bioimage.io model description file.
https://bioimage-io.github.io/spec-bioimage-io/
MIT License

dataset RDF #153

Open FynnBe opened 3 years ago

FynnBe commented 3 years ago

It's been on the horizon for a while now: It seems like we need a specialized dataset RDF.

To be discussed:

To prevent this discussion from growing out of proportion, I would like to suggest that we open separate issues (which can be discussed with a subset of people) for specific aspects emerging here, so that we can keep an overview here. I will also try to maintain a summary of what has been discussed and/or decided here:

oeway commented 3 years ago

As I commented here, I think it's too early to define a spec for datasets. At this stage, anyone can propose any sensible settings that are compatible with the general RDF spec, then use them and battle-test them in real applications. At some point we can promote the various options for evaluation and discuss which option is best to define as a config. I think it's totally fine to be a bit chaotic at the beginning; that will give us a basis for converging.

oeway commented 3 years ago

Although I think it's too early for a spec, I think it's definitely helpful to use this issue for discussing and gathering different opinions.

I will share what I have from another project for a single molecule localization microscopy dataset (which I already mentioned in the attachments discussion).

Here is a real RDF file I have (https://sandbox.zenodo.org/record/876447):

type: dataset
name: 2CH test image
description: This is a test image
license: CC-BY-4.0
authors:
  - name: Christophe L.
tags:
  - STORM
  - smlm
cite: []
links: []
covers:
  - ./I1(COS)_CH/screenshot-0_thumbnail.png
attachments:
  samples:
    - name: I1(COS)_CH
      views:
        - config:
            scaleX: 1
            scaleY: 1
            scaleZ: 1
            pointSize: 5
            distance: 4
            fov: 16
            pointSizeMin: 0
            pointSizeMax: 12
            'Total # of locations': 1152480
            x: 1
            'y': 1
            z: 1
            point size: 3
            x min: 0
            x max: 1
            y min: 0
            y max: 1
            z min: 0
            z max: 1
            active 0: true
            color 0:
              - 255
              - 28
              - 14
            alpha 0: 0.85
            active 1: true
            color 1:
              - 0
              - 255
              - 255
            alpha 1: 0.85
            Fps: 51
            files:
              - I1(COS)_CH2_clathrin.xls
              - I1(COS)_CH1_microtubules.xls
            viewer_type: window
          image_name: screenshot-0.png
      files:
        - name: data.smlm
          size: 14637650
          checksum: 293djf39ssf234s23423ef2324sdg34
id: 10.5072/zenodo.876447
FynnBe commented 3 years ago

if you just replace attachments with config in your example we have a deal 😉

oeway commented 3 years ago

if you just replace attachments with config in your example we have a deal 😉

Well, yes, but then we wouldn't even need to start this discussion. Say I add an attachments key to config: the same issue persists, no?

FynnBe commented 3 years ago

Well, yes, but then we wouldn't even need to start this discussion. Say I add an attachments key to config: the same issue persists, no?

no, anything in config is, per the current specification, a free-for-all... what to include there is entirely up to you, and it's the place to try out new fields, mechanisms, etc... we do not expect entries relying on config to be cross-compatible at all.

oeway commented 3 years ago

no, anything in config is, per the current specification, a free-for-all... what to include there is entirely up to you, and it's the place to try out new fields, mechanisms, etc... we do not expect entries relying on config to be cross-compatible at all.

I got that. What I meant is that it does pass the validator for the current version, but we still don't have a solution for displaying a list of attached objects for any RDF. Again, we cannot display a list of raw URIs to the user, and we cannot pull them because each URI may point to GB-sized data.

tomburke-rse commented 3 years ago

I got that. What I meant is that it does pass the validator for the current version, but we still don't have a solution for displaying a list of attached objects for any RDF. Again, we cannot display a list of raw URIs to the user, and we cannot pull them because each URI may point to GB-sized data.

The middle way would be a dict[Str, URI], where the string is a descriptive name for the URI, so that the user knows what is behind the URI and we do not have to pull it up. That's a middle ground and an easy alternative to the nice card with n pieces of meta info, which I still think would be overkill with anything above 20 items, but that's just my opinion on user behaviour.
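A minimal sketch of what such a dict[Str, URI] attachment could look like (the key name, URIs and labels below are all hypothetical, not part of any spec):

```yaml
# Hypothetical sketch: descriptive names mapped to (possibly huge) URIs.
# Nothing is downloaded just to render this list; the keys are pure labels.
attachments:
  microtubules (channel 1): https://example.org/data/I1_CH1_microtubules.xls
  clathrin (channel 2): https://example.org/data/I1_CH2_clathrin.xls
  raw acquisition (~14 MB): https://example.org/data/data.smlm
```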

oeway commented 3 years ago

Let's forget about changing attachments key, and assume we now have another key called attached_items for storing dataset.

The middle way would be a dict[Str, URI], where the string is a descriptive name for the URI, so that the user knows what is behind the URI and we do not have to pull it up.

Well, first, it only partly solves the readability issue: for example, if we attach 1000 images to a dataset, we only see names like image-1.png, image-2.png, image-3.png... you see what I mean. We need a thumbnail in this case, and potentially the size of the image, etc. Another issue with using only a dict is that you cannot sort the items, which is important, for example, to reproduce the training of a model.

If you want to attach a label to a sample, it's not possible, and if you want to store image-mask pairs for a segmentation dataset, that's again not straightforward.

That's a middle ground and an easy alternative to the nice card with n pieces of meta info, which I still think would be overkill with anything above 20 items, but that's just my opinion on user behaviour.

Why is that? The thing is that we don't necessarily display everything at once: with the meta info, in a dataset containing thousands to millions of samples we can display the first 20 items and show a search bar that allows users to quickly search through the meta info.
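One way the requirements above (thumbnails, sizes, labels, image-mask pairs, stable ordering) could be expressed as a sketch — every field name here is hypothetical, not part of the current RDF spec:

```yaml
# Hypothetical: a list keeps the ordering reproducible; each entry carries
# lightweight meta info so a viewer never has to fetch the full files.
samples:
  - name: sample-0001
    thumbnail: ./thumbs/sample-0001.png    # small preview, cheap to fetch
    label: mitochondria                    # per-sample annotation
    files:
      image: https://example.org/raw/sample-0001.tif
      mask: https://example.org/masks/sample-0001.tif   # paired ground truth
    size: 52428800                         # bytes, shown before any download
```

A list of such entries could also back the "first 20 items plus search bar" display, since the searchable meta info is available without touching the data itself.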

tomburke-rse commented 3 years ago

Is that really a use case for a dataset? I can understand giving some samples to the interested user, but wouldn't a handful suffice (maybe in config:covers?)? I'd expect someone interested in a dataset to just download the whole thing (as a zip or in another format) and not handpick individual images out of it. The same goes for a sufficiently large number n of examples: searching is nice and all, but then I already roughly know what is in the dataset to search for.

I'm not really convinced that this is actually needed. It feels more like the dataset RDF could use a samples: List[Sample] field to handle the task you just described. Wouldn't that be enough?

oeway commented 3 years ago

Is that really a use case for a dataset? I can understand giving some samples to the interested user, but wouldn't a handful suffice (maybe in config:covers?)? I'd expect someone interested in a dataset to just download the whole thing (as a zip or in another format) and not handpick individual images out of it. The same goes for a sufficiently large number n of examples: searching is nice and all, but then I already roughly know what is in the dataset to search for.

I have a somewhat different perspective on how these will be used. Here is the thing: we are dealing with larger and larger datasets. Take the HPA dataset, for example: we have over 10 TB of images in TIFF format. In our Kaggle image classification challenge we used a subset of it, but that can still reach TB scale. In this case, you cannot just download the whole thing; in fact, this is a limiting factor for using our data. When you know that clicking a button will take days of downloading and fill your hard drive, you won't click that button. So the use case here is to really enable others to explore the entire dataset without downloading it, and to search for, select and download a subset. We want to allow them to search and hand-pick images, to only view them or to test them with our models directly in bioimage.io. Keep in mind that there are even larger datasets, e.g. in the EM field.

In fact, new scalable file formats such as NGFF are built with similar ideas in mind: they aggregate the meta info in separate files, store the data in chunks, and store a multi-resolution pyramid. This enables a quick overview of a massive image by downloading only a low-resolution version of it. The same idea applies here: we want to support a quick overview of the samples without touching the actual sample data.
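For context, the NGFF idea sketched above looks roughly like this — a paraphrased YAML rendering, not the exact OME-NGFF schema:

```yaml
# Paraphrased sketch of NGFF-style multiscale metadata (field names
# simplified, not the real OME-NGFF layout): the meta info lives apart
# from the chunked pixel arrays, so a viewer can grab only the lowest
# resolution level for a quick overview.
multiscales:
  - axes: [y, x]
    datasets:
      - path: "0"        # full resolution, fetched only on demand
      - path: "1"        # 2x downsampled
      - path: "2"        # 4x downsampled; enough for a preview
```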

I'm not really convinced that this is actually needed. It feels more like the dataset RDF could use a samples: List[Sample] field to handle the task you just described. Wouldn't that be enough?

I agree with you here: what we need is indeed List[Sample]. But my point is that Sample is not yet defined, and I doubt we can define it easily, so I propose to stick with Any for now.
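Concretely, typing the field as List[Any] would let heterogeneous entries coexist until a Sample schema emerges from practice — a hypothetical example (none of these shapes are specified anywhere):

```yaml
# Hypothetical: with samples typed as List[Any], early adopters can
# battle-test different entry shapes side by side before Sample is fixed.
samples:
  - ./images/plain-uri.tif                  # bare URI, simplest form
  - name: annotated sample                  # richer ad-hoc mapping
    image: ./images/cell-07.tif
    mask: ./masks/cell-07.tif
  - config:                                 # fully custom structure
      viewer_type: window
      files: [CH1.xls, CH2.xls]
```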