FynnBe opened this issue 3 years ago
As I commented here, I think it's too early to define a spec for datasets. At this stage anyone can propose any sensible settings that are compatible with the general RDF spec, then use them and battle-test them in real applications. At some point we can promote the various options for evaluation and discuss the best one to define as a config. I think it's totally fine to be a bit chaotic at the beginning; that will give us a basis for converging.
Although I think it's too early for a spec, I think it's definitely helpful to use this issue for discussing and gathering different opinions.
I will share what I have from another project for a single molecule localization microscopy dataset (I already mentioned it in the attachments discussion).
Here is a real RDF file I have (https://sandbox.zenodo.org/record/876447):
```yaml
type: dataset
name: 2CH test image
description: This is a test image
license: CC-BY-4.0
authors:
  - name: Christophe L.
tags:
  - STORM
  - smlm
cite: []
links: []
covers:
  - ./I1(COS)_CH/screenshot-0_thumbnail.png
attachments:
  samples:
    - name: I1(COS)_CH
      views:
        - config:
            scaleX: 1
            scaleY: 1
            scaleZ: 1
            pointSize: 5
            distance: 4
            fov: 16
            pointSizeMin: 0
            pointSizeMax: 12
            'Total # of locations': 1152480
            x: 1
            'y': 1
            z: 1
            point size: 3
            x min: 0
            x max: 1
            y min: 0
            y max: 1
            z min: 0
            z max: 1
            active 0: true
            color 0:
              - 255
              - 28
              - 14
            alpha 0: 0.85
            active 1: true
            color 1:
              - 0
              - 255
              - 255
            alpha 1: 0.85
            Fps: 51
          files:
            - I1(COS)_CH2_clathrin.xls
            - I1(COS)_CH1_microtubules.xls
          viewer_type: window
          image_name: screenshot-0.png
      files:
        - name: data.smlm
          size: 14637650
          checksum: 293djf39ssf234s23423ef2324sdg34
id: 10.5072/zenodo.876447
```
If you just replace `attachments` with `config` in your example, we have a deal 😉
Well, yes, but then we don't even need to start the discussion. Let's say I add an `attachments` key to `config`; the same issue persists, no?
No, anything in `config` is, per the current specification, free for all. What to include there is totally up to you, and it is the place to try out new fields, mechanisms, etc. We do not expect to make entries relying on `config` cross-compatible at all.
I got that; what I meant is that it does pass the validator for the current version, but we still don't have a solution to display a list of attached objects for any RDF. Again, we cannot display a list of raw URIs to the user, and we cannot pull them because each URI may point to GB-sized data.
The middle way would be a `Dict[str, URI]`, where the string is a descriptive name for the URI, so that the user knows what is behind the URI and we do not have to pull it at all. That's a middle ground and an easy alternative to the nice card with meta info, which I still think would be overkill for anything above 20 items, but that's just my opinion on user behaviour.
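As a sketch of that middle way (the names and URLs below are made up for illustration, not part of any spec), the mapping could look like this:

```python
from typing import Dict

# A URI is just a string here; nothing below is ever downloaded.
URI = str

# Hypothetical attachments: descriptive name -> URI.
attachments: Dict[str, URI] = {
    "clathrin channel (xls)": "https://example.org/data/I1_CH2_clathrin.xls",
    "microtubules channel (xls)": "https://example.org/data/I1_CH1_microtubules.xls",
}

# A viewer can list the human-readable keys without fetching any URI.
for name in attachments:
    print(name)
```

The keys carry just enough meaning for the user to decide whether a URI is worth fetching.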
Let's forget about changing the `attachments` key, and assume we now have another key called `attached_items` for storing datasets.
> The middle way would be a dict[Str, URI], where the string is a descriptive name for the URI so that the user knows what is behind the URI and we do not pull that URI up.
Well, first, it doesn't fully solve the readability issue: for example, if we attach 1000 images to a dataset, we only see names like image-1.png, image-2.png, image-3.png..., you see what I mean; we need a thumbnail in this case, and potentially also the size of the image etc. Another issue with using only a dict is that you cannot sort the items, which is important, for example, to reproduce the training of a model.
If you want to store a label with a sample, it's not possible, and if you want to store image-mask pairs for a segmentation dataset, it's again not straightforward.
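To make the limitation concrete, here is a hypothetical `Sample` shape (all field names are illustrative, not a spec proposal) expressing things a plain `Dict[str, URI]` cannot: ordered samples, a thumbnail, a label, and an image-mask pair:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Sample:
    # All field names here are hypothetical, for illustration only.
    name: str
    files: Dict[str, str]            # role -> URI, e.g. {"image": ..., "mask": ...}
    thumbnail: Optional[str] = None  # small preview, cheap to fetch eagerly
    label: Optional[str] = None      # e.g. a class label for classification

# A list preserves ordering, which a plain name->URI dict cannot convey.
samples: List[Sample] = [
    Sample(
        name="I1(COS)_CH",
        files={"image": "https://example.org/img-0001.png",
               "mask": "https://example.org/mask-0001.png"},
        thumbnail="https://example.org/thumb-0001.png",
        label="clathrin",
    ),
]
```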
> That's a middle ground and easy solution to the nice card with n meta infos which I still think would be overkill with anything above 20 items, but that's just my user behaviour opinion.
Why is that? The thing is that we don't necessarily display everything at once: with the meta info, for a dataset containing thousands to millions of samples we can display the first 20 items and show a search bar that allows users to quickly search through the meta info.
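The browse-without-download idea can be sketched as a search over the sample meta info only (the sample fields here are assumptions for illustration; the referenced files are never touched):

```python
from typing import Dict, List

PAGE_SIZE = 20

def search_samples(samples: List[Dict], query: str, page: int = 0) -> List[Dict]:
    """Filter samples by a substring match on their name and return one page.

    Only the small metadata dicts are inspected; the actual files
    behind the samples are never downloaded.
    """
    hits = [s for s in samples if query.lower() in s["name"].lower()]
    start = page * PAGE_SIZE
    return hits[start:start + PAGE_SIZE]

# Illustrative metadata for 1000 samples; names are made up.
catalog = [{"name": f"image-{i}.png", "size": 1024 * i} for i in range(1000)]

first_page = search_samples(catalog, query="image-9")
```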
Is that really a use case for a dataset? I can understand giving some samples to the interested user, but wouldn't a handful suffice (maybe in config:covers?)? I'd expect someone interested in a dataset to just download the whole thing (as a zip or any other format) and not handpick singular images out of it. The same goes for a sufficiently large number of examples: searching is nice and all, but then I already kind of know what is in the dataset to search for.
I'm not really convinced that this is actually needed. It feels more like the dataset RDF could use a `samples: List[Sample]` field to handle the task you just described. Wouldn't that be enough?
> Is that really a use-case in a dataset? I can understand to give some samples to the interested user, but wouldn't a handful suffice (maybe in config:covers?)? I'd expect someone interested in a dataset to just download the whole thing (as a zip or any other format) and not handpick singular images out of it. Same goes for the sufficiently large n number of examples: Searching is nice and all, but then I already kind of know what is in the dataset to search for.
I have a somewhat different perspective on how to use these. Here is the thing: we are dealing with larger and larger datasets. Take the HPA dataset, for example: we have over 10 TB of images in TIFF format. In our Kaggle image classification challenge we used a subset of it, which can still reach TB scale. In this case, you cannot just download the whole thing, and in fact this is a limiting factor for using our data. When you know that clicking a button will take days of downloading and fill your hard drive, you won't click that button. So the use case here is really to enable others to explore the entire dataset without downloading it, and to search for, select, and download only a subset. We want to let them search and hand-pick images, to view them or test them with our models directly on bioimage.io. Keep in mind that there are even larger datasets, e.g. in the EM field.
In fact, new scalable file formats such as NGFF are built along similar lines: they aggregate the meta info in separate files, store the data in chunks, and keep a multi-resolution pyramid. This enables a quick overview of a massive image by downloading only a low-resolution version of it. The same idea applies here: we want to support a quick overview of our samples without touching the actual sample data.
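That separation can be mimicked with a tiny index: the meta info lives in one small structure, the heavy chunks elsewhere, so an overview never touches the pixel data. The layout below is a made-up illustration of the idea, not the actual NGFF metadata format:

```python
from typing import Dict

# Hypothetical dataset index: the only thing a viewer needs to download
# for an overview. Paths and shapes are illustrative.
index: Dict = {
    "levels": [
        {"path": "0", "shape": [8192, 8192]},  # full resolution
        {"path": "1", "shape": [4096, 4096]},
        {"path": "2", "shape": [512, 512]},    # cheap preview level
    ],
}

def preview_level(index: Dict, max_side: int = 1024) -> Dict:
    """Pick the coarsest pyramid level that fits the preview budget."""
    for level in reversed(index["levels"]):
        if max(level["shape"]) <= max_side:
            return level
    # Nothing small enough: fall back to the coarsest level available.
    return index["levels"][-1]
```

A viewer would fetch only `index` plus the chunks of the chosen level, never the full-resolution data.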
> I'm not really convinced that this is actually needed. It feels more like the dataset RDF could use a samples List[Sample] field to handle that task you just described. Wouldn't that be enough?
I agree with you here; what we need is indeed `List[Sample]`. But my point is that `Sample` is not yet defined, and I doubt we can define it easily, so I propose to stick with `Any` for now.
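The `List[Sample]`-with-`Any` compromise could look like this in a model definition (a sketch using stdlib dataclasses with illustrative fields, not the actual spec code):

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DatasetRDF:
    # Minimal, illustrative subset of a dataset RDF.
    name: str
    type: str = "dataset"
    # Sample is deliberately left as Any until its definition has been
    # battle-tested; anything that round-trips through YAML is accepted.
    samples: List[Any] = field(default_factory=list)

rdf = DatasetRDF(
    name="2CH test image",
    samples=[{"name": "I1(COS)_CH", "files": ["data.smlm"]}],
)
```

Once real-world usage converges, `Any` can be tightened into a proper `Sample` type without breaking existing files.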
It's been on the horizon for a while now: it seems we need a specialized dataset RDF.
To be discussed:
To prevent this discussion from growing out of proportion, I would like to suggest that we open separate issues (that can be discussed with a subset of people) for specific aspects emerging here, so that we can keep an overview. I will also try to keep a summary of what has been discussed and/or decided here:
- `RDF.attachments`: #148