Field for training data

akreshuk commented 2 years ago

It seems like we lost the dedicated field for the model training dataset and it is now absorbed into 'links'. Some considerations:

we claim traceability/reproducibility as one of the main advantages of using our common format
a link to the training data should be human-readable, everyone should understand what they are dealing with.

If it's a link to bioimage.io, it has to be the full link, but it can also link to zenodo or even to the original paper. I would even keep the "links" field with the direct bioimage.io link and still make a separate field like 'training_data' which takes a list of full links. Any opinions?

FynnBe commented 2 years ago

Maybe keeping it limited to links within bioimage.io is a good idea. We have no control over arbitrary external links and any such external dataset link could easily be 'wrapped' as a bioimage.io dataset RDF making it more discoverable, avoiding duplicates, etc...

akreshuk commented 2 years ago

Then let's have it as a full link, not the one with ilastik/bla like format. My reasoning would be that someone just looking through the yaml file should be able to know where the training data has been and it looks more understandable as a full link. Still, even just having it as a separate field would already make it better.

FynnBe commented 2 years ago

I think the end user is not going to the yaml much... we could also comment the links field in any rdf pointing out that these are in fact bioimageio resource ids... or mention this fact more often in our docs. (note: the final RDF we show on the website is generated and lives in the collection repo, thus adding comments, etc there is easy)

Why so reluctant with new specialized fields? The problem with the separate field is that we did/will come up with a lot of specific fields and they all need attention in the backend and different ways of visualizing in the frontend, etc...

oeway commented 2 years ago

Well, this is not how links is originally defined, the links should be the id to another item within bioimageio, you can think this is more for creating internal connections within bioimage.io.

There are other complications for rendering the links if we try to allow arbitrary links, e.g. we will have to display every link as the same generic link icon instead of a dedicated icon for each link (e.g. ilastik icon, HPA icon). We cannot easily generate a dedicated icon for each link for various reasons.

Keep in mind that the end-user of BMZ will only see the card, most likely, only software and developers read the yaml file.

It seems like what you are looking for should be a section in the README file. Otherwise, we need more strict definition on how such dataset links should be specified and also consider how we can actually consume the data, not just linking to a website. Ideally, a link to a dataset should allow our core library automatically download the data and run models on it. To achieve that, we will likely need to constrain it as a dedicated dataset item to be able to specify how to read the dataset. It also means, I would rather stick with the current way of specifying dataset (i.e. with the RDF id) (no arbitrary URLs). The point is, linking to another website or zenodo deposit won't automatically allow us to reproduce the model. We can potentially define other fields dedicated for datasets, however, not sure one field will do that, so I would rather use the links and define that in another dataset RDF.

(I also agree the points mad by @FynnBe )

constantinpape commented 2 years ago

Keep in mind that the end-user of BMZ will only see the card, most likely, only software and developers read the yaml file.

While I think that this is true. @akreshuk also has an important point here: the idea is to document on which data the model was trained and do this in a way that is explicit (irrespective of who will read the rdf.yaml). The problem is that currently we only store this implicitly in the links, without making apparent which of the links the data was trained on. (The implicit assumption is that a link that points to a dataset resource is the training data, but this is not very clear, because one needs to check each resource on the website, there could be multiple datasets).

Otherwise, we need more strict definition on how such dataset links should be specified and also consider how we can actually consume the data, not just linking to a website. Ideally, a link to a dataset should allow our core library automatically download the data and run models on it. To achieve that, we will likely need to constrain it as a dedicated dataset item to be able to specify how to read the dataset. It also means, I would rather stick with the current way of specifying dataset (i.e. with the RDF id) (no arbitrary URLs). The point is, linking to another website or zenodo deposit won't automatically allow us to reproduce the model. We can potentially define other fields dedicated for datasets, however, not sure one field will do that, so I would rather use the links and define that in another dataset RDF.

I also agree that this is a nice vision, but something that is not feasible in the current scope and something that should indeed be specified in the dataset RDF; hence we should somehow keep the link logic.

I see two options: a) either we completely stick with the link logic, but make more explicit that some of the links are for the training data

training_data: ["ilastik/livecell_data"] # links to training datasets that live on bioimageio. This can be directly inserted into the normal links to generate stuff on the card

or b) we go for a more sophisticated option that can also deal with resources that are not registed bioimage.io:

training_data:
  bioimageio_links: ["ilastik/livecell_data"]  # link to training datasets that live on bioimageio go here. This can be directly inserted into the normal links to generate stuff on the card
  # and optionally if we don't have bioimageio_links we could have other stuff to describe the dataset
  source: ...   # url to the dataset source
  citation: ...  # same as in the model format
  ....

I would personally prefer option a) for the reasons that @FynnBe and @oeway have laid out. This also increases the incentive to register datasets with bioimage.io; which however means that we need a good process for contributing these, as discussed with @esgomezm during the last meeting.

FynnBe commented 2 years ago

option c): more documented links in general

links: 
  - id: "ilastik/livecell_data"
    description: training data
  - id: "ilastik/livecell_data2"
    description: validation data

edit: for FUTURE: this might then later also be tagged such that we could rerun training programatically by looking for links with certain attributes... e.g. tag: training_data in additon to the description...

constantinpape commented 2 years ago

option c): more documented links in general

That's also a nice idea, but not quite as explicit about the training data, because we cannot fully rely on standardized description names.

EDIT: ok, if you add more explicit tags that would be solved.

FynnBe commented 2 years ago

option c): more documented links in general

That's also a nice idea, but not quite as explicit about the training data, because we cannot fully rely on standardized description names.

yes.. see my edit: we could formalize this in the future...

akreshuk commented 2 years ago

I would still rather have a separate field. The format shouldn't be 100% tied to the website. Imagine some facility person training a network for their friends and making it compatible through the python library - without uploading to the website in the beginning. They should already be able to specify the training set. @oeway , I don't think we need the full complexity with downloading the datasets programatically, etc, but I want to make a strong point that it's important to specify training data. We can't make it a required field, but it should at least not look less important than others. And eventually we could indeed enforce it.

constantinpape commented 2 years ago

I would still rather have a separate field. The format shouldn't be 100% tied to the website. Imagine some facility person training a network for their friends and making it compatible through the python library - without uploading to the website in the beginning. They should already be able to specify the training set.

I think an ok compromise for this would be something like option b) I suggested above, but with only a few fields:

training_data:
  description: "Livecell training data"  # always good to have a description
  bioimageio_links: [...]  # links on the bioimageio website like "ilastik/livecell_data", can be inserted into the links when rendering the card
  external_links: [...]   # general urls for training data that is not on bioimageio yet

And we would need to think a bit about the relation of the two link fields: is it allowed to give both? What does it mean if both are given?

akreshuk commented 2 years ago

In principle, we could see for ourselves if a link leads to bioimage.io or not, so maybe we don't need both fields. On the other hand, if we allow for more than one training set (and why not?), then we won't be able to provide both links without having two fields.

constantinpape commented 2 years ago

n principle, we could see for ourselves if a link leads to bioimage.io or not, so maybe we don't need both fields.

Yes, that's true. But it may be simpler to keep it separated for rendering things on the website cleanly and not relying on runtime checks whether something is a bioimage.io resource or not (e.g. if there is a typo, or temporarily unavailable it's unclear whether it should be a url or a modelzoo id). Anyway, I wouldn't have strong opinions on this and would leave it up to @oeway (if we decide to go for this option).

On the other hand, if we allow for more than one training set (and why not?), then we won't be able to provide both links without having two fields.

I would propose to have lists for these field(s) in any case, so this could also be achieved with a single field (please note the []s in https://github.com/bioimage-io/spec-bioimage-io/issues/391#issuecomment-1039532948)

constantinpape commented 2 years ago

We had some discussions about this now, and think that we have a good solution: we introduce the training_data as a new field, which contains a list of datasets (that the model was trained on). Here dataset can be two things:

an id that links to an existing bioimage.io resource (as we currently have it in links)
a description of the training data that follows the dataset rdf specification

The first case is the preferred option where the dataset is registered in the bioimage.io ecosystem already, and it should be compatible with the current website design; the id can just be appended to the links. The second case is also important though to enable developers to specify their training data if it is not in the bioimage.io ecosystem yet, either because the data is not registered with it yet or still private.

Example using ids

training_data:
  - id: "ilastik/livecell_data"  # first training data is a dataset provided by ilastik as a community partner
  - id: "10.5739/zenodo.123456"  # second training data set is  available via zenodo, this must have a valid rdf

Example using rdf fields

training_data:
  - source: "https:/nmeth.com/livecell"  # the source of this publication, e.g. url or even a local filepath for non-published data
    citaton: ...  # optional citation info
    authors: ...  # optional authors info
    description: The awesome dataset
    type: dataset  # required for a dataset description

The two types could ofc also be mixed, i.e. one dataset with id, the other with the rdf content. We should also make use of this opportunity to introduce an extra category for the dataset rdf (which for now only mandates type: dataset).

What do you think @oeway?

esgomezm commented 2 years ago

Hi there!

@akreshuk I fully agree with the point of specifying training datasets and even with explicitly saying that it is the training dataset and not only dataset, which can be interpreted in many different ways. However, as you can suspect, I'm always hesitant about adding yet another new field to the RDF. Seems that we will never get to the point of freezing a version and have all the consumers synchronise with it. Hence, if really needed it, I would go for the simplest solution by now, which seems to be the first one if I'm not wrong:

training_data:
  - id: "ilastik/livecell_data"  # first training data is a dataset provided by ilastik as a community partner
  - id: "10.5739/zenodo.123456"  # second training data set is  available via zenodo, this must have a valid rdf

Yet, dealing with training data properly and defining some standards to exploit it in the BMZ (card rendering, using the data online for training, inference or whatever interaction, even cross-compatibility of training schedules and so on) feels like opening the Pandora box. So it may be convenient to make this field simple and not spend too much time now on it as it will really need dedicated attention (+ many updates) in the future.

FynnBe commented 2 years ago

The consumers could fully ignore this field for now as, at this point, it serves documentation purposes only. For a dataset RDF nothing would change for now, we cannot use the data specified there programmatically at the moment; this would remain the vision for the future.

constantinpape commented 2 years ago

Hence, if really needed it, I would go for the simplest solution by now, which seems to be the first one if I'm not wrong:

We have discussed this, but came to the conclusion that we think that only allowing the ids is too limiting for now, because it means that datasets must be submitted to bioimageio, which has a couple of big drawbacks:

the process for submitting datasets to bioimage.io is still not quite clear (see discussion in the last meeting)
coupling model and dataset upload makes the whole process even more difficult for developers
it tightly couples the model description to the website (in the sense that we can only refer to training data registered with the website); this is something we want to avoid to argue that our model format could be adopted without the need to go through the website
there may be data that cannot be registered with the model zoo (either because it's still private or the authors explicitly object to it)

For these reasons we want to offer the second option of specifying the rdf content directly in the model, so that data that's not available via bioimage.io (yet) can be referenced.

Seems that we will never get to the point of freezing a version and have all the consumers synchronise with it.

On the consumer side I don't see any big issues with this field; it can be ignored for the purposes of running a model. For model export there is only the bioimageio.core functionality afaik, which we will update to support this. (Afaik the deepImageJ model exporter is currently not working; once updated it would be nice if it supports this, but could also ignore it for the time being (the field will be optional)).

(card rendering, using the data online for training, inference or whatever interaction, even cross-compatibility of training schedules and so on) feels like opening the Pandora box.

We agree that we don't want to open the points about online training, training schedules etc. now; this will be left to figure out for the future once there is explicit funding to work on this. For card rendering there is a pretty simple solution:

datasets with id will be treated just like links
datasets with the rdf content can be ignored when rendering the card for now (would be nice to also render this information eventually, but we don't need to tackle this problem right now)

esgomezm commented 2 years ago

The consumers could fully ignore this field for now as, at this point, it serves documentation purposes only. For a dataset RDF nothing would change for now, we cannot use the data specified there programmatically at the moment; this would remain the vision for the future.

Not sure about this. For example, ZeroCostDL4Mic's most important functionality is training. If this new field is available, I think it would be strongly recommendable for this consumer to fill this field for example.

On the consumer side I don't see any big issues with this field; it can be ignored for the purposes of running a model.

In the case of deepImageJ is not exactly like this. We changed the plugins interface to ignore unknown fields and avoid problems with new RDF version at the expense of not rendering the information in a polite and easily readable manner. The plugin checks that the RDF yaml has the correct format according to its version. If there's a new field and the version number has not changed, then it will output an error. So then, we come back to the point of not freezing the RDF version.

datasets with the rdf content can be ignored when rendering the card for now (would be nice to also render this information eventually, but we don't need to tackle this problem right now)

Ok but then, I would maybe ask the developer to copy this information in the README.md as well. I guess that if they decide to publish the model in the BMZ and they've added this info, they would also like it to be visible somewhere.

constantinpape commented 2 years ago

Not sure about this. For example, ZeroCostDL4Mic's most important functionality is training. If this new field is available, I think it would be strongly recommendable for this consumer to fill this field for example.

Yes, that makes sense. But it may be a good opportunity to tackle this in zero cost, since right now there is no training data descriptions in the uploads we have, e.g. in https://bioimage.io/#/?tags=&id=10.5281%2Fzenodo.5749843. And, as I said, this would be supported in bioimageio.core.build_model.

In the case of deepImageJ is not exactly like this. We changed the plugins interface to ignore unknown fields and avoid problems with new RDF version at the expense of not rendering the information in a polite and easily readable manner. The plugin checks that the RDF yaml has the correct format according to its version. If there's a new field and the version number has not changed, then it will output an error. So then, we come back to the point of not freezing the RDF version.

We would of course bump the version if we include training_data as a new field. But I understand that this would mean that deepimageJ needs to be updated. Same would probably be true for https://github.com/bioimage-io/core-bioimage-io-java. But these changes should be relatively easy since it's just to acknowledge that a new field is there without the need to parse it further. (And btw, this is another reason to start refactoring this to a central dependency in java rather than implementing it separately in every software.) In any case, I think this argument only pertains to whether we introduce a new field at all or not; once we decide to introduce it all consumers would need to make a small change to support it, regardless of whether we have support only for id or also for the full rdf content.

Ok but then, I would maybe ask the developer to copy this information in the README.md as well. I guess that if they decide to publish the model in the BMZ and they've added this info, they would also like it to be visible somewhere.

Yes, I agree that this should be part of the checklist when accepting models.

akreshuk commented 2 years ago

@esgomezm , I fully appreciate your point about the delays and such. However, if we don't do it, we lose one of the biggest arguments in favor of our approach: the traceability and reproducibility of DL models. I know it can be provided through links, etc, but it really doesn't look central enough without a specific field. In any case, we must have training data specified for all models we have there now - I can post-add it at the point of merging, from what I can see it's all trained on public datasets. But we have to set the right examples here.

oeway commented 2 years ago

Hi all, sorry for being late for the party. A rough browsing through the thread, it seems we have come to a solution now. Here is what I think:

My initial thought is that adding a new field without a concrete plan on how to make the model reproducible entirely via, e.g. the core library seems not very useful, given we can always provide recommendations that a README file should contain a section with detailed description of the training data and training procedure, or linking to a Github repo with complete reproducible code.
A training_data field would make sense. I understand @akreshuk 's point about the strength of having a dedicated field would put more emphasis on our effort in encouraging reproducibility. Having a training_data field will definitely do that and make it easier to tell the story.
It's hard to use a URL/link to describe training data in practice, but a dataset RDF might do if we design it properly. Think it a bit more, especially when we try to imagine how our HPA model could use the training_data field. Without considering the exact data sequence feed into the model while training, even only providing the entire dataset is hard in practice. For HPA images for example, we have ~12TB images hosted on the HPA website and IDR, we cannot easily provide a link that will give the entire set or a subset of data. We will need a free text README file with some scripts to describe how one can download the dataset and how we divided the traning/validation/test set. Even if we assume all the data are uploaded to a dedicated zenodo, a link to the dataset is not enough since multiple model might share the same big set and only use a subset for training. Nevertheless, if we consider all these different scenario in the future design of dataset RDF, and also combine with the NGFF/Zarr with kerchunk-like virtual store options, we will have a good chance to address these challenges. But of course, this is not the priority right now.

Overall, I think the proposal described by @constantinpape with id or RDF fields should work for now, and can be easily extended in the future. We will be able to also generate dataset RDF from model RDF and render as separate cards. However, it might also introduce some extra issues when sharing the same dataset across models (which should be rather common), e.g. the exact same dataset might described slightly differently within different models and prevent us from identifying the same dataset (e.g. for benchmarking models against the same models).

One suggestion though, I think we can make training_data a single id/RDF instead of a list. I understand the point of allowing a list since the model might be trained on multiple datasets. On the one hand, I think a list is not entirely accurate, it should be a set without order, since a standard training procedure will likely shuffle the samples from multiple dataset. On the other hand, I think in those cases, we can always describe a set of datasets as a single new dataset (even include description on how to do sampling during training in the future), meaning in the dataset RDF spec, we should consider building a new dataset from an existing list of other datasets. For now I would use a single id or RDF, and inside the RDF, we can use for example a collection key to include other dataset. This design will also make it easier for us to identify whether two models using the exact dataset + sampling scheme etc in the future.

Anyway, we also discuss it during the weekly meeting.

constantinpape commented 2 years ago

Thanks for your comments @oeway and I think we can reach a consensus on this fairly soon given your comment.

Overall, I think the proposal described by @constantinpape with id or RDF fields should work for now, and can be easily extended in the future.

Great! Just to assign proper credit, this proposal comes from @FynnBe, I just wrote it up after we discussed this together with @akreshuk and @k-dominik.

It's hard to use a URL/link to describe training data in practice, but a dataset RDF might do if we design it properly.

Just briefly, we all agree that the current proposal is not sufficient to fully capture the training process. We all want this to happen eventually, but it is not feasible with out current resources. So adding trainig_data is aspirational, but already helps to document the most important part of which dataset the model was trained on (indeed without being able to document exact subparts of the data, splits etc.; this needs to happen in the README and/or by linking the training scripts).

However, it might also introduce some extra issues when sharing the same dataset across models (which should be rather common), e.g. the exact same dataset might described slightly differently within different models and prevent us from identifying the same dataset (e.g. for benchmarking models against the same models).

Yes, we have also discussed this point and agree that it's a potential issue. For now, the best solution we came up with is to encourage converting datasets stored as RDF to a modelzoo resource and then switching to id when a developer submits the model. (And also check if there is a resource already that describes the data.)

One suggestion though, I think we can make training_data a single id/RDF instead of a list. I understand the point of allowing a list since the model might be trained on multiple datasets. On the one hand, I think a list is not entirely accurate, it should be a set without order, since a standard training procedure will likely shuffle the samples from multiple dataset.

I agree that it should be a set in principle, but afaik this concept does not exist in yaml. I think that a list is simpler than a single entry in practice: take for example a model that uses 2 datasets already available via id. It's straight forward to put this into a list, rather than constructing some extra object that identifies the union of two datasets. We can document that the list does not imply an order of the datasets.

Anyway, we also discuss it during the weekly meeting.

Sounds good. I think we can go ahead with specifing the dataset RDF (that will probably be very similar to the general rdf in the first iteration) and discuss the proposal and remaining questions like list vs. single item on Friday.

FynnBe commented 2 years ago

I agree that it should be a set in principle, but afaik this concept does not exist in yaml. I think that a list is simpler than a single entry in practice: take for example a model that uses 2 datasets already available via id. It's straight forward to put this into a list, rather than constructing some extra object that identifies the union of two datasets. We can document that the list does not imply an order of the datasets.

the only other alternative would be a dict, keys could be short descriptions of what role the linked dataset played in the training process: "validation data", etc... we could formalize this more in the future. We could always convert it back to a "random list" if we decide against the dict in the future.

oeway commented 2 years ago

I think that a list is simpler than a single entry in practice: take for example a model that uses 2 datasets already available via id. It's straight forward to put this into a list, rather than constructing some extra object that identifies the union of two datasets. We can document that the list does not imply an order of the datasets.

Not only, the set issue, I think the major drawback of making a list is that we won't be able to describe how we combine the datasets in the list to make the new datasets. Considering two cases: a) pretrained on dataset 1 then train on dataset 2; b) randomly draw sample from dataset 1 and 2. A list won't allow us to express both cases. And If we want to take 50% from dataset 1 and another 50% from dataset 2, we won't be able to do it easily with a list.

Considering sharing data across dataset is rather common, and we don't want to duplicate the storage, or re-upload an entire dataset just because of the difference in a few samples. This means we will have to support create new dataset from a set of existing dataset anyway.

Making it a single dataset RDF will be more expressive than a list and make sure we can easily cope with different cases in the future. And the rule is simple, if the model was trained with two datasets, you create a new composed dataset using the composition spec of dataset (to be defined).

FynnBe commented 2 years ago

... composition spec of dataset (to be defined).

this composition spec of datasets could also live in the training_data entry of the model RDF. (and wherever else we will have use for this composition of datasets). If it gets too complex we can still move it to its own RDF, but for now I would try to avoid spawning all sorts of explicit RDF types and develop it in the training_data field for now.

oeway commented 2 years ago

this composition spec of datasets could also live in the training_data entry of the model RDF. (and wherever else we will have use for this composition of datasets). If it gets too complex we can still move it to its own RDF, but for now I would try to avoid spawning all sorts of explicit RDF types and develop it in the training_data field for now.

It makes no difference if training_data only contains one dataset RDF. We won't need a separate training data spec, the dataset RDF will just do it. Trying avoid having multiple ways to do the same thing would be nice to avoid confusion and reduce maintenance effort.

And I would not make a different RDF type for different dataset cases,the type remains as dataset. But the spec will provide a few field to allow composition.

akreshuk commented 2 years ago

While I agree that it would be nice to be able to specify deep details about how the training datasets were combined, there is a limit to what we will be able to do in any case and the ultimate understanding/reproducibility is only possible with code. For me, the specification of the training dataset in the spec at its current level is more about being able to check that the network was indeed trained on the same structure of interest, at a comparable scale and in a sufficiently similar modality. I would therefore be happy if we could just specify a list/set/dict/collection/whatever and rely on the README or even network source code for details such as 90% from the first dataset and 10% from the second. I also agree that it would be nice to make our future lives easier when it comes to recognizing the same dataset in different models, but I find that having to create a dataset RDF before you even think about submitting the model is too much to ask of contributors. We will have developer services in the future, where such things will be introduced more gently, with human help.

constantinpape commented 2 years ago

I agree with @akreshuk here, let us not over-complicate things and focus on what we can achieve right now without going down the rabbit hole of specifying the full training procdure again. So I think we have two options:

a list of dataset_rdf_content or id
a dict of dataset_rdf_content or id

I think @oeway has made some good points for making it a dict; while it's not possible to prescribe the complete training procedure in the keys, having the option to specify things such as training, validation or unlabeled_pretraining could already be quite helpful. So I am ok with making this a dict.

FynnBe commented 2 years ago

👍 for dict as this is already implemented 😁 (#394) then we can examine what we/others will fill in those keys and let that inform the next step of formalizing it.

esgomezm commented 2 years ago

Hey, I do also agree with @akreshuk and @constantinpape on trying to make it simpler. For dict, would it imply defining new fields and determining what's their exact meaning? or is it just to get a naming for each dataset that allows mentioning them in the README?

oeway commented 2 years ago

Hi all, thanks for your response!

To be clear, I also want a simpler solution, but also future proof. I want our spec to be more extendable and easily upgraded with backward compatibility.

I also agree that it would be nice to make our future lives easier when it comes to recognizing the same dataset in different models, but I find that having to create a dataset RDF before you even think about submitting the model is too much to ask of contributors. We will have developer services in the future, where such things will be introduced more gently, with human help.

No, that's not what I meant, in all cases we are discussing here you don't need to create a dataset RDF before you submit, you create it on-the-fly inside training_data field. @akreshuk

Let's assume we want to make a composed dataset for the data science bowl nuclei segmentation with images in two modalities, I think it's a difference between the following two ways of specifying the training_data:

Here is the simple list case:

training_data:
    - fluorescence_nuclei_dataset # id or full RDF
    - HE_nuclei_dataset # id or full RDF

And here is what I propose:

training_data: # it's a single RDF and the minimal general RDF contains only 4 fields
  type: dataset
  name: DAPI IF and H&E nuclei
  description: a composed dataset with both fluorescence and H&E staining images for nuclei segmentation
  format_version: 0.2.2
  collection:
    - fluorescence_nuclei_dataset # id or full RDF
    - HE_nuclei_dataset # id or full RDF
  # We can extend the keys in the future to specify how we want to combine the dataset
  # Example:
  mixing_ratio:
   - 20%
   - 50%

If we save the above dataset into a separate RDF, then the entire thing becomes:

traning_data: composed_dapi_he_dataset

esgomezm commented 2 years ago

If we save the above dataset into a separate RDF

Do you mean in a separate RDF of type dataset?

@oeway also, in these examples where would you introduce the link/URL to the dataset in case it's needed? This was one of the reasons why the discussion started

oeway commented 2 years ago

Do you mean in a separate RDF of type dataset?

Yes, if you save it and submit it as a dataset RDF, basically copy the above content and simply store it as a RDF file, and submit to zenodo. Again, we don't have to do it, we support both on-the-fly creation in the model RDF and offline creating of dataset (submitted as a dataset RDF before submitting the model).

@oeway also, in these examples where would you introduce the link/URL to the dataset in case it's needed? This was one of the reasons why the discussion started

Oh, traning_data: composed_dapi_he_dataset (a more concrete example would be traning_data: zero/composed_dapi_he_dataset if we store it as a zerocost dataset) is the dataset link. One model one dataset, we define the composition logic inside the dataset, such that we only need to link to one dataset here for every model.

constantinpape commented 2 years ago

I understand your proposal now @oeway; I have two comments:

I am not sure if what you propose should be type: dataset. Technically it's a collection of datasets. I know that you can semantically argue that a colloection of datasets is again a dataset, but from a spec point of view that would introduce an "if/else" for type: dataset that should be avoided if possible from my point of view.
I think that we have to allow both id: HE_nuclei_dataset as well as source: https:/my_nuclei.com, cite: ... (i.e. the dataset rdf contents) to be in line with the requirements we laid out above. (And this adds another reason for having this as a different type, otherwise we would allow deep nesting...)

FynnBe commented 2 years ago

incorporating parts of @oeway proposal and everyones comments into my idea and explaining what I mean by "develop the composed-dataset in the model RDF.

First of all, how about (juts for now) naming what we talk about composed-dataset type. The vision is somewhat clear: We will have a composed-dataset RDF

training_data: # it's a single RDF and the minimal general RDF contains only 4 fields
  type: composed-dataset  # implicit for now
  name: DAPI IF and H&E nuclei
  description: a composed dataset with both fluorescence and H&E staining images for nuclei segmentation
  format_version: 0.2.2  # implicit for now
  datasets: 
    - ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: ..
      source: https://my-data....
  # We can extend the keys in the future to specify how we want to combine the dataset
  # Example:
  # mixing_ratio:
  #   - 0.2
  #   - 0.5

training_data: # it's a single RDF and the minimal general RDF contains only 4 fields
  type: dataset  # implicit for now
  name: DAPI IF dataset
  description: another dataset
  format_version: 0.2.2  # implicit for now
  source: https://...mydata.hdf5

for now I suggest to only define the composed dataset in the context of the model RDF, once it matured a little we can make it a dedicated composed-dataset RDF... (this is mostly just to save a little on dev time)

FynnBe commented 2 years ago

summing up from today's meeting:

we add a composed-dataset RDF, which differs from the general RDF by:
- an additional mandatory field: datasets listing dataset id's or an in-place dataset RDF
in model RDF training_data has to be a bioimageio id or an in-place definition of either a dataset RDF or a composed-dataset RDF

oeway commented 2 years ago

Hey, thanks for the summary.

However, thinking it a bit further in the implementation of the website or a software that uses the training_data field, I doubt whether the best way is to split into two RDF types. Sorry about this, but I would like to kindly ask you to revisit this decision, and provide a slightly changed proposal to fix this.

For the website, we now support 3 types of RDFs: model, dataset, application (including bioengine apps and notebooks etc.) which are rather distinctive. If now we add a 4-th RDF type called composed_dataset, it corrupts the distinction (or we might want to go back to add notebook_application, bioengine_application etc). What even worse is that this will break our grouping logic for the tabs which assumes one RDF type one tab.

According to @constantinpape , the arguments for making two RDF types (dataset and composed_rdf) instead of one with two modes are:

I am not sure if what you propose should be type: dataset. Technically it's a collection of datasets. I know that you can semantically argue that a colloection of datasets is again a dataset, but from a spec point of view that would introduce an "if/else" for type: dataset that should be avoided if possible from my point of view.

I think that we have to allow both id: HE_nuclei_dataset as well as source: https:/my_nuclei.com, cite: ... (i.e. the dataset rdf contents) to be in line with the requirements we laid out above. (And this adds another reason for having this as a different type, otherwise we would allow deep nesting...)

For the if/else logic point, I think making two RDF types doesn't really remove the if/else logic entirely, we basically defer it to a higher level. For example, instead of checking whether it's a dataset RDF, we now need to check whether it's a dataset or composed_dataset. For the website, the datasets tab will display two rdf types which will create a lot of hassle to implement because we assume one tab is one RDF type. And everywhere involves datasets will need to add the dataset or composed_dataset logic, even if we don't really try to understand or consume the dataset in the website.

For the recursion/nesting issue, we can implement that also in a single RDF, e.g. by introducing something like is_composed: true, or mode: composed, and then for the composed dataset case, we say that we do not allow nesting another composed dataset under the datasets key. I would go for mode: composed akin to the run_mode in the model spec.

We have been doing this type of conditional spec for model RDF too, e.g. with different weights formats, pre-post processing etc. So having a if/else condition should not be prohibited in the spec.

Here comes my proposal:

For composed dataset:

training_data: # it's a single RDF and the minimal general RDF contains only 4 fields
  type: dataset
  name: DAPI IF and H&E nuclei
  description: a composed dataset with both fluorescence and H&E staining images for nuclei segmentation
  format_version: 0.2.2
  mode: composed # <-------------this is the way to say this is a composed dataset
  datasets: 
    - ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....

For single dataset:

training_data: # it's a single RDF and the minimal general RDF contains only 4 fields
  type: dataset
  name: DAPI IF dataset
  description: another dataset
  format_version: 0.2.2
  source: https://...mydata.hdf5
  # mode: single # The mode key is optional for a single dataset, and the `datasets` key should not be allowed if mode is not `composed`

Fundamentally, they are just two different ways to provide data, it will be more compact to define them under a single dataset spec. And for the website (or other potential consumer software), it will be much easier to implement and more consistent with what we have right now.

edit: removed the "# implicit for now" comments

FynnBe commented 2 years ago

I'd be fine with this and would consider simply making source and datasets exclusive and dropping the mode field. under datasets only datasets with source would be allowed, avoiding infinite nesting....

FynnBe commented 2 years ago

thinking of the future of this format I think we would get conflicts when adding additional fields that are mode specific, e.g. something like the aforementioned mixing_ratio would not apply to source, etc... Thus is we do go down the one dataset RDF route, I'd suggest to extend the source/datasets fields, or better yet, as these fields are the point where the two variants of the dataset divide add two exclusive fields original and composed, which hold source and (original_)datasets respectively:

an original dataset:

type: dataset
name: DAPI IF dataset
description: another dataset
format_version: 0.2.2
original:
  source: https://...mydata.hdf5
  # here is space for future fields specific to orignal datasets! yay!

a composed dataset:

type: dataset
name: DAPI IF dataset
description: another dataset
format_version: 0.2.2
composed:
  original_datasets: 
    - ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....
  # here is space for future fields specific to composed datasets! yay!

Instead of composed we could use derived (which would include cases like a subset of another dataset more naturally)

Pros of this approach:

a dataset is a dataset, some if/else can be dropped
reduced copy paste in dev as the variants certainly share some fields
slightly more elegant (taste?)?

Cons:

a dataset can be one of two variants, some if/else are needed
changes to one variant bump the format version of the other (because there is only one shared format version)
less explicit than to separate RDF specs

oeway commented 2 years ago

Thanks for looking into this @FynnBe

I don't have strong opinion beyond making it as a single dataset RDF.

However, I think even with exclusive keys, we should not drop the mode key. The reason is that it won't be straightforward to determine whether a dataset is a composed dataset or not. Imagine that we will have to do if 'datasets' in rdf, this is too implicit and not nice as if rdf.get('mode') == 'composed' in terms of readability.

Besides that, I have no strong opinion about your new proposal, but I slightly in favor of not having too deep nesting, so I would vote for using source for the single dataset, and datasets for the other case.

Something like this:

type: dataset
name: DAPI IF dataset
description: another dataset
format_version: 0.2.2
mode: composed
datasets:
    - ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....

Also, do we have a better name for the non-composed dataset, original seems suggest more than single dataset. Let's say if a dataset is transformed somehow, or messed around by someone, is it still "original"? I would go for single or something else,

constantinpape commented 2 years ago

I feel like we are starting to over-engineer again. And as soon as we allow linking to other ids in a new dataset type we allow for recursion (even having a different type for composed datasets does not help much here for checking it statically, so I agree in so far that this does not offer a solution). I would personally feel more comfortable with a very simple solution for now. Either we just allow for a single dataset per model, or choose one of the dict or list options.

If you really want to go ahead with the more complex dataset type here are my thoughts:

a single type: dataset would be ok, as long as nesting of composed datasets is not allowed (although we cannot check this statically, since there can be other ids in datasets, which might very well be composed datasets. (for the same reason having a composed_dataset type also wouldn't really help)).
I do prefer the proposal by @FynnBe in https://github.com/bioimage-io/spec-bioimage-io/issues/391#issuecomment-1044861968 and I think the benefits of this modular design weight out the readability concerns regarding mode.
I agree with @oeway that single sounds better than original here

Finally, I would change from

composed:
  original_datasets: 
    - ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....

to

composed:
  datasets: 
    - id: ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....

i.e. shorten from original_datasets (or single_datasets) to datasets and change from ilastik/fluorescence_nuclei_dataset to id: ilastik/fluorescence_nuclei_dataset to make life easier for representing this e.g. in java.

oeway commented 2 years ago

I feel like we are starting to over-engineer again.

Sorry for starting the conversation. But if you agree with what we decided during the meeting, changing from two RDF types into one type plus two exclusive field should not increase the complexity of the spec, instead it simplifies the implementation as I explains. Not sure this make it look more over-engineering.

I would personally feel more comfortable with a very simple solution for now. Either we just allow for a single dataset per model, or choose one of the dict or list options.

I actually agree with this, the reason is that perhaps 99% of the model will only have one dataset. So we are actually spend lots of time debating something that are not so useful.

To not trash what we have made so far, I think we should only wrap the keys for the case of composed dataset, stick with the source key for the single dataset case.

For single dataset, we do this:

type: dataset
name: DAPI IF dataset
description: another dataset
format_version: 0.2.2
source: https://...mydata.hdf5

Note our existing datasets are already in this format, so it's also backward compatible. I would not wrap source key further as @FynnBe proposed.

For the composed dataset, I am fine with what @FynnBe and @constantinpape proposed to wrap it under composed:

type: dataset
name: DAPI IF and H&E dataset
description: a composed dataset ...
format_version: 0.2.2
composed:
  datasets: 
    - id: ilastik/fluorescence_nuclei_dataset  # id or full RDF
    - name: H&E nuclei dataset
      source: https://my-data....

I think the benefits of this modular design weight out the readability concerns regarding mode.

Well, to clarify, the additional key is not contradict with the modular design, my point is to make the dataset mode explicit, you can have a composed key and a mode key. Although being redundant, we won't need to infer the mode by looking for a particular key. You can think the same for the type key in all the RDFs, although we can potentially infer the model RDF by check the inputs/outputs, and bioengine apps by the source URL format, but it's much more easier and straightforward to compare the type key.

Nevertheless, adding the mode key is not a strong point, I am fine with not adding that, and the composed already looks obvious.

constantinpape commented 2 years ago

I think https://github.com/bioimage-io/spec-bioimage-io/issues/391#issuecomment-1045053402 is def. the best of the proposals for the compsed datasets so far. I would not add mode in this case, since it's redundant.

Just to explain my concern regarding references: In all the "composed dataset" proposals so far we have the problem that we either a) allow for referencing composed datasets resulting in nesting and recursion (which is bad, because it can lead to very complex datasets and even to infinite recursions) b) introduces constraints that can only be checked "at runtime" with all bioimage.io ids resolved (which may be very confusing for developers because their model will pass all local checks, but fail upon upload)

To explain this a bit more, the "extreme" case we want to avoid is the following:

# a composed dataset on bioimage.io
type: dataset
name: ComposedDatasetA
id: comp_ds_a
composed:
  datasets:
    - id: comp_ds_b
    ....

# another composed dataset
type: dataset
name: ComposedDatasetB
id: comp_ds_b
composed:
  datasets:
    - id: comp_ds_a
    ....

This is ofc a bit artificial, but shows the problem with composed datasets and referencing. Even without the issue of infinite recursion, this can lead to deep nesting which will be very confusing for understanding and rendering datasets etc.

Now to avoid this, we can say that a composed dataset may only have "simple" (i.e. single) datasets in the datasets list. The issue with this approach is that we cannot check it statically but only once all the bioimage.io ids are resolved (because only then we know whether a referenced dataset is simple or composed). And this is an issue that could potentially happen fairly often.

Solution b) (fail at runtime when referencing a composed dataset) is certainly preferable to a), but it's still not a very good design and I feel hesitant to introduce some new thing with a pretty obvious design flaw. This is why I would be more confident with a simpler solution for now, which is either only a single dataset per model or a list/dict of single datasets.

oeway commented 2 years ago

I think https://github.com/bioimage-io/spec-bioimage-io/issues/391#issuecomment-1045053402 is def. the best of the proposals for the compsed datasets so far. I would not add mode in this case, since it's redundant.

Good then!

@constantinpape I agree with your points, but we don't have to limit ourselves in doing static checks. We can always make http requests to pull the rdf.json file from the collection repo so we have all the summary fiels of the datasets. If we don't have internet, we skip the dynamic checks and fail back to the static check. Keep in mind most models will have a single dataset, and and there will be very few models has composed types. It will only become an issue if he/she 1) created a bad recursive composed dataset 2) don't have internet when running the spec validator. But once he/she want to submit the model, we will reject and detect the issue.

FynnBe commented 2 years ago

Solution b) (fail at runtime when referencing a composed dataset) is certainly preferable to a), but it's still not a very good design...

any referencing of any id needs runtime validation. the id could belong to a model or application and yet be referenced under training_data. This problem seems unavoidable, unless we prefix ids by type?! That opens a whole other pandora's box... We rely on runtime testing for models already, I wouldn't consider that a design flaw per se.

I'd be fine with an explicit mode (or maybe rather subtype?) field. It is trivial to statically validate. I would personally leave it out due to the redundancy, but I see the argument for making it more explicit. Anyhow, this field can easily be added or dropped as we like, so quick vote on that and let's move on with the juicier bits!

I'm happy you guys like the composed:dataset suggestion, I'd really like it if that structure was mirrored with single:source: A dataset with composed should not have source, because it is not single. So having single vs composed as the exclusive keys makes more sense to me and naturally allows the addition of future single fields. If the majority insists on source over single:source I'd be ok, hoping that we'll eventually land on single:source when adding that single:future_field ;-) additional note: We still might want a source field in a composed dataset, which would porbably just mean something different than single:source...

FynnBe commented 2 years ago

I propose to vote at this point for all the points we have no obvious consensus (👍 👀 👎 for yes, whatever, no):

let's use single:source rather than root:source

FynnBe commented 2 years ago

let's add an explicit mode (or similar, e.g. subtype) field

oeway commented 2 years ago

I'm happy you guys like the composed:dataset suggestion, I'd really like it if that structure was mirrored with single:source: A dataset with composed should not have source, because it is not single. So having single vs composed as the exclusive keys makes more sense to me and naturally allows the addition of future single fields. If the majority insists on source over single:source I'd be ok, hoping that we'll eventually land on single:source when adding that single:future_field ;-) additional note: We still might want a source field in a composed dataset, which would porbably just mean something different than single:source...

I appreciate the symmetrical, but I afraid for the user, the single: source key will look a bit counter intuitive. Unlike composed, when one see single you won't immediately know what we mean. And scarifying the simplicit of 99% of the datasets to in order to match a few composed models seems unnecessary. Plus we will need to upgrade all our existing datasets.

I understand that having source vs composed won't be too natural. But how about changing composed to composition. The difference is that we are actually define how does the composition look like. It seems to me that source vs composition will fit better.

constantinpape commented 2 years ago

@constantinpape I agree with your points, but we don't have to limit ourselves in doing static checks. We can always make http requests to pull the rdf.json file from the collection repo so we have all the summary fiels of the datasets. If we don't have internet, we skip the dynamic checks and fail back to the static check. Keep in mind most models will have a single dataset, and and there will be very few models has composed types. It will only become an issue if he/she 1) created a bad recursive composed dataset 2) don't have internet when running the spec validator. But once he/she want to submit the model, we will reject and detect the issue.

Yes, I agree with this mostly. The one thing that has me worried a bit: if the website becomes more widely adopted a pretty common use case might be to refer to several datasets of other models; and it will be invalid as soon as one of them is composed. So to understand what's going on this needs a fairly deep understanding of the spec and is much more complex than a simple error because an id does not exist. Anyway, I think this can be handled by raising informative errors etc. But it's something we would not need to worry about with a simpler spec. So I will not veto the current proposal from my side, but I think it would be good to have @akreshuk look at this proposal with a fresh pair of eyes before we go ahead with it.

For the open points: I'd vote not to include mode since it's redundant, and to keep the simple source structure in order to not confuse developers for the simple use case of adding a single dataset. But I don't have very strong opinions on either of this points.

FynnBe commented 2 years ago

But how about changing composed to composition

good idea 👍

Yes, I agree with this mostly. The one thing that has me worried a bit: if the website becomes more widely adopted a pretty common use case might be to refer to several datasets of other models; and it will be invalid as soon as one of them is composed. So to understand what's going on this needs a fairly deep understanding of the spec and is much more complex than a simple error because an id does not exist. Anyway, I think this can be handled by raising informative errors etc. But it's something we would not need to worry about with a simpler spec.

In this scenario a good error message should be fairly straigt-forward along the lines of "Nesting composition datasets is not allowed. You should list the individual datasets ('id1', 'id2', 'id3') instead of the composition 'idc'". The edge case of users starting to nest datasets in-place does not seem like a real issue...

constantinpape commented 2 years ago

Short update from discussion with @FynnBe and @oeway. We decided to go with the simplest possible solution for now: we just have a single dataset in training_data that can either be:

training_data:
  id: ilastik/livecell-data  # id of a resource on bioimageio

or

training_data:
  type: dataset
  source: https://my/traing/data.com   # can also be a local file path
  ...

The first option is preferred (will put this into the submission guidelines). This also means that we only support a single training dataset for now (if multiple are really needed a new dataset resource for this can be created).

The discussion about the composed type is still valuable and we may introduce this in the future, but for now we want to keep this part simple enough to move ahead quickly.

bioimage-io / spec-bioimage-io

Field for training data #391