[RDF] Make the spec for attachment less strict

oeway commented 3 years ago

The current spec only allows an URL identifier in attachment (contained in a list wrapped with a dictionary), for some cases it would be helpful to allow a dictionary in a list, e.g. for displaying a table in the info page with download link, name, description etc. it would be great if we can change this in the generic spec.

FynnBe commented 3 years ago

is this what you have in mind?

attachments:
  mytable: 
    - {colA: uri_A1, colB: uri_B1}
    - {colA: uri_A2, colB: uri_B2}

oeway commented 3 years ago

Yes! @FynnBe But it would be also good to allow any object to be stored in the array, not just a plain object with URIs, since it's about allowing different groups of any objects to be attached.

FynnBe commented 3 years ago

SGTM I would not allow further combinations like

attachments:
  mytable: 
    - {colA: [uri_A1a, uri_A1b], colB: {need: [to, go, deeper]}}

oeway commented 3 years ago

Well, it's not just a table, an attachment by definition should allow many different types, and we don't need to always render it. I would rather make the standard less restricted to make the general RDF more stable. The child RDF can define the types in attachments. E.g. the dataset RDF can use the attachment to contain files grouped into train, test. And the file definition can be a plain URL or some object that encode the offset info in a zarr object.

FynnBe commented 3 years ago

I think defining the dataset RDF should be a separate discussion (which we should have rather sooner than later, though!). As an example use case I would 'translate' your description into this example:

attachments:
    files:
        train: [uri1, uri2]
        test: uri3

(while ignoring the 'special zarr uri' for a moment)

Rather than just allowing any structure in attachments I would add structures as use cases for them pop up. A defined structure helps to keep the oversight (also as a user) and promotes reusability of any software dealing with it.

I would suggest that attachments is always a dict (as we rely on checking for the magic key "file", which we should document!) To cover the use cases discussed so far, we have (using python typing's Dict[key, value] notation): currently:

Dict[str, URI]
Dict[str, List[URI]]

in discussion:

Dict[str, List[Dict[str, URI]]] https://github.com/bioimage-io/spec-bioimage-io/issues/148#issuecomment-874064580
Dict[str, List[Dict[str, List|Dict]]] https://github.com/bioimage-io/spec-bioimage-io/issues/148#issuecomment-874117336)
Dict[str, Dict[str, URI]]] (example above)
Dict[str, Dict[str, List[URI]]]] (example above)

oeway commented 3 years ago

I think defining the dataset RDF should be a separate discussion (which we should have rather sooner than later, though!).

I have been experimenting different stuff for the dataset, but find the current general RDF is limited to express ideas. That's why I propose it here and mention the dataset.

For the attachments general RDF, I complete agree that we should define some sort of structure, but a basic structure for attachments without touching the details inside the object attached: Dict[str, List[Any]] It's up-to the child RDF to define what we do with the object format(either it's a URI or other stuff). It's simple and straightforward.

Further constrain on the type of the object would just complicate things or make our general RDF unstable if we had to adopt new cases.

Let me give you a real example here where I used general RDF store single molecule localization microscopy data samples.

Here I store the samples in a list, and each sample contains some files and some screenshots (views), what's important is that I needed to store the viewer parameters which the screenshot was taken so when the user click on the screenshot, we can recover the state in the viewer:

attachments:
  samples:
    - name: sample-1
      files:
        - name: data.smlm   #<---------------the actual data file, one sample can contain several files
          size: 14637650
          checksum: 'md5:c0c8e05a5df2103633f696ed1d10b17d'
      views:   #<---------------screenshots of the sample + viewer config
        - config:
            scaleX: 1
            scaleY: 1
            scaleZ: 1
            alpha 0: 0.85
            color 1:
              - 0
              - 255
              - 255
            alpha 1: 0.85
            files:
              - I1(COS)_CH2_clathrin.xls
              - I1(COS)_CH1_microtubules.xls
            viewer_type: window
          image_name:
            - screenshot-0_thumbnail.png
            - screenshot-0.png
          image: >-
            https://sandbox.zenodo.org/api/files/82d06ee2-bd73-4a9a-99e6-5e28739a9cee/Untitled
            Sample/screenshot-0_thumbnail.png
        - config:
            scaleX: 1
            scaleY: 1
            scaleZ: 1
            pointSize: 5
            distance: 4
            fov: 16
            pointSizeMin: 0
            pointSizeMax: 12
            'Total # of locations': 1152480
            x: 1
            'y': 1
            z: 1
            point size: 3
            x min: 0
            x max: 1
            y min: 0
            y max: 1
            z min: 0
            z max: 1
            active 0: true
            color 0:
              - 255
              - 28
              - 14
            alpha 0: 0.85
            active 1: true
            color 1:
              - 0
              - 255
              - 255
            alpha 1: 0.85
            Fps: 53
            files:
              - I1(COS)_CH2_clathrin.xls
              - I1(COS)_CH1_microtubules.xls
            viewer_type: window
          image_name:
            - screenshot-1_thumbnail.png
            - screenshot-1.png
          image: >-
            https://sandbox.zenodo.org/api/files/82d06ee2-bd73-4a9a-99e6-5e28739a9cee/Untitled
            Sample/screenshot-1_thumbnail.png

I understand that by defining stricter spec, you want to improve the interoperability, however, I think that's not possible without extra effort. Even if we allow only a URI or string inside the attachment list, there can be easily thousands of types of URI format, or I can serialize a json object to store it in the string to achieve the same goal. We won't get there even if we have a very well defined and constrained spec, so why not give more flexibility to make the fields more useful.

Even if we don't constrain the object type in the attachment, other software or website will at least know this general RDF contains X number of attachments. The alternative solution is to store these attachment with custom object types in config, then other software won't know anything. There is not fundamental difference, but it would be nice to use the key attachments to store attached objects.

FynnBe commented 3 years ago

The content of the views field is obviously very specific to this particular dataset. Thus, I would put all the content of views in a separate file and refer to it in the attachments.
With the files key nested somewhere in attachments it is very unclear how to treat it. This is where we need a separate discussion on what the dataset RDF really should look like and if it can be 'packaged' for example, or what that would mean. Ignoring the unclear dataset RDF status, I would write your example like this:

attachments:
  files:
      - {source: data.smlm, size: 14637650, md5: 'c0c8e05a5df2103633f696ed1d10b17d'}
      - data_views.yaml

there can be easily thousands of types of URI format

URIs are cleanly defined: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax and used elsewhere in our spec

Even if we don't constrain the object type in the attachment, other software or website will at least know this general RDF contains X number of attachments.

I would argue that from a very free structure of attachments it is hard to gage the actual number of attachments and a stricter format alleviates this.

For completely free format (cryptic to any general purpose software) data we introduced the config field. Maybe we should separate attachments and config more clearly...

tomburke-rse commented 3 years ago

@oeway I'm with Fynn on this. the example you brought up is a whole new data structure, which I would expect to be in another file. Making this information available and parseable in Java looks like a huge pain, even with it being Dict[Str, Any] is already an absolute wildcard from a Java perspective.

oeway commented 3 years ago

Well, as you mentioned earlier, I think the point of being strict is to allow interoperability. Even if we enforce the URI format and move custom things to a compatible URI, it doesn't help at all, the content inside the URI be stored in any format -- in this case I don't think it's useful at all to constrain the attachments into a URI.

As illustration, here are the two cases we discussed:

RDF -- attachments ---> arbitrary object
RDF -- attachments ---> URI ---> arbitrary object

There are other practical reason for why we want to get rid of the unnecessary URI layer, e.g. Zenodo limit the file number to <100, so you don't want to store them into many other files. Plus for small meta info, you don't want to make a separate file, because of the overhead of requesting the files (each HTTP request will have a lot of transmission overhead).

@oeway I'm with Fynn on this. the example you brought up is a whole new data structure, which I would expect to be in another file. Making this information available and parseable in Java looks like a huge pain, even with it being Dict[Str, Any] is already an absolute wildcard from a Java perspective.

Well, we already have the config field which can contain anything. Plus it doesn't change if we store a URI and the URI link to another file which can be in arbitrary format, does it?

With the files key nested somewhere in attachments it is very unclear how to treat it. This is where we need a separate discussion on what the dataset RDF really should look like and if it can be 'packaged' for example, or what that would mean. Ignoring the unclear dataset RDF status, I would write your example like this.

The point here is that the RDF is not meant for any other software to read without special treatment. Even if we do as you proposed by storing the file as a separate yaml, it will be exactly the same for other software, plus they need to dig into another layer then find they cannot read it.

attachments: files:

{source: data.smlm, size: 14637650, md5: 'c0c8e05a5df2103633f696ed1d10b17d'}

data_views.yaml

Well, this is not what I want, I would like to store many samples (where each sample consists of several files and screenshots), I need to keep the hierarchy, because if we do as you proposed we won't know the data_views belongs to data.smlm. Also note that I only show 1 sample here, in fact there are many samples organized in the same way.

tomburke-rse commented 3 years ago

Yes, the config file has that exact same annoying problem and as far as I know, is currently not handled except for things produced in fiji and then consumed again. For any other model, nothing happens. A best guess type-casting is possible, but is at best a shoddy solution.

Also, a dict containing complex data does not feel like an attachment to me. I'd except a file/URI there. If you have many files which are meant to be together and contain data, I'd expect one file referencing them to each other in the attachments.

Like Fynn mentioned, we should probably not mix attachment and config (where I would see any kind of metadata) here.

oeway commented 3 years ago

Also, a dict containing complex data does not feel like an attachment to me. I'd except a file/URI there. If you have many files which are meant to be together and contain data, I'd expect one file referencing them to each other in the attachments. Like Fynn mentioned, we should probably not mix attachment and config (where I would see any kind of metadata) here.

Just to clarify, we are not discussing specifically for the model RDF here, the model RDF can have specific requirements for any field by overriding the definition in the general RDF.

The general RDF should be general enough to fit other different cases, plus adding the URI layer doesn't solve the actual issue, but introducing the overhead to force small meta info storing in separate files, which is fine for one or two files, but when we have a long list of attachments, storing everything in separate files introduces unnecessary burden. For example to solve the zenodo limit ( which support only <100 files), one solution to store all the attachment files in a zip, then store the byte offset, labels along with other meta info in the attachment object)

Why we want to have some config info to be attached with file, well, in many cases the file itself cannot contain all the information we want, for example, if we want to attach a list of 16-bit images, we would like to also save the cat/dog label, plus some contrast stretching parameters to allow better view the image. You may argue why not save into tiff, but there are many other reason to not do that. Allowing additional info for each file would just allow more flexibilities. Imagine if we only allow URI, we will have a yaml file store the file name + config, then each yaml link to the 16-bit png file-- which is completely unnecessary to have the yaml file layer when we already have a RDF yaml. In both cases, other software won't be able to read it any way, I failed to see what's the benefit of enforcing the URI format or other restricted format for attachment objects.

tomburke-rse commented 3 years ago

From a Java point of view only: The main difference is this: If I know it's a dict containing URIs, I can just show and list them and the user can then open/download/whatever with them. If it's some data structure, I'd have to guess which most likely is a string representation to show it.

At the bare minimum, it contradicts my understanding of what an attachment is.

I'm not saying we should definitely not do it, as I can see your use-case. I just want to have all limitations and implications on the table.

oeway commented 3 years ago

From a Java point of view only: ... I'm not saying we should definitely not do it, as I can see your use-case. I just want to have all limitations and implications on the table.

Thanks for your inputs here. I won't worried about this for now, because the general RDF are mostly being used for cases where the spec is not clearly defined, for example having some trials on different formats for storing dataset.

Once the trail is done, we can promote them as specific standard RDF. After that, we implement them in compatible software -- and by then if we implement in Java, you will see a clear definition of the fields (instead of any object type).

esgomezm commented 3 years ago

Hi all, @oeway I see your point about making attachments less strict. However, the attachment field is common to the model.yaml and the general RDF. For this reason, if we allow attachments of such complexity in the model.yaml, I wouldn't expect all the consumer software to be able to process such kind of information. This has two main drawbacks:

A model with the basic field (without any config), could be not cross-compatible.
I guess that the CI could sometimes, not be able to check what's the content of this field.

I guess not, but could it be possible to limit the attachment field of the model.yaml?

Here I store the samples in a list, and each sample contains some files and some screenshots (views), what's important is that I needed to store the viewer parameters which the screenshot was taken so when the user click on the screenshot, we can recover the state in the viewer:

This kind of application is great but the information of the attachment is more about the input information that this specific application needs to display or work with the SMLM screenshots. Please correct me if I'm wrong, but I see this as yet another use-case of the config field.

oeway commented 3 years ago

However, the attachment field is common to the model.yaml and the general RDF. For this reason, if we allow attachments of such complexity in the model.yaml, I wouldn't expect all the consumer software to be able to process such kind of information.

I guess not, but could it be possible to limit the attachment field of the model.yaml?

Yes, we can limit that if necessary. But since we already have config which allows any object, the consumer software would already fail if undefined field is a problem.

Two things I would like to clarify: 1) loose definition in the general RDF doesn't mean the model RDF will have the same loose definition, the model RDF can always add constrains on the object types in attachments, e.g. only allow URI if necessary. 2) Even if we share the same definition for the model RDF, most consumer software won't even try to read the attachments, it's only used by the packager at the moment, under a specific key called files.

Keep in mind allowing any object in the attachment list is one thing, whether we use it, check it and fail because of it is another thing. That is to say, if one store attachments in a wired format, he/she must have a reason to do that, as a consumer software why would you want to process such information? You can simply ignore that. It is the same reason why allowing any object type in config is not a problem. Also the same why we can add any markdown text in the documentation, human read it but the consumer software won't try to parse the content of it.

Overall, the change we are discussing here can be easily made not to affect the model RDF at all but it does help other RDF types.

Please correct me if I'm wrong, but I see this as yet another use-case of the config field.

Yes, we can store such info in config, and we can always do that if we think we shouldn't do allow this in the attachment. However, we also have the general guideline of if an existing field in the RDF fits your needs, then you should not use config.

Here the point I try to make is that, by making the definition of attachment less restricted, we can make the field more useful for cases where a list of files+config can be stored in the attachments. The config info is to complement the file URI.

For example, for the case of storing training dataset for image classification, the most sensible way to do is to store a label with the image file URI, ideally as a list in the attachments.

Another use case is a 'collection' type of RDF which can contain a list of other RDFs, we can add other RDFs as attachments. Adding the URI in the attachment list won't be enough because imagine if you have 1000 other RDF attached and want to render it on a website, to show the cards, you will need to send a 1000 http requests to pull info when the user click on the collection card-- this is slow and will exceed the server rate limit(e.g. zenodo will block you). To solve this, we need to cache the basic RDF info in the attachment directly, so we only need 1 http request and we can render the 1000 items. More detailed info can be requested when the user click on the specific RDF.

And again, enforcing URI does not help solving the actual problem but creating extra burden on how object are attached.

FynnBe commented 3 years ago

Another use case is a 'collection' type of RDF which can contain a list of other RDFs

We used to refer to this as a manifest, but with the recent changes in In https://github.com/bioimage-io/spec-bioimage-io/pull/150/files I renamed it to collection RDF: https://github.com/bioimage-io/spec-bioimage-io/blob/25cbba6001fb1e7ecd34baa9236981e2a907f749/bioimageio/spec/v0_3/schema.py#L848-L853

Neither Manifest, nor Collection have used attachments, so I would not consider this a use case.

IMHO a core problem is that we rely on attachments:files for our packaging logic, etc.. This is a) not documented and b) impossible to validate. However we move forward with attachments I would appreciate if we could disentangle the file attachments from it. Possibly even defining attachments to what is now attachments:files or by separating it into file_attachments. Either approach makes documentation and validation much easier.

Atm, I don't see any reasonable use-case outside of a 'dataset RDF', for which we need a separate discussion: #153

When introducing attachments the notion was that it contains additional information, loosely related to the resource. We did not specify it further as we assumed it is not required to 'run the model'/'work with the content of the RDF', etc. We broke this already by heavily relying on attachments:files (which is problematic as noted above).

(TL;DR) My conclusion: Thinking about this for a bit now, I would vote for redefining attachments as URIs (+local relative Paths) in a more or less nested way (plain list, dict, some limited combination thereof) and move anything else to config and discuss incorporation of any potential new fields into an RDF from there.

oeway commented 3 years ago

Neither Manifest, nor Collection have used attachments, so I would not consider this a use case.

Well, that's just a legacy implementation with the manifest. As we discussed in a previous meeting, while migrating to zenodo, I would like to use generic RDF also for collections, basically to unify all the implementation. Therefore, this use case is going to be valid.

A general thought here about defining new spec, I think defining a new spec without testing it in real application is the wrong way to go, although we can say we are smart and experienced, but no one can foresee things without actually testing. The same apply for dataset and other spec, it's too early to define that, as we don't have a single working example and concrete use cases. I afraid that we will waste time on debating cases which no one actually tried it.

I hope we can change the way how new spec options are proposed and evaluated, one should come up with concrete examples and actually tested it, ideally have different options applied. To make that work, we need to have a more inclusive general RDF spec which allows experimentation of different options.

Regarding the use of attachments:files, I don't see this is a pressing issue, we don't need to define a spec before using it (since it's still a valid RDF), there can be always conventions that we follow or recommend. And right now, it's good that we are already testing it in the packager, but I don't think we have tested that enough. I would rather promote that into the spec after we have used that more. Moreover, I would only add that for the model RDF spec, not the general RDF.

@FynnBe I would like to hear your thoughts on how you would address the practical issues with URI I mentioned in my previous comment. If we agrees that we should not allow any object in the attachments of general RDF, I would perhaps not supporting all those different combinations either, since too many branch cases will make it hard to parse and read the attachments. I would keep the current definition, only allow a list of URI.

A side remark on this, being too restricted is actually not helping to regularize the RDFs, it's actually quite the opposite for general RDF. For example, for the several cases I talked about, I might have to invent a new key with similar meaning, place them inside or outside the config.

FynnBe commented 3 years ago

TL;DR: Not allowing to directly insert undefined data in attachments keeps the RDF complying content separate from the external and experimental stuff. Experimental entries go to config Experimental files (used by code running the config stuff) are referenced in attachments as they otherwise would be lost.

Details:

As we discussed in a previous meeting, while migrating to zenodo, I would like to use generic RDF also for collections, basically to unify all the implementation. Therefore, this use case is going to be valid.

I may have missed that one meeting, but it is not obvious to me how the generic RDF can be used as collection RDF. However, if we do use the attachments field for this purpose, URIs in there would be sufficient. In other words: IMO the nested RDF should exist on its own in a separate file.

A general thought here about defining new spec, I think defining a new spec without testing it in real application is the wrong way to go, although we can say we are smart and experienced, but no one can foresee things without actually testing. The same apply for dataset and other spec, it's too early to define that, as we don't have a single working example and concrete use cases. I afraid that we will waste time on debating cases which no one actually tried it.

I hope we can change the way how new spec options are proposed and evaluated, one should come up with concrete examples and actually tested it, ideally have different options applied. To make that work, we need to have a more inclusive general RDF spec which allows experimentation of different options.

The definition of the 'collection RDF' is merely capturing current practice. We can move away from it and define collections/manifests differently. I described the de facto 'collection RDF' as we are using it actively in the validator, e.g. we are currently validating manifest files in CIs... This requires the validator to traverse the inner RDFs, while not following every URI it encounters.

A side remark on this, being too restricted is actually not helping to regularize the RDFs, it's actually quite the opposite for general RDF. For example, for the several cases I talked about, I might have to invent a new key with similar meaning, place them inside or outside the config.

I would argue that neither specifying a dataset nor a collection of RDFs are generic use cases of the 'general RDF'. This does not mean they need specific RDF definitions, but they certainly need their own RDF.type value. if the fields of the 'general RDF' are sufficient to describe a dataset/collection, the type can indicate how to interpret the data, e.g. "What do URIs in 'attachments' point to?".

@FynnBe I would like to hear your thoughts on how you would address the practical issues with URI I mentioned in my previous comment. If we agrees that we should not allow any object in the attachments of general RDF, I would perhaps not supporting all those different combinations either, since too many branch cases will make it hard to parse and read the attachments. I would keep the current definition, only allow a list of URI.

Restricting attachments to a list of URIs (and treating it like we treat attachments:files) sounds good to me 👍. Then attachments has a clear purpose: files to include in zip package.

About "including undefined" vs "URIs to "undefined": intent of URIs is clear: an additional file this resource "somehow" depends on. The "somehow" may be undefined for the RDF, but it is the resource developers task to make sure it does not collide with the RDF. (examples below)

Any feature in development being specified in config can use this mechanism to make sure files are present. As we aim at cross-compatibility we should strive to keep any RDF as strict as possible. Including or pointing at data in a special format will always interfere with usability, but we need the flexibility to develop. This is what config is for and in my view attachments formalizes packaging of future/experimental files.

Additional text you might want to skip reading:

examples for attached files I can think of:

models with source code might use attachments to include additional source code files or other files the model code depends on. As they will be packaged, the source code in the specified source file can assume their presence. The following may be included in the package if small (otherwise should just be linked to in description or documentation):
- a 'demo.py' script hinted at in the documentation. This is understandable by a human user, but the 'demo.py' is none of the RDF/software concern. All we guarantee is to include in the package.
- similarly: 'super_special_in_depth_doc.md' or
- a training script

or for @oeway 's dataset example:

...
type: dataset
attachments:
  - data.smlm
  - samples.yaml

config:
  - my_new_dataset_format:
      samples: samples.yaml

However, it is difficult to discuss a datset case without a clear idea of how a dataset RDF (or a general RDF with type: dataset) is used. I can imagine us adding something like the run_mode in the model RDF (dataset_type?) to include several ways of defining a dataset... -> #153

oeway commented 3 years ago

However, if we do use the attachments field for this purpose, URIs in there would be sufficient. In other words: IMO the nested RDF should exist on its own in a separate file.

I think you didn't get what I described previously about the 1000 requests problem, see below: Another use case is a 'collection' type of RDF which can contain a list of other RDFs, we can add other RDFs as attachments. Adding the URI in the attachment list won't be enough because imagine if you have 1000 other RDF attached and want to render it on a website, to show the cards, you will need to send a 1000 http requests to pull info when the user click on the collection card-- this is slow and will exceed the server rate limit(e.g. zenodo will block you). To solve this, we need to cache the basic RDF info in the attachment directly, so we only need 1 http request and we can render the 1000 items. More detailed info can be requested when the user click on the specific RDF.

I understand where you are from, but there are practical issues you only notice when you implement stuff with the spec.

intent of URIs is clear: an additional file this resource "somehow" depends on. The "somehow" may be undefined for the RDF, but it is the resource developers task to make sure it does not collide with the RDF.

It is true that a URI can link anything, but it cause issues for rendering from the website, if we want to simply render each URI on the website, you need extra meta info to do that. Again you cannot pull thousands of URIs just for rendering, and you cannot just show a URI which is not necessary human readable, you need meta info for that.

FynnBe commented 3 years ago

I think you didn't get what I described previously about the 1000 requests problem, see below:

maybe. As far as I understand this is a technical problem which we can solve with cashing. I don't think it would be desirable to maintain any collection with 1000+ entries manually. Pulling this together and cashing (with a CI) sound like the way to go here.

It is true that a URI can link anything, but it cause issues for rendering from the website, if we want to simply render each URI on the website, you need extra meta info to do that. Again you cannot pull thousands of URIs just for rendering, and you cannot just show a URI which is not necessary human readable, you need meta info for that.

you don't want to render a nested dict/list/undefined structure with 1000 elements either though... Why do we have to render the whole thing? And why wouldn't a URI suffice? A URI can always be rendered with an abbreviated hyperlink... Isn't this an arguemnt for the URI only direction?

I think these are really secondary concerns and should not determine how we define the RDF

oeway commented 3 years ago

maybe. As far as I understand this is a technical problem which we can solve with cashing. I don't think it would be desirable to maintain any collection with 1000+ entries manually. Pulling this together and cashing (with a CI) sound like the way to go here.

Well, the entire bioimageio is all about technical problems.

it's good that you mentioned caching, my proposal is also designed for caching. Think about it, after caching, it would be nice if we can still produce a valid RDF. Now where do you store the cached information? In an arbitrary config field? Or you can store it right in the attachments together with the URI? I would do the later, so we will have something like this in the RDF with cached meta info:

attachments:
  notebooks:
    - id: Dataset_Noise2Void_2D_ZeroCostDL4Mic
        name: Noise2Void (2D) example training and test dataset - ZeroCostDL4Mic
        description: Fluorescence microscopy (paxillin-GFP)
        doi: https://doi.org/10.1101/2020.03.20.000133
        authors: [Aki Stubb, Guillaume Jacquemet, Johanna Ivaska]
        documentation: >-
        https://doi.org/10.5281/zenodo.3713315
        tags: [Noise2Void, denoising, ZeroCostDL4Mic, 2D]
        source: https://doi.org/10.5281/zenodo.3713315
        covers:
        - https://github.com/HenriquesLab/ZeroCostDL4Mic/raw/master/Wiki_files/N2V_wiki.png

    - id: Dataset_Noise2Void_3D_ZeroCostDL4Mic
        name: Noise2Void (3D) example training and test dataset - ZeroCostDL4Mic
        description: Fluorescence microscopy (Lifeact-RFP)
        authors: [Guillaume Jacqueme]
        documentation: >-
        https://doi.org/10.5281/zenodo.3713326
        tags: [Noise2Void, denoising, ZeroCostDL4Mic, 3D]
        source: https://doi.org/10.5281/zenodo.3713326
        covers:
        - https://raw.githubusercontent.com/HenriquesLab/ZeroCostDL4Mic/master/Wiki_files/TrainingDataset_ShowOff_v3.png

you don't want to render a nested dict/list/undefined structure with 1000 elements either though... Why do we have to render the whole thing? And why wouldn't a URI suffice? A URI can always be rendered with an abbreviated hyperlink... Isn't this an arguemnt for the URI only direction?

Well, the point is not about 1000 files, even if we have 2 files, say, if each file is 1GB, there is it won't be possible to download the two just for render it in the webpage. The critical part is that we need that meta info to be able to render it nicely into a table or a RDF card for each attachment. In the minimal case, we want to be able to show at least the name of this object behind the URI without actually download the content from the URI (URI can simply be a random string, which may not readable at all).

Look,at this example below, it's a valid but long URI generated from an S3 server. Think how can we show it on the website:

attachments:
  files:
    - https://imjoy-s3.pasteur.fr/public/image.ome.tif_offsets.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=imjoy%2F20210707%2F%2Fs3%2Faws4_request&X-Amz-Date=20210707T144821Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=0811c891f40928aa40ffa0bd0cbb679886f08ec3e8e3e218c06b066c26350163

To be able to render it, I would need it to be like this (e.g. after caching by a CI server) to be able to render it on the website:

attachments:
  files:
    - {name: "Plant Root", "size": 10200, "author": {"name": "Wei Ouyang", "affliation": "KTH"}, download_url: "https://imjoy-s3.pasteur.fr/public/image.ome.tif_offsets.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=imjoy%2F20210707%2F%2Fs3%2Faws4_request&X-Amz-Date=20210707T144821Z&X-Amz-Expires=432000&X-Amz-SignedHeaders=host&X-Amz-Signature=0811c891f40928aa40ffa0bd0cbb679886f08ec3e8e3e218c06b066c26350163"}

Now, how can one understand it if the format of the object is undefined. For now, we will follow our own convention, for dataset I do this, for collection I do that, until we test it enough, we define it clearly as a child RDF spec.

oeway commented 3 years ago

Regarding what kind of number we are aiming at, I don't see the reason why we should not be able to store a large number of items as attachments. I am actually plan to make a dataset RDF with 80,000+ images from the public HPA website, and because we want to show it on the website, having a URI wont' be enough, because each image consists of 4 image channels, with ID, gene etc. meta info, plus, we need also the thumbnail images to facilitate the visualization of the dataset.

Even if we have only 1 image, each of them are around 27MB, which is already not practical to display.

Think about it, if we show a dataset card, is it nicer to show a list of URIs which one can click one by one, or we can generate a gallery view, with cover image + simple meta info.

constantinpape commented 3 years ago

Sorry, I did not have time to go through all this, but I think we should really focus on making what we have right now work! I just started to work with the spec module again after 2 weeks of pause and it's in a state where nothing is working out of the box and many things are inconsistent. (I will make an issue collecting all the errors I encountered in a sec) I would very much suggest we postpone all discussions of future changes until we are in a more stable state.

oeway commented 3 years ago

Agreed, it seems it's hard to make an agreement on this.

The other thing is that the RDF spec 0.2.0 we had previous actually doesn't limit it to a list of URI, I actually made it to be general and define attachments as :

a group of attachments, list of resources grouped by keys

Now because I try to use the validator find it changed the original definition, no idea how it evolved to be the current version which put lots of constrains on it and breaks my previous stuff.

There are two options for now, either stick with RDF 0.2.0 for now (but the problem is the authors field is not up to date), or revert the changes to the spec for the attachments. Well, at least only lift the restrictions before we have an agreement on it. @FynnBe What do you think?

Edit: I am also a bit confused about what are the current version number of general RDF? Didn't we agreed to make it independent versioning.

FynnBe commented 3 years ago

Edit: I am also a bit confused about what are the current version number of general RDF? Didn't we agreed to make it independent versioning.

yes we did and it is at 0.2.0: https://github.com/bioimage-io/spec-bioimage-io/blob/cf9e737d9511995c1dce5c73fbd93fb6f3201f4e/bioimageio/spec/v0_3/raw_nodes.py#L24 But good that you bring this up, I will fix the wrong "0.3.2" in the docs

edit: fix: https://github.com/bioimage-io/spec-bioimage-io/pull/158

a group of attachments, list of resources grouped by keys

to be closer to your original definition of attachments, would it suffice to allow the following formats as '0.2.0' and take it from there? This encompasses the mentioned examples in your original description of RDF 0.2.0

Dict[str, URI]
Dict[str, List[URI]]
List[URI] # currently in validator
URI # currently in validator

I would like to avoid allowing anything right now only because there a very few older resources that were written when things were still very much in early development and in no way "released", but if this is too inconvenient for you we can of course just allow anything for now and restrict it with the next version bump.

oeway commented 3 years ago

to be closer to your original definition of attachments, would it suffice to allow the following formats as '0.2.0' and take it from there?

For me, the best thing is to stick with the original definition to allow anything. The RDF 0.2.0 has exists for quite a while and I have used the same in two other projects (one of them is I mentioned above).

For whether we should add these restrictions for the general RDF, I am not connived yet, but we should definitely continue the discussion. It might be helpful if we continue the discussion in the other thread you opened for dataset: https://github.com/bioimage-io/spec-bioimage-io/issues/153 Since there we have a more concrete use case.

oeway commented 3 years ago

I was thinking how we can impose some validation while allow the flexibility to do what I wanted to do with the attachments.

Another thought is that we can allow the attachment list to be a list of general RDFs if we can implement some sort of recursive validation. If that works, it solves the issues of not validated at all but also make my current implementation fit into the schema.

Plus, this implementation will accurately capture what we defined in the spec 0.2.0 (i.e. a group of attachments, list of resources grouped by keys).

This means we allow both Dict[str, List[RDF]] and Dict[str, List[URI]]. What do you think?

@constantinpape @FynnBe @tomburke-rse @esgomezm

FynnBe commented 3 years ago

the identical nested schemas are a bit of a problem (also implmementation-wise.. certainly doable somehow, but not straight forwrad). However, I do like this direction very much. Would it be sufficient to do this for a "collection RDF" only and disallow nesting a collection RDF? Then we add the special "collection RDF" Which is only special in the allowed type value and allow nesting any other RDF in it.. This is limited to depth 1 (which covers all our use-cases I hope) and gives no interpreter nor programmer a head-ache....

oeway commented 3 years ago

the identical nested schemas are a bit of a problem (also implmementation-wise.. certainly doable somehow, but not straight forwrad). However, I do like this direction very much. Would it be sufficient to do this for a "collection RDF" only and disallow nesting a collection RDF? Then we add the special "collection RDF" Which is only special in the allowed type value and allow nesting any other RDF in it.. This is limited to depth 1 (which covers all our use-cases I hope) and gives no interpreter nor programmer a head-ache....

That's good, can you not limit the type and recursion level (or support very deep recursion)? I will implement these limitation from the website itself.

In general, where to lift the limitation is implementation detail. Similar to the type discussion here, we do need to validate the exact types and levels etc., however, I would rather not do these validation in the python pacakge, but let the website to decide what specific type of RDF are supported, as well as the depth limitation. These are web site/consumer-specific feature that should be decided by the website, and they can change rapidly. Having the constrains implemented elsewhere is just causing maintenance burden. And I feel like tightly couple the python spec and the website very tightly is a not a good idea.

oeway commented 3 years ago

FYI: looks like we can easily implement recursion in JSON schema: https://json-schema.org/understanding-json-schema/structuring.html#recursion

And same for marshmallow: https://marshmallow.readthedocs.io/en/stable/nesting.html#nesting-a-schema-within-itself

FynnBe commented 3 years ago

And same for marshmallow: https://marshmallow.readthedocs.io/en/stable/nesting.html#nesting-a-schema-within-itself

note the additon of 'exclude'.... thus there is a difference between the root and the nested schema, although they use the same definition.

FynnBe commented 3 years ago

That's good, can you not limit the type and recursion level (or support very deep recursion)? I will implement these limitation from the website itself.

The single-source-of-truth idea we follow for a while now would suggest not to implement any limits on the website, but rather define everything here, no?

FynnBe commented 3 years ago

In general, where to lift the limitation is implementation detail. Similar to the type discussion here, we do need to validate the exact types and levels etc., however, I would rather not do these validation in the python pacakge, but let the website to decide what specific type of RDF are supported, as well as the depth limitation.

The problem with this approach is that the validator alone will then be unable to reliably validate a model outside of the bioimage.io website. But exactly this is one of its designated use-cases.

These are web site/consumer-specific feature that should be decided by the website, and they can change rapidly. Having the constrains implemented elsewhere is just causing maintenance burden. And I feel like tightly couple the python spec and the website very tightly is a not a good idea.

We strive for a single source of truth. I partially agree here with you, "Having the constrains implemented elsewhere is just causing maintenance burden". The spec and its validation should be a clear and full description of our specification that can be used in many places but is defined only once. If we make the spec and its validation configurable it becomes unclear what it means to adhere to the spec in the first place.

FynnBe commented 3 years ago

If the spec needs adaption for another project I would recommend to fork spec-bioimage-io and adapt it to the other project's needs. In the context of the other project it can be decided to what extent it needs to be compatible with the bioimage.io model zoo, but here we should focus our efforts on bioimage.io and not on a new spec/validation tool for general purposes.

oeway commented 3 years ago

The single-source-of-truth idea we follow for a while now would suggest not to implement any limits on the website, but rather define everything here, no?

Yes, we do have such discussion, but mostly in the context of model spec which makes a lot of sense. For the general RDF however, the previous deal is to first host the general RDF in json schema, then we decided to use marshmallow to generate the json schema along with the document. I expected it will be an implementation details which should not matter much, but the real issues comes when the marshmallow implementation adds lots of restrictions which are not defined by the original spec 0.2.0. And many of them make the current implementation on the website not comply with the marshmallow schema. And I had to argue many of the very technical and implementation related details here with you guys. Honestly, I think we are wasting time on premature technical discussions, the spec is one thing, but if I ended up with wasting more time on convincing you to change the spec implementation and not working on the website, it's not a good sign.

The problem with this approach is that the validator alone will then be unable to reliably validate a model outside of the bioimage.io website. But exactly this is one of its designated use-cases.

To be clear, I think we should do what we are doing right now for the model spec, as restricted as possible as you wanted. But we should not do the same thing for the general RDF which in only used by the website, here I think we should let the genreal RDF to be general, and let the website to implement the restrictions.

constantinpape commented 3 years ago

As far as I understand we are discussing two main things rn:

Is the python validator the single source of truth (also for the general rdf spec), i.e. does it do strict validation?
How do we extend the general RDF, e.g. do we allow nesting of other RDFs.

These are important discussions that we need to have, but not relevant for the 0.3.2 release; for this one we decided to go with the less restrictive approach already.

I think we need to start having these discussions more focused on a release by release basis, otherwise we keep on discussing future options without making any actual progress.

So my take: can we release 0.3.2 now? What rdf "sub-versions" will it have, I think model 0.3.2 and general 0.3.0, correct?

Let's do this first and then continue on to future spec discussions.

FynnBe commented 3 years ago

I think we are still discussing general 0.2.0 as my interpretation of it did not fit @oeway 's implmentations so far and it makes more sense to to honor his understand of it by changing the spec of general RDF 0.2.0

FynnBe commented 3 years ago

here, the remaining open detail is the attachements field. which we can make a free dictionary, but as @oeway suggested as a compromise, make as nested RDFs (or URIs)... I think this would be great, the only problem being infinite nesting. And that's even a practical issue as far as I understand it.

constantinpape commented 3 years ago

I think we are still discussing general 0.2.0

I think it should be 0.3.0, because we changed the authors field compared to 0.2.0

suggested as a compromise, make as nested RDFs (or URIs)... I think this would be great, the only problem being infinite nesting

I would very strongly vote for not doing this now and stick with the decision we have come to in the Friday meeting, i.e. go with the less restrictive version (i.e. arbitrary dict) now and work on more restrictions in the future releases.

oeway commented 3 years ago

As I understand, these discussions for the general RDF should not affect the model spec. The attachments for example, the model spec only used a list of files for packaging.

For the course on Friday, we have already settled the details for 0.3.2 and all these are implemented already in the validator, no? @FynnBe If not we should definitely prioritize that.

constantinpape commented 3 years ago

For the course on Friday, we have already settled the details for 0.3.2 and all these are implemented already in the validator, no?

Well yes, but we haven't made the actual 0.3.2 release yet. And I think that should be the absolute priority.

FynnBe commented 3 years ago

maybe https://github.com/bioimage-io/spec-bioimage-io/pull/172 can fix this for this release...

FynnBe commented 3 years ago

I think it should be 0.3.0, because we changed the authors field compared to 0.2.0

after the back and forth general 0.2.0 allows both author formats (while this is a change of our original 0.2.0 definition it does include it and is the least strict) in general 0.3.0 we can then switch to the new authors only. sounds good? https://github.com/bioimage-io/spec-bioimage-io/blob/8157d555cba0d70ff24d2b14df35cfb254a34b23/bioimageio/spec/v0_3/schema.py#L92-L96

oeway commented 3 years ago

after the back and forth general 0.2.0 allows both author formats (while this is a change of our original 0.2.0 definition it does include it and is the least strict) in general 0.3.0 we can then switch to the new authors only. sounds good?

Sounds good! That might be even easier for the user to write RDF.

constantinpape commented 3 years ago

after the back and forth general 0.2.0 allows both author formats (while this is a change of our original 0.2.0 definition it does include it and is the least strict) in general 0.3.0 we can then switch to the new authors only. sounds good?

Sounds good, let's go with this solution!

oeway commented 3 years ago

This is all resolved, closing now.

Thanks everyone involved in the discussion!

bioimage-io / spec-bioimage-io

[RDF] Make the spec for attachment less strict #148