bio-tools / biotoolsSchema

biotoolsSchema : Tool description data model for computational tools in life sciences
Creative Commons Attribution Share Alike 4.0 International

How to handle one input, but multiple ways to specify it #83

Open joncison opened 7 years ago

joncison commented 7 years ago

i.e. the classic example where a tool processes a sequence but this can be specified as a raw sequence or by an identifier.

seems to me the natural way to model this is to allow 1...many Data operations for an Input or Output; however very clear guidelines would be needed, i.e. we want "many" Data operations to imply that this input can be specified in more than one way, and not that this input can be considered as two types of data.

this issue is just intended to get a discussion going .... cc @matuskalas

joncison commented 6 years ago

cc @hansioan @baileqi @ekry @matuskalas

I'm somewhat loath to change this, because there would be many knock-on consequences for the various UIs that adhere to the model.

What we have currently (which cannot cope with the "raw sequence" or "sequence identifier" scenario above): [image: capture]

From an XSD perspective it can easily enough be "fixed", thus: [image: capture]

Note you can now specify multiple pairs of data+format for a given input. But as I say, I'm loath to do so because of the knock-on effects (UIs, API ...). I'm probably, at this stage, leaning towards not making this change, but I'm not sure.

Thoughts please ...

joncison commented 6 years ago

Latest thoughts on this (and 90% sure to be included in biotoolsSchema 3.0.0 thus bio.tools) are here.

matuskalas commented 6 years ago

👍 Well, you know my thoughts on this, as they haven't changed :-) (see also related but different #2)

Still, I don't understand your suggestion (XSD change) in https://github.com/bio-tools/biotoolsSchema/issues/83#issuecomment-341667924 (i.e. https://user-images.githubusercontent.com/1506863/32369707-1e621a3e-c082-11e7-8ee8-2921dccb4f3f.PNG). (If you mean <xs:sequence maxOccurs="unbounded"> then I wouldn't suggest it as a simple hack, because it's generally not recommended and hard to parse. And it probably won't work in JSON at all.)

A couple of options, in order of sophistication:

  1. Loath to change and fear of getting repetitive requests.

  2. Add an option of an "OR" logic between inputs (and outputs). Implementable in various ways. (Let me think in the meantime about one or more simple ways.)

  3. Allow multiple EDAM Data concepts for one input/output (fixing the related part of #2), and add a separate "OR" logic as mentioned one above.

I'd very much suggest either 1. or 3., i.e. either all, or nothing (and all in the future).

matuskalas commented 6 years ago

Implementation suggestions:

(without <xs:sequence maxOccurs="unbounded">)

  2. Three options:

a) Simple option with a new mandatory element parameter:

<xs:element name="input" minOccurs="0" maxOccurs="unbounded">
    <xs:complexType>
        <xs:choice>
            <xs:sequence>
                <xs:element name="or" maxOccurs="unbounded">
                    <xs:complexType>
                        <xs:sequence>
                            <xs:element name="parameter" type="dataType" minOccurs="2" maxOccurs="unbounded"/>
                        </xs:sequence>
                    </xs:complexType>                   
                </xs:element>
                <xs:element name="parameter" type="dataType" minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
            <xs:element name="parameter" type="dataType" maxOccurs="unbounded"/>
        </xs:choice>
    </xs:complexType>
</xs:element>

This option cannot express (A and B) or (C and D) nicely, because it looks and behaves like the conjunctive normal form ;-)

(A and B) or (C and D) <=> (A or C) and (B or C) and (A or D) and (B or D)
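The equivalence stated above can be checked mechanically. A minimal sketch in Python that compares both forms over all 16 truth assignments (the function names `dnf` and `cnf` are illustrative only and not part of biotoolsSchema):

```python
from itertools import product


def dnf(a, b, c, d):
    # Disjunctive form: (A and B) or (C and D)
    return (a and b) or (c and d)


def cnf(a, b, c, d):
    # Conjunctive form: (A or C) and (B or C) and (A or D) and (B or D)
    return (a or c) and (b or c) and (a or d) and (b or d)


# Compare the two forms on every combination of truth values for A, B, C, D.
equivalent = all(
    dnf(*values) == cnf(*values)
    for values in product([False, True], repeat=4)
)
print(equivalent)  # True
```

In terms of option a), this means two alternative pairs of parameters would have to be written as four separate `or` groups, which is why the form is awkward for such cases.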

b) Cleaner option with a cleaner xs:choice, and backwards compatible with the current schema, i.e. no new mandatory elements:

<xs:element name="function" minOccurs="0" maxOccurs="unbounded">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="operation" maxOccurs="unbounded">
            ...
            </xs:element>
            <xs:element name="input" type="dataType" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="output" type="dataType" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element name="or" minOccurs="0" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:choice>
                        <xs:element name="input" type="dataType" minOccurs="2" maxOccurs="unbounded"/>
                        <xs:element name="output" type="dataType" minOccurs="2" maxOccurs="unbounded"/>
                    </xs:choice>
                </xs:complexType>
            </xs:element>
            <xs:element name="comment" minOccurs="0">
            ...
            </xs:element>
            <xs:element name="cmd" minOccurs="0">
            ...
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

Still, this can't express (A and B) or (C and D) nicely, as it's still based on the conjunctive normal form.

c) Or a super-clean option, without xs:choice but with 2 new mandatory elements, looking and behaving like the disjunctive normal form:

<xs:element name="input" minOccurs="0" maxOccurs="unbounded">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="option" maxOccurs="unbounded">
                <xs:complexType>
                    <xs:sequence>
                        <xs:element name="parameter" type="dataType" maxOccurs="unbounded"/>
                    </xs:sequence>
                </xs:complexType>                   
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

This also enables expressing (A and B) or (C and D) nicely. The price to pay is that shared parameters have to be repeated, e.g. (A and B) or (C and B) or (D and B) if B is always mandatory. But from the user's point of view, this is not a problem, but rather a clear enumeration of the choices! The only remaining inconvenience is that at least one enclosing option element always has to be added. Still, the cleanest ultimate solution!
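To illustrate how a consumer might interpret option c), here is a rough sketch in Python. It rests on several assumptions: the instance document is hypothetical, the content of `parameter` is simplified to a single `data`/`term` child rather than the full dataType, the term values are placeholders, and the helper names `parse_options` and `is_satisfied` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of option c). The <term> values are placeholders and
# the dataType content is simplified to one <data><term> child per parameter.
INSTANCE = """
<input>
    <option>
        <parameter><data><term>Raw sequence</term></data></parameter>
    </option>
    <option>
        <parameter><data><term>Sequence identifier</term></data></parameter>
        <parameter><data><term>Database name</term></data></parameter>
    </option>
</input>
"""


def parse_options(xml_text):
    """Return one set of required data terms per <option> (an OR of ANDs)."""
    root = ET.fromstring(xml_text)
    return [
        {param.findtext("data/term") for param in option.findall("parameter")}
        for option in root.findall("option")
    ]


def is_satisfied(options, supplied):
    """True if every parameter of at least one option has been supplied."""
    return any(required <= set(supplied) for required in options)


options = parse_options(INSTANCE)
print(is_satisfied(options, {"Raw sequence"}))         # True
print(is_satisfied(options, {"Sequence identifier"}))  # False
```

Each `option` is read as a conjunction of its `parameter` children, and the `input` as a disjunction of its `option` children, which mirrors the disjunctive normal form described above.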

  3. Also add <xs:element name="data" type="EDAMdata" maxOccurs="unbounded"> in dataType and its xs:restrictions to fix the semantic part of #2. I'm sure the backend and GUI fixes will be trivial, as that is allowed for both Operations and Formats (just not Data). In general, this is trivial compared to 2.

joncison commented 6 years ago

Thanks a lot for this. We should give it more thought. I don't want to change anything for the next release, for fear of changing too many things all at once (esp. something at the core of the model, like this).

For now we have nice clear guidelines, and we can improve on things, most probably, as soon as the quality of the bio.tools entries has improved a bit and a more sophisticated approach is warranted.

matuskalas commented 6 years ago

Ok, @joncison.

Should we update at least the 3. (<xs:element name="data" type="EDAMdata" maxOccurs="unbounded"> in dataType and its xs:restrictions) ?

joncison commented 6 years ago

For now I'm inclined to leave it as-is, i.e.: [image: capture]

but revisit once the existing annotations are improved, and such deeper annotation is desirable. Bear in mind there's a big ongoing clean-up of existing EDAM topic and operation annotations (https://biotools.sifterapp.com/issues/156) and until that's finished, data and format (whilst super important) are a secondary concern ... for now! cc @hansioan