bio-tools / biotoolsSchema

biotoolsSchema : Tool description data model for computational tools in life sciences

Community definition & approval of tool information standard #77

Closed: joncison closed this issue 6 years ago

joncison commented 7 years ago

As part of a drive to improve the quality of content in https://bio.tools, we now have a candidate information standard, which defines the attributes that must be specified for an entry to be nominated (possibly eventually labelled) as of "minimal", "silver" or "gold" standard quality. The standard is based upon biotoolsSchema (https://github.com/bio-tools/biotoolsschema) and will underpin bio.tools quality metrics and KPIs (under development), guiding all future curation work.

See: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement

Please can I get some feedback here?

hansioan commented 7 years ago

The standards checklist is missing the following attributes:

In the checklist I would change "Persistent human readable URL" to also contain "homepage". The word homepage offers a clearer picture of the goal of this attribute. Furthermore, in the table below we mention it as Homepage.

In my opinion, scientific input/output formats are by far the hardest properties to annotate; they take a lot of time to find. I would at least move the formats to the gold standard, if not data too, and move documentation and/or license (or some other attributes) to the silver standard.

joncison commented 7 years ago

To clarify, we need to be selective about including attributes in the standard (there are over 50 defined in biotoolsSchema), e.g. I don't see the case for including <collection>. I figured <download> was covered already by "Repository".

I'd say <cost> and <accessibility> are stronger candidates for inclusion whereas <operatingSystem>, <language> and <maturity> are weaker.

What other types of link are you suggesting to include?

I agree re data & format being hard to supply, but then these are high value. When deciding to include at "silver" or "gold" we have to bear in mind we'll be setting the curation priority: good to hear what others think.

magnuspalmblad commented 7 years ago

Data types and data formats are of course both important. The type provides context for the scientific operation. The current proposal draws the line between basic and silver between operation and data types. The data types are more abstract, like the operation, whereas the data formats are more concrete. Perhaps I would feel equally comfortable drawing the line between data type and a list of supported data formats. As you say, these are much more time consuming to supply (unless you are the developer). Often one would have to install and test the software with files of different formats and see what works, as file format support may not even be completely described in the software documentation.

I agree with the order of the requirement and the cut between silver and gold. The only change I would consider would be to add data types to the basic standard, as long as annotators are still allowed to save an entry before reaching the "basic" level. Or is the idea to formally require at least the basic level?

veitveit commented 7 years ago

Nice distinction!

+1 to Hans' suggestions to add more attributes such as collection and operating system. Operating system, language and download (download type) could go into silver.

Data and Format are difficult to annotate but on the other hand are very important to accurately describe a piece of software. We therefore need to push to get them into bio.tools as much as possible, leaving them either at silver or even moving Data to basic.


hansioan commented 7 years ago

Regardless of whether we want to add more attributes to any of the standards, it would be good to have a place where we mention (and possibly describe) all of them. I've looked at: https://github.com/bio-tools/biotoolsSchema/blob/master/README.md#documentation and I cannot find the complete list of attributes, except the huge HTML file which GitHub won't render. This big HTML file contains way too much detail to be the "default" place to view all attributes and is hard to navigate. If there is another resource that describes the attributes and I've missed it, please point me to it.

As a UI-related point, it would be nice, at registration (or validation) time and in the search results, to show (e.g. using icons) the information standard level of each tool.

My opinion on EDAM data and formats is that they take around 50% of the entire annotation process, just because they are hard to get right unless you're the developer. I believe that the "basic" standard should be a fairly straightforward, but mandatory, goal to reach. On the other hand, the "basic" standard should also provide enough information to get a "basic" idea of what the tool does. As an example, in a basic tool, if I see the Topic "Sequence analysis" and the operation "Sequence alignment", I can kinda infer what the input data and formats should be. That being said, I think that data and formats should not be in basic, and what we have now in "basic" is enough. If I were to add something to basic, I would add contact information.

Another point I want to make is that, if one reaches "silver" standard by annotating all the required "silver" standard attributes, getting to gold is a trivial task as one would only add a handful of attributes which are fairly easy to find. Whether the hard jump should be from "basic" to "silver" or from "silver" to "gold", I am unsure.

hansioan commented 7 years ago

@joncison There are some >> in the "element tags" column in the attributes table. Examples: <function><operation>>, <function><input>/<output><data>>. I assume these are there by mistake?

joncison commented 7 years ago

Thanks everyone! Already we have some excellent inputs; for now (until we hear from others) I'll just pick up on your questions:

:: "Or is the idea to formally require at least the basic level?" Yes; all attributes would have to be specified before saving. NB. we have a legacy to deal with including many entries that are below the minimum standard (often because "top-level" EDAM annotations e.g. Topic=="Topic" have been used, just to get the validation to pass).

:: "Regardless if we want to add more attributes to any of the standards, it would be good to have a place where we mention (and possibly describe) all of them." Every attribute (including even the values within enumerations) are comprehensively documented in the XSD itself. From the XSD, an HTML is generated: see https://github.com/bio-tools/biotoolsschema (second link).

joncison commented 7 years ago

Oh, I'm not sure about including <collection>: many tools, in fact, do not belong to a collection?

thanks Hans for spotting '>>' (fixed)

hmenager commented 7 years ago

The whole page looks fine to me, but there is a degree of ambiguity between two names used for the same thing: software type and tool type. It would probably be more consistent to use one name; otherwise people might think these are two distinct pieces of information.

joncison commented 7 years ago

Thanks ... will fix. Bearing in mind the standards (minimum, silver, gold) will set curation priorities, are the right attributes listed at the right level? Are there attributes defined in the schema (including enum options) that should be in the standard, but which are not currently listed? cc @matuskalas also for his inputs.

ekry commented 7 years ago

I've given it some thought and here are my points, in no particular order:

joncison commented 7 years ago

Thanks, some really good points there. For now (until we hear from others a bit more) I'll just pick up on a few points:

:: "We should rethink what the exact purpose of this is, and design it to fit this purpose. " To my mind it's both purposes you mentioned:

I agree it would be really nice to combine it with a LinkedIn-style completeness; we have 3 levels currently (actually 4 if you include "sub-minimum") and we could add 1 more; this would yield 5 levels or 5 stars. In this case, we could replace "minimum", "silver", and "gold" with a simple 5-star rating:

Completely agree with quality requirements, which we can develop once we settle the basic standard / attributes. It could also include rules like "EDAM annotations cannot be top-level concepts".
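
As an aside, a minimal sketch of what such a check might look like, assuming entries in bio.tools-style JSON; the set of "top-level" EDAM URIs listed here is illustrative, not a definitive list:

```python
# Sketch: flag EDAM annotations that are too generic to carry real information.
# The set of "top-level" concept URIs below is illustrative only.
TOO_GENERIC_EDAM = {
    "http://edamontology.org/topic_0003",      # "Topic" (root of the topic branch)
    "http://edamontology.org/operation_0004",  # "Operation" (root of the operation branch)
    "http://edamontology.org/data_0006",       # "Data" (root of the data branch)
    "http://edamontology.org/format_1915",     # "Format" (root of the format branch)
}

def generic_annotations(entry):
    """Return the URIs of top-level EDAM annotations found in a bio.tools-style entry."""
    found = []
    for topic in entry.get("topic", []):
        if topic.get("uri") in TOO_GENERIC_EDAM:
            found.append(topic["uri"])
    for function in entry.get("function", []):
        for operation in function.get("operation", []):
            if operation.get("uri") in TOO_GENERIC_EDAM:
                found.append(operation["uri"])
    return found
```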

Cool URLs / IDs are important! (at least important enough to be mindful there is no grot). The requirement could simply be "tool ID (from name) has been manually inspected" (i.e. what we've already been discussing / doing for existing entries)

joncison commented 7 years ago

Quick update from discussion with @ekry :

krab1k commented 7 years ago

I think the 'gold standard' misses the point. It mixes the characteristics of a tool with the characteristics of its annotation. We should not be ranking the tool, we should be ranking the annotation. A tool with no issue tracker should not be punished by our standard. I.e. I believe that every bioinformatics tool that exists, in its current form, should be able to have a gold standard annotation. Right now, an amazing tool with no issue tracker or mailing list will not be able to have a gold standard annotation. I think the gold standard annotation should be awarded to tools that have a complete annotation of the important attributes + minor attributes, such as maturity, language etc.

My thoughts on Gold standard exactly.

Instead of Repository, Mailing list and Issue tracker, which some tools don't even have, I would include more useful fields (from the point of view of a bio.tools user) like Operating systems, Cost/Accessibility and Maturity. These can be provided for all tools and to me are even more useful for searching, one of the main purposes of bio.tools (tools for Linux vs. tools with a mailing list?).

Data formats are really difficult to annotate (as we know from our CZ experience as curators), so I would move them to Gold; Data types (I/O) seem to be fine in Silver (or some ?-star equivalent).

joncison commented 7 years ago

@ekry & I spoke about this in person, conclusion was we need to clarify what's meant by '?', tick and cross and in respect to the standard. e.g. for issue tracker

So e.g. "gold standard" could allow a cross or a tick, but not '?' (the implication is the bio.tools UI would need to support positive statements that such-and-such attribute is not available)

In general, I tend to agree: the standard should reflect annotation quality rather than tool or project "quality" (although of course the first can give an indication of the latter two)

matuskalas commented 7 years ago

I agree with pretty much all what @ekry wrote 👍

@joncison, could we please put https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement into a Google Doc instead, so that people can comment on concrete points and suggest additions or edits directly? This discussion has become unwieldy, as the issue is very multi-faceted.

joncison commented 7 years ago

OK, I'll try to do that once I make a revision based on the entire thread above, but first I want to hear from the EE guys (cc @Ahto123 !) and the SIB guys

matuskalas commented 7 years ago

First, the 2 most obvious criticisms:

Sorry for sounding mostly negative, but these are the issues that need to be fixed before going out with it, in order not to make the bio.tools project (& us) look insane.

Otherwise I'm super positive to the idea of developing an info standard, but it must be done properly.

joncison commented 7 years ago

The idea for the (levels in the) standard was not to assign a score, but rather to list definite attributes that have to be specified (or alternatively, to have registered a definite statement that such a thing is not available). Bear in mind a key purpose is to guide the curator's hand, such that they know what attributes they must specify when improving an entry.

Scoring an entry, e.g. to get to a LinkedIn-style entry completeness, could be a nice complement (and would need attribute weights).

Agreed - the standard must respect the tool types (we had to start somewhere): it would be good to get a list of key attributes in each case - to use also in the emerging Curators Guide (http://biotools.readthedocs.io/en/latest/curators_guide.html)

hhbj commented 7 years ago

I'd like to propose that we include an attribute called "version history" that would provide the following information: the year of the 1st version of the entry, the year of the current version of the entry, and the years of versions in between the two, if those were big enough to merit a publication (a link to the publication should be provided too).

This attribute - version history - should be provided in order to be assigned the 'gold standard'.

joncison commented 7 years ago

Severine says .. "for me there is a confusion between the quality of the annotation and the quality of the resource itself. For instance: "Type of input & output data", "Supported data formats", as well as "Publication" are for me indicators of a high quality resource annotation: They are intrinsically related to the quality of the annotation, whereas "Issue tracker", "Mailing list", "Repository" or "Documentation" are related to the quality of the resource itself.

In other words, I can spend hours describing a resource and end up with a very high quality annotation, but, if, for whatever reason, the resource provider did not set up a mailing list, then my annotation will only be considered as "Silver". I think that the annotation quality criteria should only be related to the annotation process itself, not to the resource."

Jon says ... "this underlines a point made above and it will be addressed in the next iteration. Quite simply, we can insist on such things as “Issue tracker” being annotated, but only if in fact they are available. The implication is we need bio.tools UI to support positive statements that such-and-such a thing does not exist."

joncison commented 7 years ago

:: "year of 1st version of entry " I think this is already covered by existing Added field

:: "year of current version of entry" I think this is already covered by existing Updated field

The existing Version field can be used to annotate the specific version of the tool that is described by the entry.

In addition, the plan indeed is to allow association of a version ID with a publication ID (i.e. treat these as a couplet), at least in 1st instance before doing something more sophisticated for tool versions.

joncison commented 7 years ago

UPDATE

An iteration of the proposed standard, where I try to capture all points above and other conversations: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement

In summary, "gold" etc. replaced by 5 star rating, (where 0 stars == sub-minimal):

I'm not sure I really like dropping Operation but it's food for thought ...

Crucially, attributes are only required if they are, in fact, available, the implication being (@ekry !) that we need to support positive statements in the bio.tools UI that e.g. an issue tracker is not available. There are also type-specific requirements.
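
To make the "required only if available" logic concrete, here is a rough sketch of how a tier could be computed; the attribute names, tier contents and the "not_available" bookkeeping field below are illustrative placeholders, not the proposed standard itself:

```python
# Sketch: compute the highest tier an entry satisfies. A requirement is met
# either by a real value or by an explicit "not available" statement.
# Tier contents and field names are illustrative placeholders only.
TIER_REQUIREMENTS = {
    1: ["name", "description", "homepage", "topic"],
    2: ["tool_type", "operation", "publication"],
    3: ["documentation", "license", "data"],
    4: ["format", "repository"],
    5: ["cost", "accessibility", "maturity"],
}

def satisfied(entry, attribute):
    """True if the attribute has a value, or the entry explicitly states it is not available."""
    has_value = entry.get(attribute) not in (None, "", [])
    # "not_available" is a hypothetical field holding explicit "does not exist" statements.
    marked_unavailable = entry.get("not_available", {}).get(attribute, False)
    return has_value or marked_unavailable

def star_rating(entry):
    """Tiers are cumulative: stop at the first tier whose requirements are not all met."""
    stars = 0
    for tier in sorted(TIER_REQUIREMENTS):
        if all(satisfied(entry, a) for a in TIER_REQUIREMENTS[tier]):
            stars = tier
        else:
            break
    return stars
```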

There's an open question as to whether manual inspection / validation of (all? specific?) attributes at (all? specific?) tiers is needed: my vote is probably that, to get any star rating at all, inspection by bio.tools admin (or another trusted party) is needed; I don't see any other reliable way of assuring quality in every case ... thoughts?

Great to now have more comments from everyone on this version ...

PS. See also the emerging Curators Guide: http://biotools.readthedocs.io/en/latest/curators_guide.html#

comments on it are welcome in a separate thread: https://github.com/bio-tools/biotoolsDocs/issues/6

jvanheld commented 7 years ago

I find it strange that "Operating system" and "Language" (I guess this means "programming language", not the human language of the interface) are at the "gold" level. These are easy-to-fill pieces of information, and of primary importance for a potential user to know if the tool is usable for them (at least for stand-alone tools).

In addition, some tools are distributed in the form of libraries (e.g. Python or R libraries).

Should "operating system" and "language" not be at the "minimal" level (one star), at least for stand-alone tools (command-line or with a specific GUI) and for libraries?

Cheers

Jacques

martavillegas commented 7 years ago

Hi, Sorry, I'm a bit late ..

From a text mining perspective, language is crucial, but it refers to the language of the input texts (users need to know if a given tool is for English or French texts).

Another important requirement is the possibility to have sample input/output; this allows (i) a quick execution for testing purposes and (ii) exact knowledge of what the tool does.

I know these things are not in the schema....

hansioan commented 7 years ago

I think footnotes 6 and 7 might be a little bit too specific:

The second point I want to mention is that, where we have attributes that might not apply to all tools, we need to have a placeholder annotation (e.g. N/A) to indicate that the attribute has been inspected and marked as not applicable, rather than just missing information.

The final point is that, technically, a tool can jump from 2 stars to 5 stars by just adding Cost, Accessibility and Maturity because all the others are 'if applicable' conditioned. I don't know if this is a big problem, but it might be a way in which one can 'exploit the system' to reach 5 stars.

joncison commented 7 years ago

A few notes from dev meeting this morning (please @hansioan @ekry @baileqi @Ahto123 fill in any gaps):

  1. use of stars is no good, as people commonly understand this to be a user rating: this emphatically is the wrong impression: we need something else
  2. for whatever labeling we go with, we must convey that 1-star is a good thing, maybe by appending e.g. "1 cherry (Good!)", "2 cherries (Great!)" etc.
  3. certain attributes (namely everything currently in 1-star rating, plus the EDAM annotations) really define the core of the standard, i.e. those attributes that must be specified at each level: for all other "misc attributes" it's somewhat arbitrary at what tier they should be listed. We could contemplate:

I think I favour option 2 but would like to hear what others think

  1. we shall apply whatever metrics to the data, and expose these in a progressive way (e.g. first internally only, then to providers only, then public, then public allowing sorting, or whatever)

dabrowskiw commented 7 years ago

Hi!

I hope I'm not too late to the party, but I'd personally like to also see the container and CWL fields at least in "gold" - apart from automated benchmarking I think that these things, especially the availability of a downloadable container, can be of interest for many users. "Source code" is nice, but the way from "I have the source" to "I can run it" can be long at times.

In general, I'm not a huge fan of giving a "star" rating based on simply the number of additional attributes that have been provided, since that does not really give, at a glance, a feeling for what kind of metadata is available - and that, for me, should be the goal of this kind of rating. I cannot imagine a use-case where I am looking for tools that have "at least 5 metadata-fields filled out".

An alternative suggestion: There could be, instead of stars or cherries, groups of metadata fields. Examples could be "Code availability" (this could be the fields "Repository", "Source code", "License"), "Documentation" (e.g. "API doc/spec", "General documentation", "Supported data format"), "Community" (e.g. "Issue tracker", "Mailing list"). And whenever all fields in a group are filled out, the tool gets the associated icon (or maybe if only one is filled, it gets the bronze version of the icon, if more the silver version, and if all the gold version). In that way, the completeness of metadata concerning specific topics of interest would be immediately visible, and there would still be a kind of "x stars" rating in the way of having a certain number of icons, but it would not try to sort these into tiers, working around the problem that for misc attributes, it's arbitrary at what tier they should be listed. So you could kind of have "star 4" without first having to have "star 2" and "star 3".
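
For illustration, a rough sketch of how such per-group badges could be computed; the group names follow the examples above, but the field keys are hypothetical:

```python
# Sketch: award a bronze/silver/gold badge per thematic group, following the
# suggestion above (one field filled = bronze, more = silver, all = gold).
# Group and field names are illustrative only.
GROUPS = {
    "Code availability": ["repository", "source_code", "license"],
    "Documentation":     ["api_documentation", "general_documentation", "data_format"],
    "Community":         ["issue_tracker", "mailing_list"],
}

def group_badge(entry, fields):
    filled = sum(1 for field in fields if entry.get(field))
    if filled == len(fields):
        return "gold"
    if filled > 1:
        return "silver"
    if filled == 1:
        return "bronze"
    return None  # no badge for this group

def badges(entry):
    """Return the badge (if any) earned for each thematic group."""
    return {name: group_badge(entry, fields) for name, fields in GROUPS.items()}
```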

Just my two cents ;)

joncison commented 7 years ago

Not at all, inputs appreciated! I really like the idea of splitting out the "misc attributes" into thematic groupings - esp. because it chimes with the idea of metrics reflecting project maturity also. If we did that, it would leave the information standard with core attributes that are mandatory at different tiers.

@ekry ? @Ahto123 ? @hansioan ? Everyone?

veitveit commented 7 years ago

Agreed, as such a grouping could help to deal with the overlap between metric and information measures. With a similar system for the metric and information standard, an interested person can then decide for themselves which grouping they consider relevant for their assessment.

joncison commented 7 years ago

UPDATE

A new iteration is available: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement

Highlights

Response to specific points

:: I find it strange that "Operating system" and "Language" (I guess this means "programming language", not the human language of the interface) are at the "gold" level. "Operating system" and "Language" have been moved down.

:: Should "operating system" and "language" not be at the "minimal" level (one star) at least for For stand-alone tools at least (command-line or with specific GUI) and for libraries ? The minimal (now "OKAY") level is intentionally very minimal and includes only generic attributes, i.e. that apply to all types of tool.

:: From text mining perspective, language is crucial but it refers to the language of input texts (users need to know if a given tool is for English or French texts). This is more a concern of the schema itself, please post to https://github.com/bio-tools/biotoolsSchema/issues/new

:: Another important requirement is the possibility to have sample input/output; this allows (i) a quick execution for testing purposes and (ii) exact knowledge of what the tool does. Sample I/O are sort-of supported via <download><type>Test data</type> and <download><type>Biological data</type>, see https://bio.tools/schema#Link5B. This could perhaps be improved, e.g. by adding types for "Sample input" and "Sample output": please request this via https://github.com/bio-tools/biotoolsSchema/issues/new
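
For illustration only, a hypothetical fragment of a bio.tools-style entry expressing sample I/O via the existing download types (the field names and URLs are made up; check the live schema for the exact structure):

```python
# Hypothetical fragment: sample input/output published via the existing download types.
entry_fragment = {
    "download": [
        {"type": "Test data",       "url": "https://example.org/mytool/sample-input.fasta"},
        {"type": "Biological data", "url": "https://example.org/mytool/sample-output.txt"},
    ]
}
```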

:: The final point is that, technically, a tool can jump from 2 stars to 5 stars by just adding Cost, Accessibility and Maturity because all the others are 'if applicable' conditioned. I don't know if this is a big problem, but it might be a way in which one can 'exploit the system' to reach 5 stars.

:: I think footnotes 6 and 7 might be a little bit too specific: We need to document exactly what attributes are applicable to what tool types: if you'd like to take a shot at this, please add a table to https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement.

:: The second point I want to mention is that, where we have attributes that might not apply to all tools, we need to have a placeholder annotation (e.g. N/A) to infer that the attribute has been inspected and marked as not applicable , rather than just missing information.
This is essential (cc @ekry)

:: like to also see the container and CWL fields at least in "gold" - apart from automated benchmarking I think that these things, especially the availability of a downloadable container, can be of interest for many users. These are now captured in the "Downloads" grouping.

:: An alternative suggestion: .... I think the groupings (with "At least one" etc. logic) mitigate somewhat the arbitrariness, e.g. any type of tool could/should have some Documentation.

:: With a similar system for metric and information standard, an interested person then can decide by their own which grouping they consider relevant for their assessment. I think the stage is set for specific labeling for groups, should we want this in the future.

To do

osallou commented 7 years ago

how do you specify "not being available"?

joncison commented 7 years ago

Technically this is not settled, but it would require a tweak to biotoolsSchema, the corresponding bio.tools backend and JSON model, and the bio.tools UI. A couple of options spring to mind; rather than discuss them here, I created a separate thread.

Thanks!

hansioan commented 7 years ago

I think this current iteration looks good. A few points from me:

dabrowskiw commented 7 years ago

I think this looks really nice. I would just, in addition to the smilies, separately show whether the requirements for the groups (Documentation, Accessibility etc.) are fulfilled. That way, metadata availability would not be masked by the smiley levels - e.g. in the current case, it would not be relevant/visible for the rating whether binaries and code are available if the documentation is lacking.

magnuspalmblad commented 7 years ago

I am happy with the general direction and the proposed standard draft! The priority of requirements is better than an overall percentage of total completeness, as at least some of the information required for "Good" is essential to be able to do anything at all, while other information is far less critical. The ordered list also suggests a workflow for annotators.

Just a minor comment on Matúš' comment above that licensing information is irrelevant for web apps. This may be true for many of them, but certainly not all. Sometimes there are licenses attached to the output, e.g. ProteomicsDB or the underlying cartography in many geoparsers. Having an optional N/A in the annotation of an individual tool should be sufficient.

M.

hmenager commented 7 years ago

The whole effort is really nice, and the information model is good. What is the difference between "mandatory" and "okay" attributes? Does one have to specify all okay attributes to create an entry? It seems to be so, and I'm worried that asking for e.g. a publication might be an unnecessary block to the registration of entries - some good tools have no publication, but more importantly some might be registered with a "pending" publication.

osallou commented 7 years ago

I think that publication is a mandatory entry but that entry can be 'not available'. However I agree that a publication may also never be available. And especially for a publication, 'not available' could mean 'there is none' or 'I don't know' (if filling in the entry but not being the tool author).

Olivier


joncison commented 7 years ago

Thanks folks, again, for some useful comments. For now I just pick up on:

:: What is the difference between "mandatory" and "okay" attributes? Does one have to specify all okay attributes to create an entry? It seems to be so, and I'm worried that asking for e.g. a publication might be an unnecessary block to the registration of entries - some good tools have no publication, but more importantly some might be registered with a "pending" publication. The current min. info. requirement is somewhere between "OKAY" and "GOOD" (but without the verification of attributes as per the Curation Guidelines). I imagine in future it will be set to "OKAY" but we can of course ask for more.

The test for publication (indeed all attributes labelled with an asterisk) can indeed be passed if specified as "Not available", the intended meaning being the annotator is fairly sure it doesn't exist, ... we'll have to make this clear in the UI somehow, and check it's not being (mis|ab)used

More generally, there's been an ongoing discussion re flexibility about the type of publication asked for, e.g. for BioConductor packages, the main BioConductor publication would be OK. An online manual might even be allowed (but then we'd have to support specification of a URL). The idea is, in the Autumn, to review what lacks a publication, then make a decision.

Speak to some of you at 11AM CE(S)T today.

matuskalas commented 7 years ago

Pasted from the meeting chat window, 2017-06-16:

Henriette Husum Bak-Jensen [11:01 AM]: Please continue to bring your issues and comments on the topic to us via this link - throughout the meeting https://github.com/bio-tools/biotoolsSchema/issues/77

Jon Ison [11:01 AM]: I can

Piotr Chmura [11:03 AM]: I'm here

Piotr Chmura [11:04 AM]: yeah I can

Jon Ison [11:07 AM]: https://github.com/bio-tools/biotoolsSchema/issues/77#issuecomment-308868677

Jon Ison [11:09 AM]: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement

Matus Kalas [11:17 AM] (publicly): The idea of manual verification is both unscalable and unfair. Sounds like 1. Waste of resources and 2. Soviet-style censorship

Piotr Chmura [11:18 AM]: There was an idea of side-project of machine-learning-driven autocurator

Piotr Chmura [11:18 AM]: there might be some funding for it

Piotr Chmura [11:18 AM]: interested?

Matus Kalas [11:18 AM] (publicly): In addition, there is a danger of the verification process being of lower quality than the annotation process

Matus Kalas [11:20 AM] (publicly): If there are such resources, they should be used for improving the annotation, merging entries, assigning owner and editors, in summary "curation" or "administration". Not "censorship" adding labels

Matus Kalas [11:22 AM] (publicly): Instead of adding labels, curators could contact the annotators and encourage them to improve certain information

Matus Kalas [11:23 AM] (publicly): @ekry & all: We need "Does not exist" for some attributes

Emil Rydza [11:23 AM]: I agree

Matus Kalas [11:23 AM] (publicly): But shouldn't be overused

Matus Kalas [11:24 AM] (publicly): The major problem there is, that things like publications, repositories, or mailing lists may appear later, rendering the annotation severely wrong

Matus Kalas [11:25 AM] (publicly): That's the major danger

Piotr Chmura [11:25 AM]: then we update it

Piotr Chmura [11:25 AM]: or the owners do

Piotr Chmura [11:25 AM]: or people using it

Emil Rydza [11:25 AM]: the community essentially

Matus Kalas [11:25 AM] (publicly): Or not

Piotr Chmura [11:25 AM]: if they care about it, the information will reach us

Matus Kalas [11:26 AM] (publicly): Still, giving high score for outdated and wrong information, as opposed to lacking information, is disputable

Piotr Chmura [11:30 AM]: hence the idea for machine learning autobot that would periodically gather it from the internet

Hans Ienasescu [11:31 AM]: sorry I am late

Henriette Husum Bak-Jensen [11:39 AM]: N/A and NONE and unknown and Needs updating ARE ALL DIFFERENT INFORMATION

Matus Kalas [11:52 AM] (publicly): Some obvious quantitative measures: amount of information (most simply number of JSON/XML nodes); last modified (time since); last new version (time since); last scientific publication (time since)

Piotr Chmura [11:53 AM]: Not to be the grumpy one, but we tried to present those exact ones on some meetings and were trashed for it

Piotr Chmura [11:53 AM]: each of them being deemed less than informative, to put it mildly

Piotr Chmura [11:54 AM]: we are using them to detect weird entries

Piotr Chmura [11:54 AM]: e.g. too many topics vs avg/mean amount for similar tools

Matus Kalas [11:54 AM] (publicly): I agree with Emil

Matus Kalas [11:54 AM] (publicly): Change may not hurt if it's well-designed

Matus Kalas [12:04 PM] (publicly): I strongly disagree with both the verification stamp completely, and with greying something out after it's updated. The only option I could accept, is having that as either a hidden or not-too-visible information aimed at curators and helping the curators.

Matus Kalas [12:27 PM] (publicly): A good idea is to add additional information such as: "Information verified by author", "Information verified by annotator", "Information verified by curator"

Hans Ienasescu [12:28 PM]: I agree with Matus' above point

Matus Kalas [12:28 PM] (publicly): Also, too-generic EDAM concepts should be discouraged by both the qualitative and quantitative metrics

Matus Kalas [12:32 PM] (publicly): And perhaps also "Adherence to the annotation guidelines verified", all of those with a timestamp and user name. And there should be a link to the annotation guidelines clearly visible inside the annotation GUI

With the "Verified" stamps, we need to clearly (for end user especially) distinguish between adherence to annotation guidelines VERSUS adherence to reality i.e. validity of the information.

joncison commented 7 years ago

My conclusions / notes from the meeting this morning (NB: these are not meeting minutes):

Applicability versus availability of information

Whether an attribute is applicable to a type of tool (i.e. as indicated by asterisks in current proposal) is for bio.tools to define.

Whether an attribute is available is for the editor of an entry to specify. We must assume the integrity of our contributors, who will not misuse this feature (but check and take remedial action as necessary).

@matuskalas agreed to provide a matrix of tool types versus attributes indicating applicability, with inputs from @hansioan. This should ignore obscure edge cases: even if an attribute is not (normally) applicable, it can still be specified during registration.

Date stamping "not available" annotations

It is useful to know (certainly by bio.tools admin, and maybe users) when e.g. a publication was stated as being "Not available". This also goes for license, and probably (but less strongly) for other annotations (all annotations that can be annotated as "Not available"? all annotations?)

This could help inform periodic updates of the registry. We wouldn't want to clutter the UI with such information though.
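
Purely as an illustration of the idea (the actual representation is not decided, and these field names are made up), a date-stamped "Not available" statement might look something like:

```python
# Illustrative only: date-stamped "Not available" statements, so that periodic
# re-checks of the registry can target the oldest statements first.
not_available = {
    "publication": {"status": "Not available", "stated_on": "2017-06-16"},
    "license":     {"status": "Not available", "stated_on": "2017-06-16"},
}
```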

Consideration of tool type information

The matrix of tool types::attributes (applicability) - mentioned above - is necessary and sufficient for the current proposed model. It also lays the foundation for a more sophisticated, tool-type specific standard in the future which could, in principle, better reflect the specific information requirements of each type of tool.

Number of tiers in model

@matuskalas suggested 5 tiers is too many. My opinion is that 5 is OK, considering that the 1st ("NEEDS TO IMPROVE") I hope will not be much used, and that we need a few tiers to motivate incremental improvement of entries ("curation as a game")

Stability of model

The model will necessarily need to change (a bit at least) in years to come, but we must handle change very carefully, especially to ensure e.g. that an entry labelled today as "EXCELLENT" does not suddenly become "NEEDS TO IMPROVE" (in such cases, entries probably should be improved retrospectively before applying any new standard).

Restricting potentially "breaking" changes to once per year (i.e. as per biotoolsSchema) seems sensible.

Labeling entries

To clarify, two types of label will be applied:

Verification of entries

This (c|w)ould be toxic if carefully annotated entries do not get a "VERIFIED" stamp; on the other hand, labeling is very important to identify / reward such high quality curation efforts. The verification burden should not be purely on bio.tools admin (to avoid this being a blocker). The solution is to allow anyone (but especially bio.tools admin, Thematic Editors and other trusted curators) to verify entries. The requirement is effective tooling for this, such that a curator can tick off guidelines that have been satisfied (we have a mock-up, but it's not yet public)

A verification date is needed and also a way to indicate whether the entry has been updated since it was verified (e.g. by changing the colour of the stamp).

Other

A number of other points were raised not directly related to the standard:

cossorzano commented 7 years ago

Our only comment is whether this standard supports an operation that is performed using two different programs in sequence. For example, does the code availability attribute allow describing both programs?

hhbj commented 7 years ago

The final minutes of the WP-1 centered bio.tools status meeting on INFORMATION STANDARDS: Attendees: WP-1 partners, of which the following were present: Anne Wenzel, Emil Rydza, Hans Ienasescu, Jon Ison, Matus Kalas, Piotr Chmura, Severine, Henriette Husum Bak-Jensen.

Apologies: Vivi Raundahl Gregersen, Hedi Peterson, Veit Schwämmle, Vivi Gregersen, Ahto Salumets, Salva, Hervé Menager

Minutes

The goal of today's meeting was to go over the proposed standards for tool entries in bio.tools (see https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst ). The minutes also capture fundamental concerns that prompt consideration before launching the standards. Several comments were made in the meeting chat and issues were also brought up. Those can be found here https://github.com/bio-tools/biotoolsSchema/issues/77 and more can be added after the meeting, please. The main points - constructive discussion points and action points - at the meeting were the following:

The idea of 'revising the standards on an annual basis' is challenging

Four standard tiers/labels are contemplated (OKAY, GOOD, VERY GOOD, EXCELLENT) that are all of 'acceptable' quality. A fifth label (NEEDS TO IMPROVE) is for entries which lack basic information. Each label is associated with a set of attributes. The set of attributes required to earn a label, or the list of allowed sub-domains to tick a particular attribute, could in principle be changed if practical experience shows it would be valuable. And so we envision revisiting, with caution, the set of four (five) standards on an annual basis, with input from the community. BUT, by all means, any future change in the standards must not deprive a tool of an 'earned' label, or lead to a 'greying out' of an annotation void. Rather, such changes should apply to future earning of labels, and be presented in the 'background guide info for curators' for verified-label tools that now need more annotation work.

Annotation of Not applicable, None exists, Unknown, and Need Updating

These terms are all valuable information, and should be carefully and individually assigned as annotation options, for all attributes. MK made the point that distinct tool types warrant a distinct set of attributes – in order to avoid numerous 'not applicable' annotation results. It was agreed that MK will draft a matrix (tool types vs attributes) that will help decide if some tool types should indeed be assigned a distinct set of attributes, and if not, at least will help capture the adequacy of annotating 'not applicable' for a given tool attribute.

Annotation metrics – assessing quantitative measures of the quality of input

This point was made by MK and concerns assessing the registry's total amount of annotated information for a given attribute. Other obvious quantitative measures include amount of information (most simply, the number of JSON/XML nodes); last modified (time since); last new version (time since); last scientific publication (time since). This will help us monitor the overall progress on quality of the registry as a supplement to tracking the number of users and number of entries (quantity).
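
A minimal sketch of the simplest of these measures (purely illustrative; the function names are made up and the timestamp is assumed to be ISO 8601 with a timezone):

```python
from datetime import datetime, timezone

# Sketch: "amount of information" as the number of nodes in an entry's JSON
# representation (dicts, lists and leaf values each count as one node).
def count_nodes(value):
    if isinstance(value, dict):
        return 1 + sum(count_nodes(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_nodes(v) for v in value)
    return 1  # leaf value (string, number, bool, None)

# Sketch: "time since" measures, e.g. days since last modification.
# Assumes an ISO 8601 timestamp with timezone, e.g. "2017-06-16T12:00:00Z".
def days_since(iso_timestamp):
    then = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - then).days
```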

Date-stamps

The annotation 'None exists' should be date-stamped, because it may be relevant to update the information. The annotation 'Not applicable' should not be date-stamped, because the date will never be relevant information.

Verification of labels

Several arguments were made for and against a date-stamped verified label for a given tool, in particular if we're dealing with manual verification of earning a given label: 1) this could be seen as censorship by the developers, which would counteract their willingness to simply supply the best possible annotation/information on a given tool, 2) it is labour-intensive and possibly old-fashioned (not Wiki-like), 3) there is a danger that the verification process is of lower quality than the annotation process itself. On the other hand, the end user may better trust a manually verified, date-stamped label. We need to consider the needs of developers (the best providers of info) and of the end-user (trust issue), including the possibility of developing a machine-learning-driven autocurator (Action: Piotr Chmura). It is possible that the resources spent on verification would be better spent on improving the annotation.

joncison commented 7 years ago

UPDATE

Various minor tweaks have been made to the information standard: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst#information-requirement

I think this is a “respectable beta” and implementation in bio.tools will now proceed.

Replies to specific points (not already addressed) above follow:

:: The final point is that, technically, a tool can jump from 2 stars to 5 stars by just adding Cost, Accessibility and Maturity because all the others are 'if applicable' conditioned. I don't know if this is a big problem, but it might be a way in which one can 'exploit the system' to reach 5 stars.

Operation (always applicable) is now mandatory at tier 3. More generally, we have to assume honesty. To get the "Verified" stamp an entry will have to be manually verified, and the name of the verifier should be rendered next to the stamp (along with the date) - this should discourage abuse.

:: how do you specify "not being available"? Under discussion generally (https://github.com/bio-tools/biotoolsSchema/issues/82) and specifically for publications (https://github.com/bio-tools/biotoolsregistry/issues/200): please join those discussions.

:: GOOD: we should start with a positive statement before the "would benefit part". Example: The entry is annotated at a decent/good information level/standard, but .... VERY GOOD: same as GOOD. Example: The entry is annotated at a superior information level/standard, but … EXCELLENT: same as above.

I take the point, but there's a limit to what can be squeezed into a useful infographic and besides, this doesn't add to what's already said by "GOOD", "VERY GOOD" and "EXCELLENT".

:: I think this looks really nice. I would just, in addition to the smilies, separately show whether the requirements for the groups (Documentation, Accessability etc.) are fulfilled. That way, metadata availability would not be masked by the smiley levels - e.g. in the current case, it would not be relevant/visible for the rating whether binaries and code are available if the documentation is lacking.

This needs more thought, and we should try to define what metrics / labels would be useful including capturing some things from the groups mentioned. Let’s take this discussion elsewhere, e.g.

:: What is the difference between "mandatory" and "okay" attributes? Does one have to specify all okay attributes to create an entry? It seems to be so, and I'm worried that asking for e.g. a publication might be an unnecessary block to the registration of entries - some good tools have no publication, but more importantly some might be registered with a "pending" publication.

We’ll need to settle two things; i.e. what level in the standard is required:

i.e. not all entries will be visible, at least by default (possibly set by users)

Publication is marked as “if applicable and available”, i.e. a tool can be registered without one but the user will need to specify explicitly it is not available.

:: However I agree that a publication may also never be available. And especially for a publication, 'not available' could mean 'there is none' or 'I don't know' (if filling in the entry but not being the tool author).

We’ll need to spell out (in the UI, everywhere) that “Not available” means “to the best of my knowledge the tool has not been published”

A footnote: in the next release ~95% of entries will have at least one publication ID (often an overarching publication, e.g. for BioConductor packages the main BioConductor paper is OK)

:: Our only comment is whether this standard supports an operation that is performed using two different programs in sequence. For example, the code availability allows describing both programs?

This is more an issue of modelling in biotoolsSchema (https://github.com/bio-tools/biotoolsschema). In the case of workflows, I’m not sure that the “Source code” attribute would be applicable.

:: Annotation of Not applicable, None exists, Unknown, and Need Updating. These terms are all valuable information, and should be carefully and individually assigned as annotation options, for all attributes.

Not exactly.

:: Several arguments were made for and against a Date-stamped verified label of a given tool... .

I’m not sure the arguments there hold water, given that 1) anyone will be able to verify entries and 2) without some verification process, the standard could only ever describe entry completeness and not quality.

joncison commented 7 years ago

Heinz says ... "event the "excellent" label is only "quite" good. Gives the impression that the work is never finished :-) It needs to be carefully communicated so users of the portal don't confuse labelling of the entry in bio.tools with quality of the resource/tool that is annotated. An "OKAY"-labeled resource can very well be a perfect resource ... users usually care more about the latter"

joncison commented 6 years ago

In light of discussions at the DKBC conference in DK (Aug) I made a revision (https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst)

This is sufficient to calculate "entry completeness" on bio.tools as it stands today (cc @hansioan, @ekry @hhbj )

Also agreed to take a shot at per-attribute standards, e.g. what constitutes a "GOOD" description etc. This is a work in progress.

joncison commented 6 years ago

Folks, another version now at https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst

This time providing major clarifications around possibilities for labelling, separating out the notions of "entry completeness", "valid syntax", "manually inspected" and "conformance to guidelines". The tiers now reflect an aggregation of these things (look and you'll see)

joncison commented 6 years ago

UPDATE

Following various discussions we can now settle a respectable first beta version for the tool information standard (https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_requirement.rst). Small but significant changes include:

In the 1st instance, it will be applied to bio.tools in two ways:

Whether / how to label entries can wait for later.

biotoolsSchema (https://github.com/bio-tools/biotoolsschema) will be revised to constrain at the level of "SPARSE" (https://github.com/bio-tools/biotoolsSchema/issues/88)

Good to get some feedback, because there have been many changes in the last few months.

joncison commented 6 years ago

UPDATE

Various improvements to make the standard simpler and more usable:

The big change is that "not available-ness" of attributes is now much more tractable (see https://github.com/bio-tools/biotoolsSchema/issues/82) and formal specification of "applicable-ness" is simply no longer required.

cc @hansioan (but everyone, please comment) ...

joncison commented 6 years ago

Dear @hhbj, @cossorzano, @matuskalas, @osallou, @magnuspalmblad, @dabrowskiw, @hansioan, @martavillegas, @jvanheld, @krab1k, @ekry

Hans et al have been very busy improving the bio.tools content according to our "respectable beta" information standard for tools: https://github.com/bio-tools/biotoolsSchemaDocs/blob/master/information_standard.rst

Next month, I plan to start writing up the work on the definition of the standard, its application to bio.tools and the broader context as a little publication, possibly in Briefings in Bioinformatics or maybe F1000. If you'd like to be involved with that publication, please email me now (jison@bioinformatics.dtu.dk) and I'll provide a link to the Google document for the article.

For now, I'm closing this issue. Thanks for all the work thus far!