NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Provide Versioning for Data #555

Open sstemann opened 11 months ago

sstemann commented 11 months ago

This came up in Michel's relay session on 9/22.

gprice1129 commented 11 months ago

@sstemann @sierra-moxon what type of issue is this? Needs more context and description.

sierra-moxon commented 11 months ago

This can be a pretty big topic, with lots of nuance. It's probably worth a small WG to put together the guidelines, but that being said, here is a reference on the F.A.I.R. principles we can follow: https://www.go-fair.org/fair-principles/. The original F.A.I.R. paper is here: https://www.nature.com/articles/sdata201618. @micheldumontier was an author, so he is likely far more qualified to speak to requirements here than I am. @vemonet also contributes to software that helps judge (I think?) the FAIRness of a data source: https://github.com/MaastrichtU-IDS/fair-enough :) - I ran the UI through a couple of the metrics: https://fair-enough.semanticscience.org/evaluations/089890842231f9ae671601650a135f9c46aa460d

Longer term, we also probably want folks to use Translator (or its components?) in workflows. Being able to reproduce those is also super important, and here's a paper on that: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011369

My limited experience in this area is from previous organizations, where we committed to some very basic data reproducibility guidelines. We wanted a user who downloaded data from us today to be able to come back in a month (or 6 months, or a year, or 10 years) and re-download those same results. If the system had moved on to a new version, then we wanted the user to be able to recreate the data they downloaded, either by retrieving it from an archive or by recreating it from documentation of how to put the system together to generate the results again.

Could we do any of the following?

1. Establish a version for "Translator" that is reflected in the header of download files. Somewhere, we should document all the versions of the underlying components that are used in an overarching "Translator version", e.g. Translator version 1.0.0, Biolink version 3.5.6, TRAPI version 1.4.0, etc. Hopefully, these versions would be reflective of the versioning used in GitHub.
2. A date that the file was downloaded - or perhaps the PK of the query would be a more reliable identifier to store long term to retrieve the same results? See #556.
3. Possibly a "bulk" download site that allows users to download the results en masse. If we go this way, we will want to keep archives of the bulk downloads, with appropriate metadata as well (e.g. when the data was created, what versions, etc.).
4. Some indication of the provenance of the data (e.g. be sure to include the infores identifiers?).
5. Reflect metadata in different ways depending on the serialization of the download (e.g. as JSON, or CSV, etc.); a sketch of such a metadata block follows below.
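To make 1), 2), 4), and 5) concrete, here is a minimal sketch in Python of what such a version block might look like. Every field name, version number, and identifier below is hypothetical, not an agreed-upon format:

```python
import json
from datetime import datetime, timezone

# Hypothetical version block for a Translator download file.
# All field names and values are illustrative, not a spec.
download_metadata = {
    "translator_version": "1.0.0",          # proposal 1: overarching release
    "component_versions": {                 # components behind that release
        "biolink_model": "3.5.6",
        "trapi": "1.4.0",
    },
    "downloaded_at": datetime.now(timezone.utc).isoformat(),  # proposal 2
    "query_pk": "hypothetical-pk",          # proposal 2 alternative, see #556
    "infores_sources": [                    # proposal 4: provenance
        "infores:example-kp",
    ],
}

# Proposal 5: the same metadata rendered per serialization, e.g. as a
# JSON header object...
print(json.dumps(download_metadata, indent=2))

# ...or as comment lines at the top of a CSV download:
for key, value in download_metadata.items():
    print(f"# {key}: {value}")
```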

gprice1129 commented 11 months ago

It sounds like this needs a lot of discussion and breaking down into smaller tasks. Nothing actionable in terms of development yet.

Genomewide commented 11 months ago

I agree with @sierra-moxon that Translator as a whole needs a task force or something for this. I don't think this is a UI issue. @sstemann, what do you think? Assume we can display or tag any version ID for an answer or result (that is provided); I feel like that is where the UI comes in. Until that is provided, I don't think it is our issue. We could provide a "release ID" of the UI, but that is it until then. IMHO, whatever it ends up being should be included in the TRAPI standard? But I am no expert on this type of thing.

sstemann commented 11 months ago

Makes sense to me - @Mathelab, do you have insight here on what you would want to see?

gaurav commented 11 months ago

The NodeNorm part of this issue is being tracked at https://github.com/TranslatorSRI/NodeNormalization/issues/218

Mathelab commented 11 months ago

Generally speaking, it does not seem reasonable to me, nor that useful, to save all results, largely because there is an effectively infinite number of queries that Translator will eventually need to deal with. That being said, some query results should probably be saved for testing/benchmarking, and for understanding how changes in the underlying data/algorithms impact results. With that in mind, it makes sense to version: 1) the underlying data used to answer queries (e.g. by the ARAs); 2) the processes used for scoring (ARAs + Appraiser + ARS). At any point, users should be able to reproduce the results by applying 1) and 2) (although realistically, this would not be trivial given the complexity of the system, but that is most often the case for any system). This would get us closer to FAIR, I think. As for the UI part of all this, it's a bit less clear to me.
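As a rough illustration of what pinning 1) and 2) could mean in practice, here is a hedged sketch of a per-result manifest. The schema, names, and versions are invented for illustration, not anything Translator has agreed on:

```python
from dataclasses import dataclass, field

# Entirely hypothetical manifest, not an agreed-upon Translator schema:
# it pins 1) the data and 2) the scoring processes behind one result set.

@dataclass
class DataVersion:
    infores: str       # which knowledge source was used (1)
    version: str       # that source's own release identifier
    snapshot_url: str  # archived copy, if one is kept

@dataclass
class ReproducibilityManifest:
    query_pk: str                                           # stable query identifier
    data: list[DataVersion] = field(default_factory=list)   # 1) underlying data
    scoring: dict[str, str] = field(default_factory=dict)   # 2) component -> code version

# Reproducing a result then means re-running the pinned scoring code
# over the pinned data snapshots.
manifest = ReproducibilityManifest(
    query_pk="hypothetical-pk",
    data=[DataVersion("infores:example-kp", "2.8.0", "https://example.org/kg/2.8.0")],
    scoring={"ara": "1.4.2", "appraiser": "0.3.0", "ars": "1.5.0"},
)
```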

gglusman commented 11 months ago

My recollection of the moment this arose during Michel's session at the relay is that the question was about whether BTE/Service Provider kept old versions of KGs when releasing new ones. For example, when I released version 1.7 of the Wellness KG via Service Provider, the previously available version became unavailable. So my recollection is that Michel expressed the need for versioning at that level, since the computed results may rely on previous versions of KGs. (Of course, this is not specific to BTE/Service Provider.)
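One common remedy for that failure mode is to keep every released KG version addressable rather than overwriting the previous one. A minimal sketch, with invented paths and a lookup that is not how BTE/Service Provider actually works:

```python
# Sketch of keeping superseded KG releases addressable instead of
# overwriting them, so results computed against Wellness KG 1.6 stay
# traceable after 1.7 ships. All paths here are illustrative.
KG_ARCHIVE = {
    ("wellness-kg", "1.6"): "https://example.org/kgs/wellness-kg/1.6/",
    ("wellness-kg", "1.7"): "https://example.org/kgs/wellness-kg/1.7/",
}

def resolve_kg(name: str, version: str | None = None) -> str:
    """Return the archive URL for a pinned version, or the newest one."""
    if version is not None:
        return KG_ARCHIVE[(name, version)]
    # Naive lexicographic "latest"; a real registry would compare
    # versions properly (e.g. "1.10" would sort before "1.9" here).
    latest = max(v for (n, v) in KG_ARCHIVE if n == name)
    return KG_ARCHIVE[(name, latest)]

print(resolve_kg("wellness-kg"))         # newest release
print(resolve_kg("wellness-kg", "1.6"))  # pinned, still retrievable
```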

Mathelab commented 11 months ago

This may be worth a look as well: https://www.nature.com/articles/s41597-023-02298-6

Mathelab commented 11 months ago

@gglusman Yes, I agree it's about traceability all the way down to the KGs. Ideally there would be a snapshot of the KGs used to generate the results, as well as the code used to produce the scores. Was there general agreement on this in that session?

codewarrior2000 commented 11 months ago

At the September Relay, a stated aspiration was for Translator to become as indispensable as PubMed. If we want to get to that point, future authors of papers should be able to cite an easy, simple link or number so that their readers can swiftly retrieve results from Translator. Therefore, that kind of reproducibility of results, one way or another, is non-negotiable.
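To make that concrete, the citation workflow this implies could be as small as resolving one stable identifier to the archived results. A minimal sketch, assuming the query PK stays resolvable long-term (see #556) and using an assumed URL pattern:

```python
import requests

def fetch_cited_results(pk: str) -> dict:
    """Resolve a cited Translator PK to its archived result set.

    The endpoint below is assumed for illustration; which long-term
    resolver to guarantee (ARS PK, DOI, ...) is the open question here.
    """
    response = requests.get(
        f"https://ars.transltr.io/ars/api/messages/{pk}", timeout=30
    )
    response.raise_for_status()
    return response.json()

# A paper would then cite nothing more than the identifier itself:
# results = fetch_cited_results("hypothetical-pk")
```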