pre-populated annotations

kthrog commented 4 years ago

This feature would pre-populate some annotations based on existing information associated with the file.

Examples:

If filename ends in .fastq.gz, then set fileFormat to fastq
Use location of files selected for annotation to populate studyID, studyName, and fundingAgency
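
As a rough illustration of the two examples above, here is a minimal Python sketch. The lookup tables, the folder name, and the annotation values are all hypothetical placeholders, and the function name `prepopulate` is made up for this illustration; it is not part of schematic or Data Curator.

```python
from pathlib import PurePosixPath

# Hypothetical suffix-to-format table; only the .fastq.gz case comes from the example above.
SUFFIX_TO_FORMAT = {
    ".fastq.gz": "fastq",
    ".fastq": "fastq",
}

# Hypothetical annotations attached to a top-level folder (placeholder values).
FOLDER_ANNOTATIONS = {
    "study_a": {
        "studyID": "A-001",
        "studyName": "Study A",
        "fundingAgency": "Example Agency",
    },
}


def prepopulate(path: str) -> dict:
    """Guess annotations from a file's name and its top-level folder."""
    annotations = {}
    posix_path = PurePosixPath(path)

    # Rule 1: the filename suffix implies fileFormat.
    name = posix_path.name.lower()
    for suffix, file_format in SUFFIX_TO_FORMAT.items():
        if name.endswith(suffix):
            annotations["fileFormat"] = file_format
            break

    # Rule 2: the file's location implies study-level annotations.
    if posix_path.parts:
        annotations.update(FOLDER_ANNOTATIONS.get(posix_path.parts[0], {}))
    return annotations


print(prepopulate("study_a/run1/sample.fastq.gz"))
```

Files that match neither rule simply get an empty dictionary, so nothing is pre-populated for them.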

It's possible this issue could have some overlap with what Robert just filed, Sage-Bionetworks/HTAN_data_curator#96.

BrunoGrandePhD commented 3 years ago

From my understanding, this request appears distinct from what Robert is requesting in #312 (moved to this repository from the Data Curator repository). Here, we want to allow some logic/rules to dynamically auto-complete values in the manifest based on other values in the manifest (or even values from outside of the manifest, such as the parent Synapse project, which often encodes information on the contributing study/site). On the other hand, #312 has to do with using Synapse annotations to pre-populate the template manifest as it's being generated.

While I agree that this would be extremely useful, I think it would be quite hard to implement, especially with the degree of flexibility that @kthrog is suggesting, which I think would be a minimum to achieve for this to actually be useful. IMHO, there are two main challenges:

Encoding the rules. For this feature to work, we would need to provide a way for users to somehow encode the rules. The JSON-LD schema would be a natural venue for this logic, but I don't know how we could capture anything more complicated than "this value here implies this value there", especially in a programming language–agnostic way. On the other hand, any logic could be encoded using a programming language like Python or R, but this leads to the second challenge.
Executing the rules. For this feature to be useful, the rules would have to be encoded in the Google Sheet template using built-in or custom functions. This way, the logic is executed as values are inputted. So, it would be hard to use custom Python and R code for this task. Admittedly, some values could be pre-populated (such as the study ID based on the file's parent folder/project), but I expect most use cases of this feature to rely on values provided in the Google Sheets. Any code that would be written to translate the rules for Google Sheets would also have to be written to support Excel, making this quite hard to support long-term.
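
To make the two challenges a bit more concrete, here is a minimal sketch of one way a simple rule could be stored as data and then translated into a spreadsheet formula. It assumes the rule is of the "value ends with this suffix" kind, that the values contain no double quotes, and that the sheet uses comma argument separators. `SuffixRule`, `to_formula`, and `evaluate` are hypothetical names. The generated formula only uses `IF` and `RIGHT`, which exist in both Google Sheets and Excel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    """Hypothetical rule: if the source cell ends with `suffix`, emit `target_value`."""
    source_column: str  # column letter of the input cell, e.g. "A"
    suffix: str
    target_value: str


def to_formula(rule: SuffixRule, row: int) -> str:
    """Translate the rule into a formula for the given 1-based row."""
    cell = f"{rule.source_column}{row}"
    length = len(rule.suffix)
    return f'=IF(RIGHT({cell},{length})="{rule.suffix}","{rule.target_value}","")'


def evaluate(rule: SuffixRule, value: str) -> str:
    """Python-side equivalent of the formula, e.g. for checking a manifest offline.

    Compares case-insensitively; case handling in the spreadsheet itself
    may differ and should be checked separately.
    """
    return rule.target_value if value.lower().endswith(rule.suffix.lower()) else ""


rule = SuffixRule(source_column="A", suffix=".fastq.gz", target_value="fastq")
print(to_formula(rule, 2))
print(evaluate(rule, "sample.fastq.gz"))
```

Keeping the rule as plain data means the same definition can drive both the formula in the template and a check run in Python, but anything more complex than this kind of rule would need a richer translation layer.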

@milen-sage, what do you think?

milen-sage commented 3 years ago

> From my understanding, this request appears distinct from what Robert is requesting in #312 (moved to this repository from the Data Curator repository). Here, we want to allow some logic/rules to dynamically auto-complete values in the manifest based on other values in the manifest (or even values from outside of the manifest, such as the parent Synapse project, which often encodes information on the contributing study/site). On the other hand, #312 has to do with using Synapse annotations to pre-populate the template manifest as it's being generated.

Yes, that's accurate.

> While I agree that this would be extremely useful, I think it would be quite hard to implement, especially with the degree of flexibility that @kthrog is suggesting, which I think would be a minimum to achieve for this to actually be useful. IMHO, there are two main challenges:

> Encoding the rules. For this feature to work, we would need to provide a way for users to somehow encode the rules. The JSON-LD schema would be a natural venue for this logic, but I don't know how we could capture anything more complicated than "this value here implies this value there", especially in a programming language–agnostic way. On the other hand, any logic could be encoded using a programming language like Python or R, but this leads to the second challenge.

I'd be curious to see if we can even capture just "this value here implies this value there" types of rules. Right now, I don't think we have a suitable relationship mechanism in the JSON-LD schema to do that, so we might need to add that on the data model specification side. On the data validation side, @BrunoGrandePhD, could you check if jsonschema can encode logic that lets you do "this value here implies this value there"? I might be wrong, but I don't think there is a pre-specified keyword/construct that does that; it is probably feasible by nesting conditional logic.
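
For what it's worth, JSON Schema draft-07 and later define `if`/`then`/`else` keywords, which look like a fit for "this value here implies this value there" on the validation side. Note that they only validate; they do not fill anything in. The sketch below shows such a schema fragment as a Python dict, plus a toy function that reads the same fragment to derive the implied values. The attribute values are hypothetical, `implied_values` is a made-up helper that only understands `const` conditions, and it is a stand-in for illustration rather than a real validator such as the `jsonschema` package.

```python
# Schema fragment using draft-07 conditionals (placeholder attribute values).
SCHEMA = {
    "allOf": [
        {
            "if": {
                "properties": {"studyID": {"const": "A-001"}},
                "required": ["studyID"],
            },
            "then": {
                "properties": {"fundingAgency": {"const": "Example Agency"}},
            },
        },
    ]
}


def implied_values(schema: dict, row: dict) -> dict:
    """Toy evaluator: return the const values demanded by every matching `then` branch.

    Only handles `const` conditions listed under `properties`.
    """
    implied = {}
    for clause in schema.get("allOf", []):
        conditions = clause.get("if", {}).get("properties", {})
        matches = all(
            key in row and row[key] == spec.get("const")
            for key, spec in conditions.items()
        )
        if conditions and matches:
            for key, spec in clause.get("then", {}).get("properties", {}).items():
                if "const" in spec:
                    implied[key] = spec["const"]
    return implied


print(implied_values(SCHEMA, {"studyID": "A-001"}))
```

If something like this works, the same rule could serve double duty: the validator rejects rows that contradict it, and the manifest generator uses it to pre-populate the dependent column.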

> Executing the rules. For this feature to be useful, the rules would have to be encoded in the Google Sheet template using built-in or custom functions. This way, the logic is executed as values are inputted. So, it would be hard to use custom Python and R code for this task. Admittedly, some values could be pre-populated (such as the study ID based on the file's parent folder/project), but I expect most use cases of this feature to rely on values provided in the Google Sheets. Any code that would be written to translate the rules for Google Sheets would also have to be written to support Excel, making this quite hard to support long-term.

Correct, this would require thinking a bit harder about the front end:

The attributes that determine values of other attributes can be presented as a form in the data curator UI prior to manifest generation (this may relate to https://github.com/Sage-Bionetworks/data_curator/issues/53, except the answers are not just yes/no, but attribute values). Could we have a config file that powers a Shiny form, where the config file is generated based on the schema rules above?
If we decide to have dynamic dropdowns in a Google Sheet, similar to https://productivityspot.com/dependent-drop-list-google-sheets/, then that would be useful for #179. On second thought, this whole feature seems almost like a special case of #179. If we go the Google Sheet route, my main questions would be: (1) could these dynamic dependent dropdowns be generated via the Google Sheets API, and (2) are they preserved when exported to Excel (probably yes)?
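
As a partial data point on the API question: a plain (non-dependent) dropdown can be expressed as a `setDataValidation` request in a Sheets API v4 `batchUpdate` body. The sketch below only builds that request body as a Python dict and makes no API call. The helper name `dropdown_request`, the sheet ID, the column position, and the choices are assumptions for illustration. It does not cover the dependent part, and it says nothing about what survives an Excel export.

```python
def dropdown_request(sheet_id: int, column_index: int,
                     first_row: int, last_row: int, choices: list) -> dict:
    """Build one setDataValidation request that restricts a column to `choices`.

    Row and column indexes are zero-based; end indexes are exclusive.
    """
    return {
        "setDataValidation": {
            "range": {
                "sheetId": sheet_id,
                "startRowIndex": first_row,
                "endRowIndex": last_row,
                "startColumnIndex": column_index,
                "endColumnIndex": column_index + 1,
            },
            "rule": {
                "condition": {
                    "type": "ONE_OF_LIST",
                    "values": [{"userEnteredValue": choice} for choice in choices],
                },
                "strict": True,        # reject values outside the list
                "showCustomUi": True,  # render the dropdown in the cell
            },
        }
    }


# Hypothetical usage: dropdown in the third column, rows 2 to 100 of sheet 0.
body = {"requests": [dropdown_request(0, 2, 1, 100, ["fastq", "bam"])]}
print(body)
```

The dependent variant would presumably need a range- or formula-based condition instead of a fixed list, which is the part that still needs testing.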

> @milen-sage, what do you think?

kthrog commented 3 years ago

I don't think I have anything to add here, @BrunoGrandePhD. @milen-sage covered it more thoroughly than I probably could have, but let me know if you have additional questions!

Sage-Bionetworks / schematic

pre-populated annotations #311