Sage-Bionetworks / schematic

Package for biomedical data model and metadata ingress management
https://schematicpy.readthedocs.io/en/stable/cli_reference.html
MIT License

pre-populated annotations #311

Closed kthrog closed 4 days ago

kthrog commented 4 years ago

This feature would pre-populate some annotations based on existing information associated with the file.

Examples:

It's possible this issue could have some overlap with what Robert just filed, Sage-Bionetworks/HTAN_data_curator#96.

BrunoGrandePhD commented 3 years ago

From my understanding, this request appears distinct from what Robert is requesting in #312 (moved to this repository from the Data Curator repository). Here, we want to allow some logic/rules to dynamically auto-complete values in the manifest based on other values in the manifest (or even values from outside of the manifest, such as the parent Synapse project, which often encodes information on the contributing study/site). On the other hand, #312 has to do with using Synapse annotations to pre-populate the template manifest as it's being generated.
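As a rough illustration of the kind of auto-completion being requested here, the sketch below treats each rule as plain data ("if column X equals A, fill column Y with B") and applies it to a single manifest row. The column names, the values, and the `apply_rules` helper are all hypothetical; none of this is part of schematic.

```python
# Hypothetical rules: "if the `when` column equals `equals`,
# fill the `set` column with `to`". All names and values are made up.
RULES = [
    {"when": "assay", "equals": "scRNA-seq", "set": "dataType", "to": "geneExpression"},
    # A value from outside the manifest (e.g. the parent project) can be
    # passed in as just another key of the row.
    {"when": "parentProject", "equals": "syn000", "set": "studyId", "to": "STUDY-A"},
]


def apply_rules(row, rules=RULES):
    """Return a copy of `row` with empty cells filled in by matching rules.

    Values the user has already entered are never overwritten.
    """
    filled = dict(row)
    for rule in rules:
        matches = filled.get(rule["when"]) == rule["equals"]
        target_is_empty = not filled.get(rule["set"])
        if matches and target_is_empty:
            filled[rule["set"]] = rule["to"]
    return filled


row = {"assay": "scRNA-seq", "dataType": "", "parentProject": "syn000"}
print(apply_rules(row))
```

Because the rules are plain dictionaries, they could in principle be stored in any JSON-compatible format, but they cover only the simple "this value implies that value" case.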

While I agree that this would be extremely useful, I think it would be quite hard to implement, especially with the degree of flexibility that @kthrog is suggesting, which I think would be a minimum to achieve for this to actually be useful. IMHO, there are two main challenges:

  1. Encoding the rules. For this feature to work, we would need to provide a way for users to somehow encode the rules. The JSON-LD schema would be a natural venue for this logic, but I don't know how we could capture anything more complicated than "this value here implies this value there", especially in a programming language–agnostic way. On the other hand, any logic could be encoded using a programming language like Python or R, but this leads to the second challenge.

  2. Executing the rules. For this feature to be useful, the rules would have to be encoded in the Google Sheet template using built-in or custom functions. This way, the logic is executed as values are inputted. So, it would be hard to use custom Python and R code for this task. Admittedly, some values could be pre-populated (such as the study ID based on the file's parent folder/project), but I expect most use cases of this feature to rely on values provided in the Google Sheets. Any code that would be written to translate the rules for Google Sheets would also have to be written to support Excel, making this quite hard to support long-term.
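To make the second challenge concrete, here is a minimal sketch of translating one simple "this value implies that value" rule into a spreadsheet formula. It assumes the rule can be expressed with a plain `IF` and a string comparison, which both Google Sheets and Excel support; the `rule_to_formula` helper and the column names are hypothetical.

```python
def column_letter(index):
    """Convert a 0-based column index to a spreadsheet letter (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def quote(value):
    """Wrap a string for use in a formula, doubling any embedded double quotes."""
    return '"' + value.replace('"', '""') + '"'


def rule_to_formula(headers, when, equals, to, row_number):
    """Build the IF formula for the cell that the rule should fill.

    `headers` is the ordered list of manifest column names and
    `row_number` is the 1-based spreadsheet row.
    """
    source_cell = f"{column_letter(headers.index(when))}{row_number}"
    return f"=IF({source_cell}={quote(equals)},{quote(to)},{quote('')})"


headers = ["Filename", "assay", "dataType"]
print(rule_to_formula(headers, "assay", "scRNA-seq", "geneExpression", 2))
```

The formula would have to live in the target cell itself, so a user who types over it removes the rule for that row. Anything beyond a simple equality test would need a more elaborate translation layer.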

@milen-sage, what do you think?

milen-sage commented 3 years ago

> From my understanding, this request appears distinct from what Robert is requesting in #312 (moved to this repository from the Data Curator repository). Here, we want to allow some logic/rules to dynamically auto-complete values in the manifest based on other values in the manifest (or even values from outside of the manifest, such as the parent Synapse project, which often encodes information on the contributing study/site). On the other hand, #312 has to do with using Synapse annotations to pre-populate the template manifest as it's being generated.

Yes, that's accurate.

> While I agree that this would be extremely useful, I think it would be quite hard to implement, especially with the degree of flexibility that @kthrog is suggesting, which I think would be a minimum to achieve for this to actually be useful. IMHO, there are two main challenges:
>
> 1. Encoding the rules. For this feature to work, we would need to provide a way for users to somehow encode the rules. The JSON-LD schema would be a natural venue for this logic, but I don't know how we could capture anything more complicated than "this value here implies this value there", especially in a programming language–agnostic way. On the other hand, any logic could be encoded using a programming language like Python or R, but this leads to the second challenge.

I'd be curious to see if we can even capture just "this value here implies this value there" types of rules. Right now, I don't think we have a suitable relationship mechanism in the JSON-LD schema to do that, so we might need to add that on the data model specification side. On the data validation side, @BrunoGrandePhD, could you check if jsonschema can encode logic that lets you do "this value here implies this value there"? I might be wrong, but I don't think there is a pre-specified keyword/construct that does that; it is probably feasible by nesting conditional logic.
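For reference, JSON Schema draft-07 and later define the `if`/`then`/`else` keywords, which can express this kind of implication on the validation side. Below is a minimal sketch, assuming the third-party `jsonschema` Python package is installed; the attribute names and values are hypothetical. Note that this only checks that the implication holds; it does not fill in the implied value.

```python
# Hypothetical attributes: "assay == scRNA-seq implies dataType == geneExpression".
SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "assay": {"type": "string"},
        "dataType": {"type": "string"},
    },
    # `required` inside `if` keeps the rule from firing when `assay` is absent.
    "if": {
        "properties": {"assay": {"const": "scRNA-seq"}},
        "required": ["assay"],
    },
    "then": {
        "properties": {"dataType": {"const": "geneExpression"}},
        "required": ["dataType"],
    },
}


def check(row):
    """Return the validation error messages for one manifest row."""
    # Third-party dependency: pip install jsonschema
    from jsonschema import Draft7Validator

    return [error.message for error in Draft7Validator(SCHEMA).iter_errors(row)]


if __name__ == "__main__":
    print(check({"assay": "scRNA-seq", "dataType": "geneExpression"}))
    print(check({"assay": "scRNA-seq", "dataType": "imaging"}))
```

Several such rules could be combined by placing one `if`/`then` pair per entry in an `allOf` list.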

> 2. Executing the rules. For this feature to be useful, the rules would have to be encoded in the Google Sheet template using built-in or custom functions. This way, the logic is executed as values are inputted. So, it would be hard to use custom Python and R code for this task. Admittedly, some values could be pre-populated (such as the study ID based on the file's parent folder/project), but I expect most use cases of this feature to rely on values provided in the Google Sheets. Any code that would be written to translate the rules for Google Sheets would also have to be written to support Excel, making this quite hard to support long-term.

Correct, this would require thinking a bit harder about the front end:

> @milen-sage, what do you think?

kthrog commented 3 years ago

I don't think I have anything to add here, @BrunoGrandePhD. @milen-sage covered it more thoroughly than I probably could have, but let me know if you have additional questions!