Open thoelken opened 2 years ago
Hi Clemens, thanks for this input! My replies are below.
Would it make sense to define a core sample description set among the tables that is implicitly part of all of the metadata sets? This way, changes to one of those fields in a data set would propagate to the others without the need to edit the field in all of the different places, avoiding inconsistencies. All other sets would be extensions of this core metadata set. A counterargument would be that the proposed tables become more complex to understand.
Yes, I understand; it would make more sense. But I also struggle to find a readable way to do it without it being overkill. Sample is the only common field, I think. If you want to try something in a branch, please feel free, and we can do a pull request review.
Another unrelated suggestion would be to track the mapping to other standards more explicitly by splitting the "source" column into "source_var" and "source_set" to make later import/export mappings easier to implement/automate. If the same field maps to multiple sources, "source_var" and "source_set" would each consist of multiple column/set names separated by semicolons, e.g. source_var: seq_meth;INSTRUMENT and source_set: MIGS_BAv5;ENA
That is a really valid point, thanks! I will try to sketch something in a branch as well to see how it would look. This would definitely improve interoperability.
On that point, our NMDC colleagues have a head start with LinkML, a general-purpose modeling language that can be used with linked data, JSON, and other formalisms, and that allows them to bridge existing standards and ontologies. Maybe something to consider as an inter-consortium collaboration, so as not to reinvent the wheel? Best, Charlie
Hey Clemens, hey Charlie, sorry for being late to comment. While I'm not sure if this solves exactly what you want, I stumbled upon the fairgenomes module overview that might be a first solution to what you are looking for.
Would it make sense to define a core sample description set among the tables that is implicitly part of all of the metadata sets? This way, changes to one of those fields in a data set would propagate to the others without the need to edit the field in all of the different places, avoiding inconsistencies. All other sets would be extensions of this core metadata set. A counterargument would be that the proposed tables become more complex to understand.
Yes, I understand; it would make more sense. But I also struggle to find a readable way to do it without it being overkill.
For example, they structure the tables by module and hyperlink to the respective module, which is a separate table in the same document/markdown file, e.g. study.
I don't know if this can be applied across different documents/tables, though.
Sample is the only common field, I think. If you want to try something in a branch, please feel free, and we can do a pull request review.
I didn't check the technical metadata, but the module overview could possibly be applied to the BioEnv tables, as we currently have "site metadata" that is almost identical across all biomes (differing only in the specific examples). However, again, I don't know if this can be applied across different documents/tables.
In terms of readability, I think their solution for a metadata overview is quite neat and readable (for me, at least).
Would it make sense to define a core sample description set among the tables that is implicitly part of all of the metadata sets? This way, changes to one of those fields in a data set would propagate to the others without the need to edit the field in all of the different places, avoiding inconsistencies. All other sets would be extensions of this core metadata set. A counterargument would be that the proposed tables become more complex to understand.
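To make the "core set plus extensions" idea a bit more concrete, here is a minimal Python sketch of how it could look if the tables were generated from a single definition. Everything in it is an assumption for illustration: the `CORE_FIELDS` dictionary, the `extend_core` helper, the field descriptions, and the `site_name` field are hypothetical and not part of the current tables. Only "sample" is named in this thread as a common field, and `seq_meth` is borrowed from the mapping example.

```python
# Hypothetical core set shared by every metadata set.
# Only "sample" is mentioned in the thread as a common field.
CORE_FIELDS = {
    "sample": {"description": "Sample identifier", "required": True},
}


def extend_core(extension_fields):
    """Return a complete metadata set: the core fields plus set-specific ones.

    Redefining a core field in an extension is rejected, so the core
    definition stays the single place where those fields are edited.
    """
    overlap = CORE_FIELDS.keys() & extension_fields.keys()
    if overlap:
        raise ValueError(f"extension redefines core fields: {sorted(overlap)}")
    return {**CORE_FIELDS, **extension_fields}


# Two illustrative extension sets (field names are placeholders).
technical = extend_core(
    {"seq_meth": {"description": "Sequencing method", "required": True}}
)
bioenv = extend_core(
    {"site_name": {"description": "Name of the sampling site", "required": False}}
)
```

Because both sets reuse the same core entry, a change to "sample" would only need to be made once. Whether this is worth the extra indirection in the tables is exactly the open question.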
Another unrelated suggestion would be to track the mapping to other standards more explicitly by splitting the "source" column into "source_var" and "source_set" to make later import/export mappings easier to implement/automate. If the same field maps to multiple sources, "source_var" and "source_set" would each consist of multiple column/set names separated by semicolons, e.g. source_var: seq_meth;INSTRUMENT and source_set: MIGS_BAv5;ENA
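As a rough illustration of why the split would help automation, the following Python sketch parses the two proposed columns into (set, variable) pairs. It assumes that the entries are paired by position, i.e. the first name in "source_var" belongs to the first name in "source_set"; that pairing rule and the `parse_sources` helper are assumptions of this sketch, not something already specified.

```python
def parse_sources(source_var, source_set):
    """Pair semicolon-separated variable names with their source sets.

    Assumption: entries are matched by position, so both columns must
    contain the same number of semicolon-separated names.
    """
    variables = [v.strip() for v in source_var.split(";")]
    sets = [s.strip() for s in source_set.split(";")]
    if len(variables) != len(sets):
        raise ValueError(
            f"{len(variables)} variable name(s) but {len(sets)} set name(s)"
        )
    return list(zip(sets, variables))


# The example from the suggestion above.
pairs = parse_sources("seq_meth;INSTRUMENT", "MIGS_BAv5;ENA")
print(pairs)  # [('MIGS_BAv5', 'seq_meth'), ('ENA', 'INSTRUMENT')]
```

A list of pairs is used instead of a dictionary so that one field could map to several variables of the same set without entries overwriting each other.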
In both cases, I am not sure whether this is worth the effort or whether the result is actually desired.
Cheers, Clemens.