Closed henrikstranneheim closed 2 years ago
@Clinical-Genomics/bioinfo @Clinical-Genomics/laboratory Thoughts? Feedback much appreciated
I think it's a really good description of the simplest solution available.
Very nice summary of yesterday's discussion. This will fix a lot of the issues we currently have.
One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?
Couple of comments:
This is multiomics and track support. Cancer is not omics per se, but a track. We are trying to support multi-track and multiomics. Ultimately BALSAMIC will also have RNA processing enabled.
As I mentioned yesterday, we can discard relapse/remission for cancer in Scout by instead enabling linking between cases at the subject-id level.
Balsamic and Scout: enable linking between remission/relapse and DNA|RNA.
Beautiful start! It might be enough, depending on how one views the case concept, but I would for the sake of clarifying this argue the treatise needs one more level of abstraction, that you repeatedly mention but only as a real world concept. The model class would be
and that the case level would change to
This way a case can consist in e.g. finding the cause of a rare disorder in a family or cohort, with the help of (yet un-integrated) DNA and/or RNA and say small molecule information from some individuals from said group.
One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?
Yes, that would be a specific analysis (e.g. balsamic_qc) and delivery (fastq)
@dnil Thx. I added an analysis level definition. I think it is important that each case has the single responsibility of defining a single analysis to run, e.g. a mip dna analysis. If the same sample should also be part of a cancer analysis, another unique case id with its own sample_id constellation will be generated to take responsibility for that analysis. They can then be connected through the subject_id, if desired.
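A minimal sketch of the linking described above, assuming a plain in-memory representation. The case ids, sample ids, subject id and the `cases_by_subject` helper are all hypothetical and are not taken from cg or StatusDB; the point is only that two single-analysis cases sharing a sample can be found again through the subject.

```python
from collections import defaultdict

# Hypothetical records: each case defines exactly one analysis.
CASES = {
    "case_a": {"analysis": "mip-dna", "sample_ids": ["sample_1", "sample_2"]},
    "case_b": {"analysis": "balsamic", "sample_ids": ["sample_1"]},
}

# Hypothetical mapping from each sample to the subject it was isolated from.
SAMPLE_TO_SUBJECT = {"sample_1": "subject_x", "sample_2": "subject_y"}


def cases_by_subject(cases, sample_to_subject):
    """Return {subject_id: sorted list of case ids that use its samples}."""
    index = defaultdict(set)
    for case_id, case in cases.items():
        for sample_id in case["sample_ids"]:
            index[sample_to_subject[sample_id]].add(case_id)
    return {subject: sorted(case_ids) for subject, case_ids in index.items()}


linked = cases_by_subject(CASES, SAMPLE_TO_SUBJECT)
print(linked["subject_x"])
```

Each case keeps its single responsibility; the connection between the DNA case and the cancer case only exists as a lookup on subject_id, which a downstream application can use or ignore.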
Ok, I would for the sake of taking it to its conclusion argue that that would defeat the purpose of the case level. What would you then call the level that groups multiple analyses into one story of a particular disorder? Say the disruption of splicing in a gene for a child with one affected and one unaffected sibling, where RNA is available for two of the children, and DNA for the parents and children? Seen from our perspective, the unit for which results from analyses of different molecular species and individuals are (potentially) integrated and presented to an interpreting investigator? Or tumor and normal analysis for RNA, DNA and protein?
The unit of focus for interpretation cannot be the individual, as families are important. It cannot be the analysis as different molecular species or modes of analysis can have been employed. If case is one-to-one mapped to analysis it follows that it can also not be case. But if a case can contain multiple analyses, that works.
In an ideal world case compositions would be made quite separate from primary and secondary analysis. Setting up a typical case with some individuals might entail a default portfolio of analyses, each for one sample. Computation on the case level would then involve tools like Genmod - which doubles to prepare RD DNA single sample, single analyses for display, as well as families or more complicated arrangements of individuals - or similar tools for combining RNA and DNA data in splice prediction, or tracing cancer clones through multiple remission samples.
I see your point and I am thinking that this linking of multiple cases will be done from the subject_id level. For example,
Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation. I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one. This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.
In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.
All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times. I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.
Perhaps continue this discussion in a dedicated meeting?
But this is so much fun!
It's almost like taking a philosophy class..
To muuuuch teeeexxxttt .....
I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.
Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. What ends up loaded depends on which analysis for the case finishes last; the previous one is overwritten. If a case has multiple analyses - how will each application know what to expect and how to process them (Housekeeper, Trailblazer, Scout)?
Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation.
Sure, perhaps individual_id is better
I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one.
Yes, that would be 2 different cases with 2 different analysis to run. Whether you want to use the subject_id to link them or not in a downstream application depends on the context
This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.
Yes, in Scout this would be adding a link from one case to another. But is this not what we want? You could upload the RNA case to Scout and then link them in Scout or via the cli (through subject_id), and call this process and entity something else, maybe "linked-case_analysis". In Scout, sample_name is used to point to an individual, although in reality it is a molecular library of an individual, but I think the Scout model does not have to change right now.
In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.
Yes, this is by no means easy. However, the original cases from the cg point of view should be untouched. Any post-processing should spawn new files and not update old files. This should spawn a new case with perhaps an analysis type of "genmod_link_cases_analysis" and delivery: Scout.
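A minimal sketch of that "spawn, never update" idea, assuming cases are held as immutable records. The `CaseAnalysis` class, the `spawn_linked_case` helper and all ids are hypothetical; only the analysis type name "genmod_link_cases_analysis" and the Scout delivery come from the comment above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: an existing case cannot be modified in place
class CaseAnalysis:
    case_id: str
    sample_ids: Tuple[str, ...]
    analysis_type: str
    delivery: str


def spawn_linked_case(new_case_id: str, sources: List[CaseAnalysis],
                      analysis_type: str, delivery: str) -> CaseAnalysis:
    """Build a new case from existing ones without touching the originals."""
    sample_ids: List[str] = []
    for source in sources:
        for sample_id in source.sample_ids:
            if sample_id not in sample_ids:  # keep order, drop repeats
                sample_ids.append(sample_id)
    return CaseAnalysis(new_case_id, tuple(sample_ids), analysis_type, delivery)


dna = CaseAnalysis("dna_case", ("s1", "s2"), "mip-dna", "scout")
rna = CaseAnalysis("rna_case", ("s3",), "mip-rna", "scout")
linked = spawn_linked_case("linked_case", [dna, rna],
                           "genmod_link_cases_analysis", "scout")
print(linked)
```

The original DNA and RNA cases stay exactly as they were delivered; the post-processing step gets its own case id, analysis type and delivery, so each application only needs to handle the one it is the target of.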
All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times.
Cg would know and decide based on how we design the processes.
But yes - this is really hard to do in this forum
And I would never be able to be precise long enough in spoken language - I would have hand-waved it long ago! It's so obvious..
To muuuuch teeeexxxttt .....
@dnil train on twitter!
Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. What ends up loaded depends on which analysis for the case finishes last; the previous one is overwritten. If a case has multiple analyses - how will each application know what to expect and how to process them (Housekeeper, Trailblazer, Scout)?
That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components - and how the investigator wishes to integrate them. If you one-to-one map case and analysis, that gets lost and has to be described somewhere else. You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed. Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity, right?
I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.
To muuuuch teeeexxxttt .....
@dnil train on twitter!
And see how much good that is doing the world?
That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components - and how the investigator wishes to integrate them.
Yes, by including the sample_ids, type of analysis and delivery
If you one-to-one map case and analysis that gets lost and has to be described somewhere else.
Exactly, this is intentional. If a case has multiple responsibilities, things become very complicated across the systems really fast, as each application needs to know and handle many different situations. In this model the case will have a single responsibility and end-point.
You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed.
Yes, and that is precisely why it deserves another process, which has the sole purpose of linking information or post-processing multiple analyses. This process most likely has: sample_ids, type of analysis and delivery. Maybe we should rename the "case_id" in this model to "case_processes".
Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity right?
Once again, it will depend on the context and the defined case_process that should be run. But at least the process can be limited in scope to only the applications that are the intended target of the process and not the entire system.
I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.
This is what we have today and it is broken across cg, hk, mip, balsamic, tb and Scout.
Let's try points:
I would start from a few use "cases" - imagined referrals in a multi-omics enabled lab - and see what components you need. I am convinced you will see complications spawning from trying to handle all integration implicitly by automagically pulling different information for the same individual without regard to the medical investigation at hand.
Micro samples do not have a case_id currently. If the plan is to implement that and subject_id/individual_id, I think subject_id is probably a better fit, as it can include micro samples as well.
I agree! individual_id is too human centric!
subject_id, or maybe entity_id would be more flexible =P
It should be clear, but as generic as possible, and easy to understand for us and our customers.
In case of mord samples we may also not know the individual_id precisely. Are micro samples sometimes not taken directly from individuals but rather from points in the environment? sample_spawner_id?
Stale issue, closing
Aim
Enable multiple omics at Clinical Genomics across all operations
Background
To enable the introduction of multiple omics (DNA, RNA, Cancer) throughout Clinical Genomics, the order portal and StatusDB should not hold analysis and delivery information on each individual sample_id. Doing so complicates downstream processes, because each application has to be pipeline aware at the sample_id level, with logic in place at every decision point to interpret which analysis has been performed and how to proceed. Furthermore, there is a need to couple 1..N sample_ids to the sample origin, i.e. the actual subject/person/microbe that the sample was isolated from. This level is currently missing at Clinical Genomics.
Definitions
Proposed solution
Samples
Case_analysis
Analysis
Subject
Case
Since each case_analysis will be unique with a single responsibility to define:
For sample_ids that should be processed by multiple analysis or have multiple deliveries:
Each sample_id can be connected to a single subject_id allowing:
Example:
Samples
Sample: lims_id_1, Subject_id: Kalle
Sample: lims_id_2, Subject_id: Pelle
Sample: lims_id_3, Subject_id: Kalle
Case
Subject_ids: Kalle, Pelle
Case_analysis: funcobra, quicksnail, crackpanda
Analysis_methods: mip-dna, mip-rna
Display_name: 2020_1
Case_analyses:
funcobra:
Sample_ids: lims_id_1, lims_id_2
Analysis_method: mip-dna
Delivery_method: scout
quicksnail:
Sample_ids: lims_id_3
Analysis_method: mip-rna
Delivery_method: scout
crackpanda:
Sample_ids: lims_id_1, lims_id_3
Analysis_method: combine_mip-dna_mip-rna
Delivery_method: scout-combina-analyses
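The example above can be written out as a small data model. This is only a sketch under stated assumptions: the class names, field names and the two helper functions are hypothetical and do not reflect the actual StatusDB schema, and the first sample of crackpanda is assumed to be lims_id_1.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Sample:
    sample_id: str
    subject_id: str  # the subject the sample was isolated from


@dataclass
class CaseAnalysis:
    name: str
    sample_ids: List[str]
    analysis_method: str   # single responsibility: one analysis
    delivery_method: str   # ... and one delivery


@dataclass
class Case:
    display_name: str
    subject_ids: List[str]
    case_analyses: List[CaseAnalysis]


def subjects_for(analysis: CaseAnalysis, samples: Dict[str, Sample]) -> Set[str]:
    """Subjects reached by a case_analysis through its sample_ids."""
    return {samples[sample_id].subject_id for sample_id in analysis.sample_ids}


def unlisted_subjects(case: Case, samples: Dict[str, Sample]) -> Set[str]:
    """Subjects used by the case_analyses but missing from the case itself."""
    used: Set[str] = set()
    for analysis in case.case_analyses:
        used |= subjects_for(analysis, samples)
    return used - set(case.subject_ids)


SAMPLES = {s.sample_id: s for s in [
    Sample("lims_id_1", "Kalle"),
    Sample("lims_id_2", "Pelle"),
    Sample("lims_id_3", "Kalle"),
]}

CASE = Case(
    display_name="2020_1",
    subject_ids=["Kalle", "Pelle"],
    case_analyses=[
        CaseAnalysis("funcobra", ["lims_id_1", "lims_id_2"], "mip-dna", "scout"),
        CaseAnalysis("quicksnail", ["lims_id_3"], "mip-rna", "scout"),
        # Assumption: the first sample id of crackpanda is lims_id_1.
        CaseAnalysis("crackpanda", ["lims_id_1", "lims_id_3"],
                     "combine_mip-dna_mip-rna", "scout-combina-analyses"),
    ],
)

print(unlisted_subjects(CASE, SAMPLES))
```

In this layout analysis and delivery information lives only on the case_analysis, never on the sample, and the subject level is reached by following sample_ids.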
Task: