Enabling multiple omics at Clinical Genomics

henrikstranneheim commented 4 years ago

Aim

Enable multiple omics at Clinical Genomics across all operations

Background

To enable the introduction of multiple omics (DNA, RNA, Cancer) throughout Clinical Genomics, the order portal and StatusDB should not hold analysis and delivery information on each individual sample_id. Doing so complicates downstream processing: every application has to be pipeline-aware at the sample_id level, with logic at every decision point to work out which analysis has been performed and how to proceed. Furthermore, we need to be able to couple 1..N sample_ids to the sample origin, i.e. the actual subject/person/microbe that the sample was isolated from. This level is currently missing at Clinical Genomics.

Definitions

A sample_id is an identifier of a molecular library (e.g. lims_id, sample_display_name)
A case_analysis is a group_id for a set of sample_id(s), an analysis method and a delivery method
An analysis belongs to a case_analysis and holds metadata regarding the analysis method
A subject_id is an individual identifier for a person, used to link multiple sample_id(s)
A case_id is a group_id for a set of subject_id(s), case_analyses and analysis methods

Proposed solution

Samples

Can belong to 1..N case_analyses
Belong to a single subject_id
Do NOT have information on what type of analysis to run
Do NOT have information on what type of delivery to perform

Case_analysis

Can have 1..N number of sample_ids
Sample_ids in a case_analysis can be linked to different subject_ids
A case_analysis can hold ONLY ONE type of analysis method to run
A case_analysis can hold ONLY ONE type of delivery method to perform
Can be connected to multiple analysis_objects from the analysis method within the case_analysis

Analysis

Has a case_analysis
Has an analysis_method
Holds metadata about the run of the analysis method

Subject

Can have multiple sample_ids
Are not directly connected to any case

Case

Can have multiple subject_ids
Can have multiple case_analyses
Can have multiple analysis_methods

Since each case_analysis will be unique with a single responsibility to define:

What samples to run
What single analysis to run
What single delivery to perform

We can simplify each downstream process, as no downstream application needs to care about multiple workflows or delivery types
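The entities listed under "Proposed solution" could be sketched roughly like this. It is only an illustration of the relationships, not the cg or StatusDB schema: all class and attribute names are assumptions, and deriving a case's subject_ids and analysis_methods from its case_analyses is one possible design choice among several.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """A molecular library; carries no analysis or delivery information."""
    sample_id: str
    subject_id: str  # each sample belongs to exactly one subject


@dataclass
class CaseAnalysis:
    """Groups 1..N samples with exactly one analysis and one delivery method."""
    name: str
    samples: list[Sample]
    analysis_method: str
    delivery_method: str

    def __post_init__(self) -> None:
        if not self.samples:
            raise ValueError("a case_analysis needs at least one sample")

    @property
    def subject_ids(self) -> set[str]:
        # Samples in one case_analysis may come from different subjects.
        return {sample.subject_id for sample in self.samples}


@dataclass
class Analysis:
    """Metadata about one run of the analysis method of a case_analysis."""
    case_analysis: CaseAnalysis
    metadata: dict[str, str] = field(default_factory=dict)

    @property
    def analysis_method(self) -> str:
        return self.case_analysis.analysis_method


@dataclass
class Case:
    """Groups case_analyses; subjects and analysis methods are derived here."""
    display_name: str
    case_analyses: list[CaseAnalysis] = field(default_factory=list)

    @property
    def subject_ids(self) -> set[str]:
        return set().union(*(ca.subject_ids for ca in self.case_analyses))

    @property
    def analysis_methods(self) -> set[str]:
        return {ca.analysis_method for ca in self.case_analyses}
```

In this sketch a Sample has no analysis or delivery field at all, so a downstream tool can only find out what to run by looking at the CaseAnalysis it was handed.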

For sample_ids that should be processed by multiple analyses or have multiple deliveries:

The required number of unique case_analyses will be created to match the analysis and delivery requirements

Each sample_id can be connected to a single subject_id, allowing:

From the Case: connecting case_analysis_ids (indirectly via subject_id and their sample_ids) to link 1..N analysis method runs with each other (MIP and Scout: DNA|RNA, Balsamic and Scout: Relapse, MicroSalt: Current production analysis | SNV analysis)

Example:

Samples

Sample: lims_id_1 Subject_id: Kalle
Sample: lims_id_2 Subject_id: Pelle
Sample: lims_id_3 Subject_id: Kalle

Case

Subject_ids: Kalle, Pelle
Case_analysis: funcobra, quicksnail, crackpanda
Analysis_methods: mip-dna, mip-rna
Display_name: 2020_1

Case_analyses:

funcobra:

Sample_ids: lims_id_1, lims_id_2
Analysis_method: mip-dna
Delivery_method: scout

quicksnail:

Sample_ids: lims_id_3
Analysis_method: mip-rna
Delivery_method: scout

crackpanda:

Sample_ids: lims_id_1, lims_id_3
Analysis_method: combine_mip-dna_mip-rna
Delivery_method: scout-combina-analyses
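To make the "indirectly via subject_id" linking concrete, here is a small sketch that writes the two primary case_analyses from the example as plain data and works out which of them share a subject. The dictionary layout and the function name are assumptions made for illustration; nothing here is existing cg code.

```python
from collections import defaultdict

# Sample -> subject, as listed in the example above.
SAMPLE_SUBJECT = {
    "lims_id_1": "Kalle",
    "lims_id_2": "Pelle",
    "lims_id_3": "Kalle",
}

# Only the two primary analyses; "crackpanda" is the combining step.
CASE_ANALYSES = {
    "funcobra": {"sample_ids": ["lims_id_1", "lims_id_2"], "analysis_method": "mip-dna"},
    "quicksnail": {"sample_ids": ["lims_id_3"], "analysis_method": "mip-rna"},
}


def case_analyses_by_subject(case_analyses, sample_subject):
    """Map each subject_id to the sorted names of the case_analyses it takes part in."""
    linked = defaultdict(set)
    for name, case_analysis in case_analyses.items():
        for sample_id in case_analysis["sample_ids"]:
            linked[sample_subject[sample_id]].add(name)
    return {subject: sorted(names) for subject, names in linked.items()}


links = case_analyses_by_subject(CASE_ANALYSES, SAMPLE_SUBJECT)
print(links)
```

Kalle shows up in both funcobra (DNA) and quicksnail (RNA), which is the kind of overlap a combining case_analysis such as crackpanda would build on. Pelle only appears in funcobra.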

Task:

The order portal should no longer accept mixed analysis types, e.g. MIP+Balsamic
The order portal should have one field for analysis type (Balsamic, MIP, MicroSalt, etc) on case level
The order portal should have one field for delivery type (fastq, Scout etc) on case level
StatusDb should not have data analysis on sample level
StatusDb field "from_sample" should be renamed to "subject_id"
StatusDb family should hold delivery_type
Current cases in StatusDb with mixed analysis types should be split into N cases
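The last task, splitting existing cases with mixed analysis types, might look something like the sketch below. It assumes a hypothetical layout where the analysis type is currently stored per sample, and it invents a naming scheme for the new cases (original id plus analysis type). Neither is taken from the real StatusDb schema.

```python
from collections import defaultdict


def split_mixed_case(case_id, samples):
    """Split one case into one case per analysis type.

    `samples` is a list of (sample_id, analysis_type) pairs, mirroring a
    hypothetical layout where the analysis type is stored per sample.
    Returns {new_case_id: {"analysis_type": ..., "sample_ids": [...]}}.
    """
    by_analysis = defaultdict(list)
    for sample_id, analysis_type in samples:
        by_analysis[analysis_type].append(sample_id)

    if len(by_analysis) == 1:
        # Not mixed: keep the original case id.
        analysis_type, sample_ids = next(iter(by_analysis.items()))
        return {case_id: {"analysis_type": analysis_type, "sample_ids": sample_ids}}

    # Mixed: one new case per analysis type (naming scheme is an assumption).
    return {
        f"{case_id}_{analysis_type}": {"analysis_type": analysis_type, "sample_ids": sample_ids}
        for analysis_type, sample_ids in sorted(by_analysis.items())
    }


mixed = split_mixed_case("case_a", [("s1", "mip"), ("s2", "balsamic"), ("s3", "mip")])
print(mixed)
```

After such a split the analysis type lives only on the case, so the per-sample field could then be dropped.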

henrikstranneheim commented 4 years ago

@Clinical-Genomics/bioinfo @Clinical-Genomics/laboratory Thoughts? Feedback much appreciated

northwestwitch commented 4 years ago

I think it's a really good description of the simplest solution available. 🚀

hassanfa commented 4 years ago

👍 very nice summary of yesterday's discussion. This will fix a lot of issues we have currently.

One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?

Couple of comments:

This is multiomics and track support. Cancer is not omics per se, but a track. We are trying to support multi-track and multiomics. Ultimately BALSAMIC will also have RNA processing enabled.
As I mentioned yesterday, we can drop relapse/remission handling for cancer in Scout and instead enable linking between cases at the subject_id level.
Balsamic and Scout: enable linking between remission/relapse and DNA|RNA.

dnil commented 4 years ago

Beautiful start! It might be enough, depending on how one views the case concept, but I would for the sake of clarifying this argue the treatise needs one more level of abstraction, that you repeatedly mention but only as a real world concept. The model class would be

Analysis

can contain multiple sample_ids
can be part of multiple cases (but is typically only part of one)
has information about the software (typically single pipeline) and parameters used to run it, its qc and results
is connected to one (package of) deliverables

and that the case level would change to

Case

Can consist of 1..N number of analyses and corresponding deliveries
Sample_ids in a case can be linked to different subject_ids
Can be connected to multiple analysis_objects (from multiple types of analysis)

This way a case can consist in e.g. finding the cause of a rare disorder in a family or cohort, with the help of (yet un-integrated) DNA and/or RNA and say small molecule information from some individuals from said group.

henrikstranneheim commented 4 years ago

One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?

Yes, that would be a specific analysis (e.g. balsamic_qc) and delivery (fastq)

henrikstranneheim commented 4 years ago

@dnil Thx. I added an analysis level definition. I think it is important that each case has the single responsibility of defining a single analysis to run, e.g. a mip-dna analysis. If the same sample should also be part of a cancer analysis, another unique case id with its own sample_id constellation will be generated to take responsibility for that analysis. The two can then be connected through the subject_id, if desired.

dnil commented 4 years ago

Ok, for the sake of taking it to its conclusion, I would argue that this defeats the purpose of the case level. What would you then call the level that groups multiple analyses into one story of a particular disorder? Say the disruption of splicing in a gene for a child with one affected and one unaffected sibling, where RNA is available for two of the children, and DNA for the parents and children? Seen from our perspective, the unit for which results from analyses of different molecular species and individuals are (potentially) integrated and presented to an interpreting investigator? Or tumor and normal analysis for RNA, DNA and protein?

The unit of focus for interpretation cannot be the individual, as families are important. It cannot be the analysis as different molecular species or modes of analysis can have been employed. If case is one-to-one mapped to analysis it follows that it can also not be case. But if a case can contain multiple analyses, that works. 😸

dnil commented 4 years ago

In an ideal world case compositions would be made quite separate from primary and secondary analysis. Setting up a typical case with some individuals might entail a default portfolio of analyses, each for one sample. Computation on the case level would then involve tools like Genmod - which doubles to prepare RD DNA single sample, single analyses for display, as well as families or more complicated arrangements of individuals - or similar tools for combining RNA and DNA data in splice prediction, or tracing cancer clones through multiple remission samples.

henrikstranneheim commented 4 years ago

I see your point and I am thinking that this linking of multiple cases will be done from the subject_id level. For example,

Case_1 (DNA): A mip-dna analysis is ordered with a subject_id
- The mip-dna analysis is performed for the case and uploaded to Scout (no causatives were found).
- RNA is ordered.
Case_2 (RNA): Is linked to Case_1 in the order portal via the subject_id.
- Cg will let the mip-rna analysis know that a mip-dna analysis has previously been processed and modify the mip command accordingly.
- Cg knows that the RNA BAM and arriba report should be connected to the mip-dna case in Scout via the subject_id and updates Case_1 in Scout with the links from the RNA analysis via the Scout cli (e.g. scout update case Case_1 [RNA_data])
- We could also update the ranking of the DNA variants using the RNA data, as Cg knows we have two independent cases with finished analyses. Genmod can then be launched to update the DNA vcf rank scores taking the RNA info into account (and vice versa). Scout can then be updated for the DNA case with the reprocessed variants.
- We can also choose not to use the subject level and upload the RNA case itself, for instance to show called RNA variants, without any connection to Case_1.
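The lookup that Cg would need for the steps above, "which finished cases exist for this subject_id?", could be sketched as follows. FinishedCase and previous_cases_for_subjects are hypothetical names, and the sketch deliberately leaves open what is done with the result (modifying the mip command, updating Scout, or nothing at all).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedCase:
    """Minimal, hypothetical record of a case whose analysis has completed."""
    case_id: str
    subject_ids: frozenset
    analysis_method: str


def previous_cases_for_subjects(subject_ids, finished_cases):
    """Return the finished cases that share at least one subject with a new order."""
    wanted = set(subject_ids)
    return [case for case in finished_cases if wanted & case.subject_ids]


finished = [
    FinishedCase("Case_1", frozenset({"Kalle"}), "mip-dna"),
    FinishedCase("Case_0", frozenset({"Pelle"}), "mip-dna"),
]

# A new RNA order for Kalle: which earlier DNA cases could it be linked to?
dna_case_ids = [
    case.case_id
    for case in previous_cases_for_subjects({"Kalle"}, finished)
    if case.analysis_method == "mip-dna"
]
print(dna_case_ids)
```

Whether and how to act on a match stays a per-process decision; the lookup itself only needs subject_ids on the cases.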

dnil commented 4 years ago

Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation. I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one. This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.

In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.

All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times. I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.

moonso commented 4 years ago

Perhaps continue this discussion in a dedicated meeting?

dnil commented 4 years ago

But this is so much fun! 😉

dnil commented 4 years ago

It's almost like taking a philosophy class..

moonso commented 4 years ago

To muuuuch teeeexxxttt .....

henrikstranneheim commented 4 years ago

I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.

Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. What is loaded depends on which analysis for the case finishes last; the previous one is overwritten. If a case has multiple analyses, how will each application (Housekeeper, Trailblazer, Scout) know what to expect and how to process them?

Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation.

Sure, perhaps individual_id is better

I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one.

Yes, that would be 2 different cases with 2 different analyses to run. Whether you want to use the subject_id to link them or not in a downstream application depends on the context.

This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.

Yes, in Scout this would be adding a link from one case to another. But is this not what we want? You could upload the RNA case to Scout and then link them in Scout or via the cli (through subject_id), and call this process and entity something else. Maybe "linked-case_analysis". In Scout, sample_name is used to point to an individual, although in reality it is a molecular library from an individual, but I think the Scout model does not have to change right now.

In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.

Yes, this is by no means easy. However, the original cases should, from the cg point of view, remain untouched. Any post-processing should spawn new files and not update old files. This should spawn a new case, perhaps with an analysis type of "genmod_link_cases_analysis" and delivery: Scout.

All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times.

Cg would know and decide based on how we design the processes.

henrikstranneheim commented 4 years ago

But yes - this is really hard to do in this forum

dnil commented 4 years ago

And I would never be able to be precise long enough in spoken language - I would have hand-waved it long ago! It's so obvious..

northwestwitch commented 4 years ago

To muuuuch teeeexxxttt .....

@dnil train on twitter! 🤣

dnil commented 4 years ago

Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. What is loaded depends on which analysis for the case finishes last; the previous one is overwritten. If a case has multiple analyses, how will each application (Housekeeper, Trailblazer, Scout) know what to expect and how to process them?

That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components and how the investigator wishes to integrate them. If you one-to-one map case and analysis that gets lost and has to be described somewhere else. You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed. Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity right?

I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.
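As a rough illustration of what such a whole-case description could contain, here is the same idea as a Python dictionary (a YAML file would carry the same keys), with a helper that orders the steps so that integrating analyses run after the primary analyses they need. The keys "steps" and "needs" and the function launch_order are invented for this sketch and do not correspond to any existing cg, Trailblazer or Housekeeper format.

```python
from graphlib import TopologicalSorter

# Hypothetical whole-case description; a YAML file could hold the same structure.
CASE_DESCRIPTION = {
    "case": "2020_1",
    "steps": {
        "dna": {"analysis_method": "mip-dna", "needs": []},
        "rna": {"analysis_method": "mip-rna", "needs": []},
        "combined": {
            "analysis_method": "combine_mip-dna_mip-rna",
            "needs": ["dna", "rna"],
        },
    },
}


def launch_order(description):
    """Order the steps so that each one comes after everything it needs."""
    graph = {name: set(step["needs"]) for name, step in description["steps"].items()}
    return list(TopologicalSorter(graph).static_order())


order = launch_order(CASE_DESCRIPTION)
print(order)
```

The two primary steps have no dependency on each other, so their relative order is not fixed. Only the combining step is guaranteed to come last.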

To muuuuch teeeexxxttt .....

@dnil train on twitter! 🤣

And see how much good that is doing the world? 😜

henrikstranneheim commented 4 years ago

That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components and how the investigator wishes to integrate them.

Yes, by including the sample_ids, type of analysis and delivery

If you one-to-one map case and analysis that gets lost and has to be described somewhere else.

Exactly, this is intentional. If a case has multiple responsibilities, things become very complicated across the systems really fast, as each application needs to know about and handle many different situations. In this model the case will have a single responsibility and end-point.

You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed.

Yes, and that is precisely why it deserves another process, one whose sole purpose is linking information or post-processing multiple analyses. This process most likely has: sample_ids, type of analysis and delivery. Maybe we should rename the "case_id" in this model to "case_processes".

Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity right?

Once again, it will depend on the context and the defined case_process that should be run. But at least the process can be limited in scope to only the applications that are the intended target of the process and not the entire system.

I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.

This is what we have today and it is broken across cg, hk, mip, balsamic, tb and Scout.

dnil commented 4 years ago

Let's try points:

Individuals are good. 😄 Just implement it.
Individuals won't solve the issue of how to start the right analyses and integration steps magically. There are still going to be complications in specifying that. Introduce a new concept that can contain/act on the results of multiple analyses - whatever you wish to name it.

I would start from a few use "cases" - imagined referrals in a multi-omics enabled lab - and see what components you need. I am convinced you will see complications spawning from trying to handle all integration implicitly by automagically pulling different information for the same individual without regard to the medical investigation at hand.

annaengstrom commented 4 years ago

Micro samples do not have a case_id currently. If the plan is to implement that and subject_id/individual_id, I think subject_id is probably a better fit, as it can include micro samples as well.

J35P312 commented 4 years ago

I agree! individual_id is too human centric!

subject_id, or maybe entity_id would be more flexible =P

annaengstrom commented 4 years ago

It should be clear, but as generic as possible, and easy to understand for us and our customers.

dnil commented 4 years ago

In case of mord samples we may also not know the individual_id precisely. Are micro samples sometimes not taken directly from individuals but rather from points in the environment? sample_spawner_id?

Mropat commented 2 years ago

Stale issue, closing

Clinical-Genomics / cg