Clinical-Genomics / cg

Glue between Clinical Genomics apps

Enabling multiple omics at Clinical Genomics #744

Closed henrikstranneheim closed 2 years ago

henrikstranneheim commented 4 years ago

Aim

Enable multiple omics at Clinical Genomics across all operations

Background

To enable the introduction of multiple omics (DNA, RNA, Cancer) throughout Clinical Genomics, the order portal and StatusDB should not hold analysis and delivery information on each individual sample_id. Doing so complicates downstream processes: each application has to be pipeline aware at the sample_id level, with logic at every decision point to interpret which analysis has been performed and how to proceed. Furthermore, there is a need to couple 1..N sample_ids to the sample origin, i.e. the actual subject/person/microbe that the sample was isolated from. This level is currently missing at Clinical Genomics.

Definitions

  1. A sample_id is an identifier of a molecular library (e.g. lims_id, sample_display_name)
  2. A case_analysis is a group_id of a set of sample_id(s), an analysis method and a delivery method
  3. An analysis belongs to a case_analysis and holds metadata regarding the analysis method
  4. A subject_id is an individual identifier for a person used to link multiple sample_id(s)
  5. A case_id is a group_id of a set of subject_id(s), case_analyses and analysis methods.
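The five definitions above can be read as a small data model. The following is a minimal sketch in Python, assuming plain dataclasses; the class and field names are illustrative and are not taken from the cg codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    """A molecular library, identified e.g. by its lims_id."""
    sample_id: str
    subject_id: str  # the person/microbe the sample was isolated from


@dataclass(frozen=True)
class Analysis:
    """Metadata about the analysis method of one case_analysis."""
    analysis_method: str


@dataclass(frozen=True)
class CaseAnalysis:
    """A set of samples plus exactly one analysis and one delivery."""
    case_analysis_id: str
    sample_ids: tuple[str, ...]
    analysis: Analysis
    delivery_method: str


@dataclass
class Case:
    """Groups subjects and the case_analyses that concern them."""
    case_id: str
    subject_ids: list[str] = field(default_factory=list)
    case_analyses: list[CaseAnalysis] = field(default_factory=list)

    def analysis_methods(self) -> set[str]:
        # Derived from the case_analyses rather than stored separately.
        return {ca.analysis.analysis_method for ca in self.case_analyses}
```

Deriving the analysis methods from the case_analyses is one possible design choice; the definitions above list them as part of the case without saying how they are stored.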

Proposed solution

Samples

Case_analysis

Analysis

Subject

Case

Each case_analysis will be unique, with a single responsibility to define:

  1. What samples to run
  2. What single analysis to run
  3. What single delivery to perform

This simplifies each downstream process, as no downstream application needs to care about multiple workflows or delivery types.

For sample_ids that should be processed by multiple analyses or have multiple deliveries:

Each sample_id can be connected to a single subject_id, allowing:

  1. From the Case: connecting case_analysis_ids (indirectly via subject_id and their sample_ids) to connect 1..N analysis method runs with each other (MIP and Scout: DNA|RNA, Balsamic and Scout: Relapse, MicroSalt: Current production analysis | SNV analysis)

Example:

Samples

Sample: lims_id_1 Subject_id: Kalle
Sample: lims_id_2 Subject_id: Pelle
Sample: lims_id_3 Subject_id: Kalle

Case

Subject_ids: Kalle, Pelle
Case_analysis: funcobra, quicksnail, crackpanda
Analysis_methods: mip-dna, mip-rna
Display_name: 2020_1

Case_analyses:

funcobra:

Sample_ids: lims_id_1, lims_id_2
Analysis_method: mip-dna
Delivery_method: scout

quicksnail:

Sample_ids: lims_id_3
Analysis_method: mip-rna
Delivery_method: scout

crackpanda:

Sample_ids: lims_id_1, lims_id_3
Analysis_method: combine_mip-dna_mip-rna
Delivery_method: scout-combina-analyses
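To make the indirect link via subject_id concrete, here is a small sketch that encodes the two single-omics case_analyses from the example and looks up which of them involve a given subject. The dict layout and the function name are assumptions made for illustration only.

```python
# Sample-to-subject mapping, copied from the example.
SAMPLE_TO_SUBJECT = {
    "lims_id_1": "Kalle",
    "lims_id_2": "Pelle",
    "lims_id_3": "Kalle",
}

# Each case_analysis holds samples, one analysis method and one delivery method.
CASE_ANALYSES = {
    "funcobra": {
        "sample_ids": ["lims_id_1", "lims_id_2"],
        "analysis_method": "mip-dna",
        "delivery_method": "scout",
    },
    "quicksnail": {
        "sample_ids": ["lims_id_3"],
        "analysis_method": "mip-rna",
        "delivery_method": "scout",
    },
}


def case_analyses_for_subject(subject_id):
    """Return the ids of case_analyses with at least one sample from subject_id."""
    return sorted(
        name
        for name, case_analysis in CASE_ANALYSES.items()
        if any(
            SAMPLE_TO_SUBJECT.get(sample_id) == subject_id
            for sample_id in case_analysis["sample_ids"]
        )
    )
```

With this data, looking up Kalle returns both funcobra (DNA) and quicksnail (RNA), which is the connection a downstream application could use to relate the two runs.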

Task:

henrikstranneheim commented 4 years ago

@Clinical-Genomics/bioinfo @Clinical-Genomics/laboratory Thoughts? Feedback much appreciated

northwestwitch commented 4 years ago

I think it's a really good description of the simplest solution available. 🚀

hassanfa commented 4 years ago

šŸ‘ very nice summary of yesterday's discussion. This will fix a lot of issues we have currently.

One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?

Couple of comments:

  1. This is multiomics and track support. Cancer is not omics per se, but a track. We are trying to support multi-track and multiomics. Ultimately BALSAMIC will also have RNA processing enabled.

  2. As I mentioned yesterday, we can discard relapse/remission for cancer in Scout, but by enabling a linking between cases on subject-id level.

  3. Balsamic and Scout: enable linking between remission/relapse and DNA|RNA.

dnil commented 4 years ago

Beautiful start! It might be enough, depending on how one views the case concept, but I would for the sake of clarifying this argue the treatise needs one more level of abstraction, that you repeatedly mention but only as a real world concept. The model class would be

Analysis

and that the case level would change to

Case

This way a case can consist in e.g. finding the cause of a rare disorder in a family or cohort, with the help of (yet un-integrated) DNA and/or RNA and say small molecule information from some individuals from said group.

henrikstranneheim commented 4 years ago

One thing we forgot yesterday, was analysis types for just QC for those customers that only need fastq delivered. So analysis type in order form should support this as well. What do you think?

Yes, that would be a specific analysis (e.g. balsamic_qc) and delivery (fastq)

henrikstranneheim commented 4 years ago

@dnil Thx. I added an analysis level definition. I think it is important that each case has a single responsibility of defining a single analysis to run, e.g. mip dna analysis. If the same sample also should be part of a cancer analysis - another unique case id with a sample_id constellation will be generated to take responsibility for that analysis. They can then be connected through the subject_id, if desired.

dnil commented 4 years ago

Ok, I would for the sake of taking it to its conclusion argue that that would defeat the purpose of the case level. What would you then call the level that groups multiple analyses into one story of a particular disorder? Say the disruption of splicing in a gene for a child with one affected and one unaffected sibling, where RNA is available for two of the children, and DNA for the parents and children? Seen from our perspective, the unit for which results from analyses of different molecular species and individuals are (potentially integrated and) presented to an interpreting investigator? Or tumor and normal analysis for RNA, DNA and protein?

The unit of focus for interpretation cannot be the individual, as families are important. It cannot be the analysis as different molecular species or modes of analysis can have been employed. If case is one-to-one mapped to analysis it follows that it can also not be case. But if a case can contain multiple analyses, that works. 😸

dnil commented 4 years ago

In an ideal world case compositions would be made quite separate from primary and secondary analysis. Setting up a typical case with some individuals might entail a default portfolio of analyses, each for one sample. Computation on the case level would then involve tools like Genmod - which doubles to prepare RD DNA single sample, single analyses for display, as well as families or more complicated arrangements of individuals - or similar tools for combining RNA and DNA data in splice prediction, or tracing cancer clones through multiple remission samples.

henrikstranneheim commented 4 years ago

I see your point and I am thinking that this linking of multiple cases will be done from the subject_id level. For example,

dnil commented 4 years ago

Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation. I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one. This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.

In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.

All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times. I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.

moonso commented 4 years ago

Perhaps continue this discussion in a dedicated meeting?

dnil commented 4 years ago

But this is so much fun! 😉

dnil commented 4 years ago

It's almost like taking a philosophy class..

moonso commented 4 years ago

To muuuuch teeeexxxttt .....

henrikstranneheim commented 4 years ago

I would for simplicity call this entity "case", but one can pick another word so as not to confuse it with what we have in Scout / genmod today.

Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. It will depend on which analysis for a case that finishes last. The previous will be overwritten. If a case has multiple analyses - how will each application know what to expect and how to process them (Housekeeper, Trailblazer, Scout)?
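The overwrite problem described here can be illustrated with a toy store. This is not Scout's actual loader, only a hypothetical dictionary keyed by case id, compared with one keyed by case_analysis id (names borrowed from the example earlier in the thread).

```python
def load(store, key, result):
    """Insert or replace the entry for key, as a naive loader would."""
    store[key] = result
    return store


# Keyed by case id: the analysis that finishes last replaces the earlier one.
by_case = {}
load(by_case, "2020_1", "mip-dna result")
load(by_case, "2020_1", "mip-rna result")

# Keyed by case_analysis id: each analysis keeps its own entry.
by_case_analysis = {}
load(by_case_analysis, "funcobra", "mip-dna result")
load(by_case_analysis, "quicksnail", "mip-rna result")
```

In the first store only the RNA result survives; in the second both results remain addressable, which is the property the single-responsibility proposal relies on.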

Representing subjects (or perhaps modernly called individuals) is a great improvement over the current situation.

Sure, perhaps individual_id is better

I would however argue the example above already shows that it cannot conveniently fill the role of an information integrating level. Individuals can, for various purposes be part of different cases, such as the affected sib in one quad, and the unaffected in another one.

Yes, that would be 2 different cases with 2 different analysis to run. Whether you want to use the subject_id to link them or not in a downstream application depends on the context

This could arguably be represented cleanly from the individual level. The model with distinct analysis breaks down - or at least very clearly lacks a layer of abstraction - when different molecular species are used to inform the cases.

Yes, in Scout this would be adding a link from a case to another case. But is this not what we want? You could upload the RNA case to Scout and then link them in Scout or via the cli (through subject_id), and call this process and entity something else. Maybe "linked-case_analysis". In Scout sample_name is used to point to an individual, but it is in reality a molecular library of an individual, but I think the Scout model does not have to change right now.

In the example above there are several instances where data of different types inform the case. E.g. the RNA is said to "update" case_1. What will the result be? Not a pure "case" any longer surely. The same goes if DNA ranking is informed by RNA. No longer one analysis type, so not a case.

Yes, this is by no means easy. However, the original cases from the cg point of view should be untouched. Any post processing should spawn new files and not update old files. This should spawn a new case with perhaps an analysis type of "genmod_link_cases_analysis" and delivery: Scout.
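A minimal sketch of the "spawn a new case, never update an old one" idea, assuming an immutable record type. `CaseRecord` and `link_cases` are hypothetical names; only the analysis type string and the Scout delivery come from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: existing cases cannot be modified in place
class CaseRecord:
    case_id: str
    sample_ids: tuple
    analysis_method: str
    delivery_method: str


def link_cases(new_case_id, *cases):
    """Return a NEW record covering the samples of all input cases.

    The input cases are only read, never changed.
    """
    sample_ids = tuple(sorted({s for case in cases for s in case.sample_ids}))
    return CaseRecord(
        case_id=new_case_id,
        sample_ids=sample_ids,
        analysis_method="genmod_link_cases_analysis",
        delivery_method="scout",
    )
```

Because the record is frozen, any attempt to alter an existing case raises an error, so post processing can only produce additional cases.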

All the above are easily fit into a model with (multiple) cases with multiple analyses, but things start to sound contradictory as soon as you remove the "multiple" part. Note that you use undefined language acting on and connecting cases (analyses) such as "update", "reprocess", "connected to" when you refer to properties of an abstract entity within CG that would represent the different analyses (or cases) and have knowledge of how they fit together sometimes, but not other times.

Cg would know and decide based on how we design the processes.

henrikstranneheim commented 4 years ago

But yes - this is really hard to do in this forum

dnil commented 4 years ago

And I would never be able to be precise long enough in spoken language - I would have hand-waved it long ago! It's so obvious..

northwestwitch commented 4 years ago

To muuuuch teeeexxxttt .....

@dnil train on twitter! 🤣

dnil commented 4 years ago

Case is what is used in cg (the tool) today and the definition does not differ except for not allowing multiple analysis pipelines. Today, Scout cannot load a case with multiple analyses. It will depend on which analysis for a case that finishes last. The previous will be overwritten. If a case has multiple analyses - how will each application know what to expect and how to process them (Housekeeper, Trailblazer, Scout)?

That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components and how the investigator wishes to integrate them. If you one-to-one map case and analysis that gets lost and has to be described somewhere else. You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed. Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity right?

I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.
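A sketch of what such a whole-case description might contain, written as a Python dict standing in for the YAML file. All keys and the `launch_order` helper are hypothetical; the analysis names are borrowed from the example earlier in the thread.

```python
CASE_DESCRIPTION = {
    "case_id": "2020_1",
    "primary_analyses": [
        {"name": "funcobra", "sample_ids": ["lims_id_1", "lims_id_2"], "method": "mip-dna"},
        {"name": "quicksnail", "sample_ids": ["lims_id_3"], "method": "mip-rna"},
    ],
    "integrating_analyses": [
        {
            "name": "crackpanda",
            "inputs": ["funcobra", "quicksnail"],
            "method": "combine_mip-dna_mip-rna",
        },
    ],
}


def launch_order(description):
    """List primary analyses first, then integrating analyses.

    An integrating analysis is only accepted if all of its inputs
    are already in the order.
    """
    order = [analysis["name"] for analysis in description["primary_analyses"]]
    for step in description["integrating_analyses"]:
        missing = [name for name in step["inputs"] if name not in order]
        if missing:
            raise ValueError(f"{step['name']} depends on unknown analyses: {missing}")
        order.append(step["name"])
    return order
```

The same description could in principle be read by different tools, for example to decide what to start and which results belong together, which is the role argued for in this comment.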

To muuuuch teeeexxxttt .....

@dnil train on twitter! 🤣

And see how much good that is doing the world? 😜

henrikstranneheim commented 4 years ago

That is a question to the crux of the matter! The case needs to contain information about what components it has - or rather perhaps have those components and how the investigator wishes to integrate them.

Yes, by including the sample_ids, type of analysis and delivery

If you one-to-one map case and analysis that gets lost and has to be described somewhere else.

Exactly, this is intentional. If a case has multiple responsibilities - things become very complicated across the systems really fast, as each application needs to know and handle many different situations. In this model the case will have a single responsibility and end-point.

You will not be able to deduce automagically how to treat the different analyses or individuals without specification. External information is needed.

Yes, and that is precisely why it deserves another process, which has the sole purpose of linking information or postprocessing multiple analyses. This process most likely has: sample_ids, type of analysis and delivery. Maybe we should rename the "case_id" in this model to "case_processes"

Where would you encode and act on that? Presumably not just in some lines somewhere in CG, but some config, for some aggregation entity right?

Once again, it will depend on the context and the defined case_process that should be run. But at least the process can be limited in scope to only the applications that are the intended target of the process and not the entire system.

I would much rather have a yaml or such describing the case as a whole, which could be used both to launch the different primary analyses and different integrating analyses in trailblazer, and what data to ask HK for in e.g. Scout.

This is what we have today and it is broken across cg, hk, mip, balsamic, tb and Scout.

dnil commented 4 years ago

Let's try points:

  1. Individuals are good. 😄 Just implement it.
  2. Individuals won't solve the issue of how to start the right analyses and integration steps magically. There are still going to be complications with specifying that. Introduce a new concept that can contain/act on the results of multiple analyses - whatever you wish to name it.

I would start from a few use "cases" - imagined referrals in a multi-omics enabled lab - and see what components you need. I am convinced you will see complications spawning from trying to handle all integration implicitly by automagically pulling different information for the same individual without regard to the medical investigation at hand.

annaengstrom commented 4 years ago

Micro samples do not have case_id currently. If the plan is to implement that and subject_id/individual_id, I think subject_id is probably a better fit to include micro samples as well.

J35P312 commented 4 years ago

I agree! individual_id is too human centric!

subject_id, or maybe entity_id would be more flexible =P

annaengstrom commented 4 years ago

It should be clear, but as generic as possible, and easy to understand for us and our customers.

dnil commented 4 years ago

In case of mord samples we may also not know the individual_id precisely. Are micro samples sometimes not taken directly from individuals but rather from points in the environment? sample_spawner_id?

Mropat commented 2 years ago

Stale issue, closing