Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

Refactor RNA handling #4816

Closed dnil closed 2 months ago

dnil commented 2 months ago

We are in the process of adding short-term improvements to RNA data loading (e.g. https://github.com/Clinical-Genomics/scout/pull/4815), in particular to get tomte data loading the way we did for MIP RNA. At the same time, we are reworking the way this is handled in the CG Solna integration (https://github.com/Clinical-Genomics/cg/discussions/3669). This seems like a good time to rework how cases and analyses are handled in Scout loading.

Primary loading is done via scout load case config.yaml. These YAML load configs (https://clinical-genomics.github.io/scout/admin-guide/load-config/) describe the case and its samples:

owner: cust004

family: '1'
samples:
  - analysis_type: wes
    sample_id: NA12878
    capture_kit: Agilent_SureSelectCRE.V1
    father: 0
    mother: 0
    sample_name: NA12878
    phenotype: affected
    sex: male

vcf_snv: scout/demo/643594.clinical.vcf.gz

RNA data is currently added "on top" of other data for a sample, although that sample may or may not be an actual RNA sample. E.g.

owner: cust004

family: '1'
samples:
  - analysis_type: wes
    sample_id: NA12878
    capture_kit: Agilent_SureSelectCRE.V1
    father: 0
    mother: 0
    sample_name: NA12878
    phenotype: affected
    sex: male
    rna_alignment_path: ./RNA12878.cram
    rna_coverage_bigwig: ./tmp/RNA12878.coverage.bw

I propose we instead allow additional WTS and other omics samples to enter the samples list. These can then be formally connected to other samples by sharing a subject_id (an already existing LIMS ID key on samples), or informally through the customer analyst's own knowledge. In very few instances does Scout actually need to know who is who, other than to help the analyst avoid misreading sample names when looking at a view with multiple alignments.
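To make the subject_id linking concrete, here is a minimal sketch of how samples in a parsed load config could be grouped per individual. It is not Scout code: the function name group_samples_by_subject, the subject_id value subject_1, and the sample entries RNA12878 and NA12877 are assumptions made for illustration, and the dict only mimics what a parsed config might look like under the proposal.

```python
from collections import defaultdict

# Hypothetical parsed load config under the proposal: the WTS sample is its own
# entry in the samples list and shares a subject_id with the DNA sample.
case_config = {
    "owner": "cust004",
    "family": "1",
    "samples": [
        {"sample_id": "NA12878", "analysis_type": "wes", "subject_id": "subject_1"},
        {"sample_id": "RNA12878", "analysis_type": "wts", "subject_id": "subject_1"},
        {"sample_id": "NA12877", "analysis_type": "wes"},  # no subject_id given
    ],
}


def group_samples_by_subject(samples):
    """Return (groups, unlinked).

    groups maps each subject_id to the sample_ids that share it.
    unlinked lists the sample_ids that carry no subject_id at all.
    """
    groups = defaultdict(list)
    unlinked = []
    for sample in samples:
        subject_id = sample.get("subject_id")
        if subject_id:
            groups[subject_id].append(sample["sample_id"])
        else:
            unlinked.append(sample["sample_id"])
    return dict(groups), unlinked


linked, unlinked = group_samples_by_subject(case_config["samples"])
print(linked)
print(unlinked)
```

Samples without a subject_id are kept rather than rejected, which matches the point that Scout rarely needs to know who is who: the linking is a convenience for the analyst, not a requirement for loading.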

The existing RNA and DNA build keys, which affect the whole case (including the multisample variant VCF files), can still be used to define the build for the respective analysis types.