Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

RNA - aberrant expression and splicing (tomte) #4111

Open dnil opened 9 months ago

dnil commented 9 months ago

tomte and other pipelines including DROP components will produce shortlists of aberrant expression (OUTRIDER) and splicing (FRASER). These are currently given in tsv files with a bunch of columns, with a somewhat shaky hgnc_symbol as the common descriptor. FRASER files additionally include genomic coordinates, whereas OUTRIDER files have an Ensembl gene ID.

We would like to display these results to the user, allow navigation to the region of interest in an RNA cram / splice junction track IGV.js view, and connect gene knowledge, gene panels, phenotype and other scout features (case, comments, events etc). If a WGS case exists for the same individual, linking the two would be ideal, and filtering the DNA changes for the RNA events would be nice.

It is not completely obvious if it would be best to require VCF input instead, or have an intermediary process for good HGNC ID and/or known transcript identifiers as anchor points to feed to Scout, but one could tentatively start sketching views. Only the most important of the info from the rather extensive tsv tables will likely fit in a quickly browsable "variantS" table; the rest will likely have to go to a separate "variant" view.

dnil commented 4 months ago

Seems no longer blocked by gene names / gene id for the clinical track at least! Ignore research for now, or run it like fusion with unlinked gene symbols (they are presumably from the same annotation source). One of the files does not have hgnc_id, but does have symbols that appear to match the ENSG IDs that are also available. Still tsv format, but it is very close to BED. Possibly they can convert, but let's prep for tsv parsing.
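To illustrate how close the tsv is to BED, here is a minimal conversion sketch. The column names (`chrom`, `start`, `end`, `hgnc_symbol`) and the function name are assumptions for illustration, not the actual pipeline headers, and the sketch assumes the tsv coordinates are 1-based inclusive, which would need checking against the real files.

```python
import csv
import io


def fraser_tsv_to_bed(tsv_handle, bed_handle):
    """Write one BED line (chrom, start, end, name) per tsv row.

    Hypothetical column names; assumes 1-based inclusive tsv coordinates,
    so only the start is shifted to get BED's 0-based half-open interval.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        bed_start = int(row["start"]) - 1
        fields = [row["chrom"], str(bed_start), row["end"], row["hgnc_symbol"]]
        bed_handle.write("\t".join(fields) + "\n")


# Usage with in-memory, made-up example data
example_tsv = io.StringIO("chrom\tstart\tend\thgnc_symbol\nchr1\t101\t200\tGENE1\n")
example_bed = io.StringIO()
fraser_tsv_to_bed(example_tsv, example_bed)
```

Parsing the tsv directly avoids this coordinate question on the Scout side, which may be one reason to prep for tsv parsing regardless.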

fevac commented 2 months ago

Hi, what's the plan for this? Could we have the tsv tables from fraser and outrider in scout?

dnil commented 2 months ago

Yes, with hgnc_id or ok gene symbols this should be possible, albeit a little reluctantly - a BED or other standard format would have been preferred.

If it's still open when I’m done with Poorpipe I will grab it!

fevac commented 2 months ago

that's great! There are hgnc_ids. Thanks

northwestwitch commented 1 month ago

Examples of outfiles in this folder: /home/proj/stage/analysis/cases/legalpeacock/analyse_transcripts/drop

northwestwitch commented 1 month ago

I mean on this, @dnil: I can start if you are not working on it already

dnil commented 1 month ago

I do have a partial PR. So far it's been testing different ideas for a framework for this, but I think it is starting to converge. If you want part of it, maybe I can try to break out subtasks.

dnil commented 1 month ago

What I have outlined is the addition of a new collection with omics_variants, which can pertain to wts analysis, be of category outlier and sub_category fraser or outrider. I think this is an ok structure so far, but would be happy for opinions on the naming/structure here. We could skip the omics_variant level and name the new variant kind outliers instead, if we feel outliers is what we will get for the foreseeable future with transcriptomics and proteomics.

I still don't have a good feel for where we will end up with methylation: it could be on the aggregate outlier level, like these, with e.g. "hypermethylation" for a gene or promoter region. This would work well with "outlier". But we could also stop short of that and have something more like a base modification found at chr11, pos 1111111, for 33% of bases at the position, in which case omics_variants is the better fit.

I also don't know exactly where we will end up with the difficult gene regions, as called with e.g. paraphase. Clearly there will be a lot more regions than the current SMN and FSHD only, so a general mechanism would be good. Maybe we can fit some of them as regular SNVs or SVs, but given the experience from SMN, probably not. In which case omics_variants could serve.

The omics_variants can be clinical or research. Files can be passed in the case config, parsed from tsv, variants loaded and connected to genes. I'm working on the display page.
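As a rough illustration of the structure described above, a tsv row could be turned into an omics_variant document along these lines. This is only a sketch, not the code in the PR: the function name, the tsv column names (`hgnc_id`, `hgnc_symbol`, `chrom`, `start`, `end`, `gene_id`) and the document keys are assumptions, and `hgnc_id` is assumed to be a plain integer.

```python
import csv
import io

SUB_CATEGORIES = ("fraser", "outrider")


def parse_omics_tsv(handle, sub_category, variant_type="clinical"):
    """Yield one omics_variant dict per tsv row (hypothetical column names)."""
    if sub_category not in SUB_CATEGORIES:
        raise ValueError(f"unknown sub_category: {sub_category}")
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        variant = {
            "analysis_type": "wts",
            "category": "outlier",
            "sub_category": sub_category,
            "variant_type": variant_type,  # clinical or research
            "hgnc_symbol": row.get("hgnc_symbol") or None,
            # Assumes hgnc_id is given as a plain integer when present
            "hgnc_id": int(row["hgnc_id"]) if row.get("hgnc_id") else None,
        }
        if sub_category == "fraser":
            # FRASER files carry genomic coordinates
            variant["chromosome"] = row["chrom"]
            variant["start"] = int(row["start"])
            variant["end"] = int(row["end"])
        else:
            # OUTRIDER files carry an Ensembl gene ID instead
            variant["ensembl_gene_id"] = row["gene_id"]
        yield variant


# Usage with in-memory, made-up example data
example = io.StringIO("hgnc_id\thgnc_symbol\tgene_id\n5\tGENE1\tENSG_EXAMPLE\n")
outrider_variants = list(parse_omics_tsv(example, "outrider"))
```

Keeping category and sub_category as plain fields means a later methylation or paraphase loader could reuse the same collection with new values.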

It still lacks quite a few things:

A couple of notes about the latest input (tsv) files:

fevac commented 1 month ago

@Lucpen @jemten can you have a look at the comment above? I think it might be good to have a meeting with all of us to make sure we're on the same page and we know which changes have to be done both at the pipeline and scout level

jemten commented 1 month ago

Thanks for condensing the remaining issues and questions @dnil!

I agree with @fevac that it would be good to have an alignment meeting so that we know we're solving the right issues.

Lucpen commented 1 month ago

I agree, it would be easier to have a meeting and discuss all of these.

dnil commented 3 weeks ago

Hi folks! Feedback on whether this is good enough for a minimum viable product, and especially ideas about what else we should really show from the start, would be very welcome. See https://scout-stage.scilifelab.se/cust000/17161-RNA-5M_downsampled/omics_variants/outliers?variant_type=clinical for a demo!

jemten commented 3 weeks ago

Looks very nice, Daniel! I think that it is good enough to start with, and then we can see what kind of requests we get. One small comment: the type of expression outlier could be conveyed more clearly. Instead of having to read the snake case outlier name, the expression direction could perhaps be added to the type?

dnil commented 3 weeks ago

Good point! Added a direction icon, or a delta-psi symbol for splicers, and threw in links to the case SNV and SV views as well. (Screenshot attached: 2024-06-28 at 12:51:20)
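The direction marker could be derived roughly as follows. This is a sketch only: the function name and the `zscore` key are hypothetical, and using the sign of a z-score to decide over- versus under-expression is an assumption rather than what the display page necessarily does.

```python
def direction_label(variant):
    """Return a short marker for an outlier row in the variants table.

    Splicing outliers (fraser) get a delta-psi symbol; expression outliers
    (outrider) get an arrow from the sign of a hypothetical 'zscore' value.
    """
    if variant.get("sub_category") == "fraser":
        return "Δψ"
    zscore = variant.get("zscore")
    if zscore is None or zscore == 0:
        return ""  # no direction can be shown
    return "↑" if zscore > 0 else "↓"


# Usage with made-up values
up = direction_label({"sub_category": "outrider", "zscore": 4.2})
down = direction_label({"sub_category": "outrider", "zscore": -3.1})
splice = direction_label({"sub_category": "fraser"})
```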