RNA - aberrant expression and splicing (tomte)

dnil commented 9 months ago

tomte and other pipelines that include DROP components will produce shortlists of aberrant expression (OUTRIDER) and splicing (FRASER). These are currently given in tsv files with a bunch of columns, with a somewhat shaky hgnc_symbol as the common descriptor. FRASER files additionally include genomic coordinates, whereas OUTRIDER files have an ENSEMBL geneID.

We would like to display these results to the user, allow navigation to the region of interest in an RNA cram / splice junction track IGV.js view, and connect gene knowledge, gene panels, phenotype and other scout features (case, comments, events etc). If a wgs case exists for the same individual, linking them would be ideal, and filtering the DNA changes for the RNA events would be nice.

It is not completely obvious if it would be best to require VCF input instead, or have an intermediary process for good HGNC ID and/or known transcript identifiers as anchor points to feed to Scout, but one could tentatively start sketching views. Only the most important of the info from the rather extensive tsv tables will likely fit in a quickly browsable "variantS" table; the rest will likely have to go to a separate "variant" view.

dnil commented 4 months ago

Seems no longer blocked by gene names / gene id for the clinical track at least! Ignore research for now, or run it like fusion with unlinked gene symbols (they are presumably from the same annotation source). One of the files does not have hgnc_id, but does have symbols that appear to match the ENSG ids that are also available. Still tsv format, but it is very close to a BED. Possibly they can convert, but let's prep for tsv parsing.
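A minimal sketch of what such tsv parsing could look like, using only the Python standard library. The column names hgnc_id, geneID and hgnc_symbol are taken from this thread; the function name and the example rows are made up for illustration and are not scout's actual parser.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def parse_outlier_tsv(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per tsv row, keyed by the header names."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Assumption: a row is only useful if it can be anchored to a gene,
        # either through hgnc_id or through the ENSEMBL geneID.
        if not (row.get("hgnc_id") or row.get("geneID")):
            continue
        yield row


# Made-up example content; real files carry many more columns.
EXAMPLE_TSV = (
    "hgnc_id\tgeneID\thgnc_symbol\n"
    "1234\tENSG00000000001\tGENE1\n"
    "\t\tORPHAN\n"
)

rows = list(parse_outlier_tsv(io.StringIO(EXAMPLE_TSV)))
```

Here the second row is dropped because it has neither identifier. Whether such rows should be skipped or loaded with an unlinked symbol (as for fusions) is a design choice left open above.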

fevac commented 2 months ago

Hi, what's the plan for this? Could we have the tsv tables from fraser and outrider in scout?

dnil commented 2 months ago

Yes, with hgnc_id or ok gene symbols this should be possible, albeit a little reluctantly - a BED or other standard format would have been preferred.

If it's still open when I'm done with Poorpipe I will grab it!

fevac commented 2 months ago

that's great! There are hgnc_ids. Thanks

northwestwitch commented 1 month ago

Examples of outfiles in this folder: /home/proj/stage/analysis/cases/legalpeacock/analyse_transcripts/drop

northwestwitch commented 1 month ago

I mean on this @dnil, I can start if you are not working on it already

dnil commented 1 month ago

I do have a partial PR. So far it's been testing different ideas for frameworking this, but I think it starts to converge. If you want part of it maybe I can try and break out sub tasks.

dnil commented 1 month ago

What I have outlined is the addition of a new collection, omics_variants. Its documents can pertain to a wts analysis, and be of category outlier and sub_category fraser or outrider. I think this is an ok structure so far, but would be happy for opinions on the naming/structure here. We could skip the omics_variants level and name the new variant kind outliers instead, if we feel outliers is what we will get for the foreseeable future with transcriptomics and proteomics.

I still don't have a good feel for where we will end up with methylation. It could be on the aggregate outlier level, like these, with e.g. "hypermethylation" for a gene or promoter region. This would work well with "outlier". But we could also stop short of that and have something more like a base modification found at chr11, pos 1111111, for 33% of bases at the position, in which case omics_variants is the better name.

I also don't know exactly where we will end up with the difficult gene regions, as called with e.g. paraphase. Clearly there will be a lot more regions than the current SMN and FSHD only, so a general mechanism would be good. Maybe we can fit some of them as regular SNVs or SVs, but given the experience from SMN, probably not. In which case omics_variants could serve.

The omics_variants can be clinical or research. Files can be passed in the case config, parsed from tsv, variants loaded and connected to genes. I'm working on the display page.
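To make the proposed structure concrete, here is a rough sketch of how one parsed tsv row could be turned into an omics_variants document. The category, sub_category and clinical/research values come from the description above; every key name, the function name and the "raw" field are assumptions for illustration, not the layout used in the PR.

```python
from typing import Any, Dict, Mapping

ALLOWED_SUB_CATEGORIES = {"fraser", "outrider"}
ALLOWED_VARIANT_TYPES = {"clinical", "research"}


def build_omics_variant(
    row: Mapping[str, str],
    case_id: str,
    sub_category: str,
    variant_type: str = "clinical",
) -> Dict[str, Any]:
    """Build a dict that could be stored in an omics_variants collection."""
    if sub_category not in ALLOWED_SUB_CATEGORIES:
        raise ValueError(f"Unknown sub_category: {sub_category}")
    if variant_type not in ALLOWED_VARIANT_TYPES:
        raise ValueError(f"Unknown variant_type: {variant_type}")

    hgnc_id = row.get("hgnc_id")
    return {
        "case_id": case_id,
        "analysis_type": "wts",
        "category": "outlier",
        "sub_category": sub_category,
        "variant_type": variant_type,
        # Research files may lack hgnc_id, so allow an empty gene list.
        "hgnc_ids": [int(hgnc_id)] if hgnc_id else [],
        # Keep the original columns so nothing from the tsv is lost.
        "raw": dict(row),
    }


doc = build_omics_variant({"hgnc_id": "1234", "geneID": "ENSG00000000001"}, "case_1", "outrider")
```

Keeping the untouched row under a separate key is one way to defer the decision on which of the many tsv columns deserve first-class fields.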

It still lacks quite a few things:

tests
sample connection
links
update existing case, in particular keeping associated events
some events, like comments, should probably be able to describe omics_variants in addition to ordinary variants
the omics_variants display should have more clever connectivity to the regular variantS displays, e.g. by allowing for showing all variantS in non-dismissed FRASER and OUTRIDER outlier genes
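The last point in the list, showing DNA variants in non-dismissed outlier genes, could be sketched roughly as below. This works on plain dicts rather than database queries, and the keys "dismissed" and "hgnc_ids" are assumptions for illustration.

```python
from typing import Any, Dict, Iterable, List, Set


def outlier_gene_ids(omics_variants: Iterable[Dict[str, Any]]) -> Set[int]:
    """Collect HGNC ids from all outliers that have not been dismissed."""
    genes: Set[int] = set()
    for omics_variant in omics_variants:
        if omics_variant.get("dismissed"):
            continue
        genes.update(omics_variant.get("hgnc_ids", []))
    return genes


def variants_in_genes(
    variants: Iterable[Dict[str, Any]], hgnc_ids: Set[int]
) -> List[Dict[str, Any]]:
    """Keep the DNA variants that overlap at least one of the given genes."""
    return [v for v in variants if hgnc_ids.intersection(v.get("hgnc_ids", []))]


outliers = [
    {"sub_category": "outrider", "hgnc_ids": [10]},
    {"sub_category": "fraser", "hgnc_ids": [20], "dismissed": True},
]
dna_variants = [
    {"_id": "snv_a", "hgnc_ids": [10, 30]},
    {"_id": "snv_b", "hgnc_ids": [20]},
    {"_id": "snv_c", "hgnc_ids": []},
]
selected = variants_in_genes(dna_variants, outlier_gene_ids(outliers))
```

In a real view the same gene set would more likely be passed as a filter to the existing variantS query than applied in Python.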

A couple of notes about the latest input (tsv) files:

Frequency annotation is missing from the files. While it is currently the case that the events are outliers vis a vis the current full set of controls, this will in all likelihood not remain the case for long. How will the new events found in the new cases be fed back to the control panel?
Research files still lack hgnc_id (and much like RNAfusion keep some gene symbols not easily found on other dbs). But all seem to have ENSG geneID. Remains to be seen if we have enough of those ids populated: we are good on them for genes and transcripts that we parse for other reasons, but not necessarily for everything in ENSEMBL.
Column naming is not so systematic (many in camel case, some that use underscores, some that mix, some with ALL CAPS, some all lower case without any separators). There also seem to be some unused columns, especially for fraser, like GENE_COUNTS_FILE, SPLICE_COUNTS_DIR, DNA_ID and so on.
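Until the column names are harmonised upstream, a parser could normalise them on the way in. The sketch below is one possible approach, not what tomte or scout does: it converts camel case, ALL CAPS and mixed headers to snake case. The helper name is hypothetical, and names written in all lower case without separators cannot be split automatically and pass through unchanged.

```python
import re


def to_snake_case(name: str) -> str:
    """Normalise a tsv header name to lower-case snake case."""
    name = name.strip()
    # Insert an underscore where a lower-case letter or digit is followed
    # by a capital: "geneID" -> "gene_ID", "hgncSymbol" -> "hgnc_Symbol".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Treat runs of whitespace, dashes and dots as a single separator.
    name = re.sub(r"[\s\-.]+", "_", name)
    return name.lower()


normalised = [
    to_snake_case(header)
    for header in ("geneID", "hgnc_symbol", "GENE_COUNTS_FILE", "DNA_ID")
]
```

Columns judged unused, such as the file path columns mentioned above, could then be dropped by their normalised names in a single place.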

fevac commented 1 month ago

@Lucpen @jemten can you have a look at the comment above? I think it might be good to have a meeting with all of us to make sure we're on the same page and we know which changes have to be done both at the pipeline and scout level

jemten commented 1 month ago

Thanks for condensing the remaining issues and questions @dnil!

We currently haven't set up a system for feeding new cases into the database that we use with FRASER/OUTRIDER. But it's on the todo list. I remember having conversations with you and Jesper regarding the format for saving aberrant events, which we would need for annotation.
Regarding naming of columns, that is something we'll look at to see if we can harmonise. I believe that the issue is mainly with DROP (OUTRIDER/FRASER), but we are parsing the file anyway in tomte and we can try to make it more coherent.
I ran into Vincente at ESHG and raised the issue with hgnc_id, but I'm not entirely sure that I managed to convey the need.

I agree with @fevac that it would be good to have an alignment meeting so that we know we're solving the right issues.

Lucpen commented 1 month ago

I agree, it would be easier to have a meeting and discuss all of these.

dnil commented 3 weeks ago

Hi folks! Feedback on whether this is good enough for a minimum viable product, especially ideas about what else we should really show from the start, would be very welcome. See https://scout-stage.scilifelab.se/cust000/17161-RNA-5M_downsampled/omics_variants/outliers?variant_type=clinical for a demo!

jemten commented 3 weeks ago

Looks very nice Daniel! I think that it is good enough to start with and then we can see what kind of requests we get. A small comment would be to convey the type of expression outlier more clearly. Instead of having to read the snake case outlier name, the expression direction could perhaps be added to the type?

dnil commented 3 weeks ago

Good point! Added a direction icon, or a delta-psi symbol for splicers, and threw in links to the case SNV and SV views as well.
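For illustration only, the choice of symbol could look something like the sketch below. It assumes the expression direction can be read from the sign of a log2 fold change stored under an "l2fc" key, and that splicing outliers are identified by sub_category; both the key names and the function are hypothetical and not taken from the actual display code.

```python
from typing import Any, Mapping


def outlier_symbol(omics_variant: Mapping[str, Any]) -> str:
    """Return a short symbol describing the kind of outlier."""
    if omics_variant.get("sub_category") == "fraser":
        # Splicing outliers get a delta-psi symbol instead of a direction.
        return "Δψ"
    l2fc = float(omics_variant.get("l2fc", 0))
    if l2fc > 0:
        return "↑"
    if l2fc < 0:
        return "↓"
    # No usable fold change: direction unknown.
    return "?"


symbols = [
    outlier_symbol({"sub_category": "outrider", "l2fc": "-1.8"}),
    outlier_symbol({"sub_category": "outrider", "l2fc": 2.3}),
    outlier_symbol({"sub_category": "fraser"}),
]
```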

Clinical-Genomics / scout

RNA - aberrant expression and splicing (tomte) #4111