Exporting variants from gene symbol page

a1618617 commented 3 years ago

Hi guys,

Is there a way to export variants from the gene symbol page? Example attached. We often get asked "have you seen this Gene X in your cohort?".

Thanks

[Attached screenshot: Screen Shot 2021-01-11 at 9 23 45 am]

sksmi commented 3 years ago

Not currently, have been thinking about whether this is a required feature - currently my recommendation is to use an analysis because:

it's repeatable & auditable
it's designed for download, filtering etc.
much more control over variant zygosity, counts etc.
better for sharing with the team/collab. etc.

https://variantgrid.com/analysis/2248/

@davmlaw thoughts?

a1618617 commented 3 years ago

How well will VG cope with having >100 exomes @ 100X coverage in a cohort analysis? It seems memory- and storage-intensive for querying just one gene under, say, a rare variant frequency.

sksmi commented 3 years ago

Not sure you'd want to do this via a cohort analysis for 100 exomes.. not really in a position to comment, but guessing Dave stores variants for the all variants node (e.g. as used in the analysis linked above) differently because it's very fast.

@davmlaw Also wondering whether instead of having a list of variants on the gene page it would be better to have a link to create a gene-specific analysis..? e.g. https://variantgrid.com/analysis/2249/ (not this exact analysis, just an example). Could also make the gene page quicker to load as well?

a1618617 commented 3 years ago

@sksmi

I need a baby between your analysis above and a cohort of unresolved/partially resolved cases (~100). What is the best way to achieve this?

Is there a way to search for variants in a gene across all samples in a project, e.g. Genomic Autopsy?

sksmi commented 3 years ago

Um.. not sure I can help with the baby bit, but can prob solve the cohort q. ;)

Assuming you mean that you'd like to analyse this gene in a cohort? Need a bit more info before I can help...

a1618617 commented 3 years ago

Ideally, what I would want to do is take all samples from VCFs assigned to "Genomic Autopsy" and create a cohort from them, then cross that with your analysis above (Variants in Database), using the cohort node to look for variants in my cohort of interest.

sksmi commented 3 years ago

For this gene given there's only about 5 variants of interest, I'd stick with the all variants node approach as you'll see pretty quickly what's GA. For a more general solution, you guys probably want to make a GA cohort in VG (I'd actually make 3 - mother, father & affecteds, so you can combine/subtract as needed) to use for these sorts of exploratory analyses..

It'll take a bit to make & run if you create it now given the # of samples, but it's possible to generate it and use it in the future - pretty quick once the cohort has been created. You might also want to update the GA data management SOP so that every time a new vcf is uploaded it's also added to the existing cohort as you'll all want to share the same cohort(s) rather than make your own.

Cohort approach is of course only useful if you want to make statements about GA specifically, otherwise if it's just a variant screen I'd still stick with the all variants nodes as who knows what might come up. Caveat is always that the samples haven't been joint called..
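
The three-cohort idea above (mother, father and affecteds, combined or subtracted as needed) can be sketched with plain Python sets. This is only an illustration of the set logic, not how VG stores or queries cohorts: the sample names, variant keys and the `variants_in_cohort` helper are all made up for the example.

```python
# Hypothetical data: sample name -> set of variant keys seen as het/hom alt.
sample_variants = {
    "fam1_mother": {"chr1:100 A>G"},
    "fam1_father": {"chr1:100 A>G", "chr3:300 G>A"},
    "fam1_proband": {"chr1:100 A>G", "chr2:200 C>T"},
}


def variants_in_cohort(sample_variants, cohort):
    """Return the union of variants seen in any sample of the cohort."""
    found = set()
    for sample in cohort:
        found |= sample_variants.get(sample, set())
    return found


# Three separate cohorts, so they can be combined or subtracted as needed.
mothers = {"fam1_mother"}
fathers = {"fam1_father"}
affecteds = {"fam1_proband"}

# Variants seen in affecteds but in neither parent cohort.
affected_only = variants_in_cohort(sample_variants, affecteds) - variants_in_cohort(
    sample_variants, mothers | fathers
)
```

Because the samples haven't been joint called, a variant missing from a parent in a subtraction like this may simply not have been called there, so the result is a screen rather than proof of absence.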

davmlaw commented 3 years ago

Sarah's right about the best way to do it being the all variants node + gene symbol.

But as it was easy to add download_grid_json_as_csv=True to that grid, you can now download it (well, next upgrade)
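
For anyone who wants the same kind of export outside VG, here is a rough sketch of turning grid-style JSON rows into CSV with the Python standard library. It is not VG's implementation of `download_grid_json_as_csv`; the `grid_rows_to_csv` function and the column names in the usage note are assumptions for illustration.

```python
import csv
import io


def grid_rows_to_csv(rows, columns):
    """Serialise a list of dict rows to CSV text, keeping only `columns`."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells rather than raising an error.
        writer.writerow({col: row.get(col, "") for col in columns})
    return buf.getvalue()
```

Called with rows such as `[{"variant": "chr1:100 A>G", "zygosity": "HET"}]` and columns `["variant", "zygosity"]`, it returns a header line followed by one line per row.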

sksmi commented 3 years ago

@davmlaw can you confirm which variants are/aren't shown on the gene page? My understanding is that the table is filtered by zygosity calls, e.g. hom_ref somatics won't be visible? Will add details to docs.

davmlaw commented 3 years ago

A variant is shown on the gene page if we have any of:

germline counts >= 1 (ie any het/hom alt)
classification
variant tag
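
The three conditions above amount to a simple "any of" check. A minimal sketch, assuming a made-up `VariantSummary` record (the field names are illustrative and not VG's schema):

```python
from dataclasses import dataclass


@dataclass
class VariantSummary:
    """Hypothetical per-variant summary; field names are illustrative only."""

    het_count: int = 0
    hom_alt_count: int = 0
    has_classification: bool = False
    has_tag: bool = False


def shown_on_gene_page(v: VariantSummary) -> bool:
    """True if the variant meets any of the three listed conditions."""
    # "germline counts >= 1" means at least one het or hom alt call.
    germline_count = v.het_count + v.hom_alt_count
    return germline_count >= 1 or v.has_classification or v.has_tag
```

Under this reading, a variant with only hom_ref calls and no classification or tag would not appear in the table.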

sksmi commented 3 years ago

Download csv works.

Added text & new page to docs: https://github.com/SACGF/variantgrid_docs/blob/master/genes/gene_page.md https://github.com/SACGF/variantgrid_docs/blob/master/genes/gene_symbol.md

sksmi commented 3 years ago

@davmlaw can you do a quick review of the pages above and check all is ok? Also, couldn't format the page for some reason.

davmlaw commented 3 years ago

The gene symbol page was lacking a ".md" extension. I've fixed that and added it to the index so it shows.

I filled out the genes page with more information about how gene annotations work.

a1618617 commented 3 years ago

Hi,

Can we please reopen this for discussion, especially filtering based on cohort?

Hamish has been asking us to look for specific genes in just the GA cohort. Is there a better approach than just using the variant database > filtering by impact/population > manually clicking each variant page to see who the variant belongs to?

Thanks, Thuong and @PeerArts

davmlaw commented 3 years ago

Create an analysis, create a cohort node for each VCF that contains GA samples, then put a gene filter beneath them?

PeerArts commented 3 years ago

LOL @davmlaw are you kidding me? There are way too many different .vcfs to do that.

I think the 'all variants' node would work ok-ish if the genomics collaboration data weren't in there. Most of the variants we see are in that cohort, possibly because of the freebayes caller.

sksmi commented 3 years ago

@PeerArts haha, we can't organise your data for you - that's your job. There's already a GA cohort - didn't take that long and you can add as you go along.

PeerArts commented 3 years ago

Organising data is not my job, but it would be nice to have an easier way to create cohorts from different .vcfs in VG. @davmlaw, are you happy for us to create a massive GA cohort with >160 trios/quads for this? Maybe VG has improved, but earlier I wasn't able to keep adding samples from our >20 different .vcfs to the same cohort (currently increasing number of .vcfs every other week), because it always broke any analysis I tried to do. I just thought it would make life so much easier if we could 'just' upload .vcfs as GA-project .vcfs and only do a project-based cohort analysis.

davmlaw commented 3 years ago

The biggest issue with cohorts at the moment is the mega VCF that has hundreds of samples and 20M variants - doing anything with that (including creating a new cohort from it) breaks things as I don't have enough free space on the virtual machine to make temporary queries.

Can we move discussion to #322 Multi VCF Analysis? I think I can make a source node that does this for you - maybe using "project" to select VCFs assigned to GA.
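
A rough sketch of what project-based selection could look like in principle: pick the samples from every VCF assigned to a project, then report which of those samples carry each variant in a gene. The `Vcf` and `Call` records and both functions are hypothetical and say nothing about how the source node in #322 would actually be built.

```python
from dataclasses import dataclass


@dataclass
class Vcf:
    """Hypothetical VCF record with its assigned project and sample names."""

    name: str
    project: str
    samples: list


@dataclass
class Call:
    """Hypothetical het/hom alt call of one variant in one sample."""

    sample: str
    gene_symbol: str
    variant: str


def samples_for_project(vcfs, project):
    """Collect sample names from every VCF assigned to the project."""
    return {s for vcf in vcfs if vcf.project == project for s in vcf.samples}


def gene_hits_by_sample(calls, samples, gene_symbol):
    """Map each variant in the gene to the project samples that carry it."""
    hits = {}
    for call in calls:
        if call.sample in samples and call.gene_symbol == gene_symbol:
            hits.setdefault(call.variant, set()).add(call.sample)
    return hits
```

The second function returns the "who does this variant belong to" answer directly, which is the step that currently needs a click through each variant page.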

SACGF / variantgrid

Exporting variants from gene symbol page #181