arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
318 stars 120 forks source link

release prep #912

Open brentp opened 5 years ago

brentp commented 5 years ago

It's time to make a new release this issue will track what needs to be done. Please add any essential (and only essential) updates as comments to this issue:

jxchong commented 5 years ago

presumably issue #903 is included as well as issue #843

jxchong commented 5 years ago

When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?

brentp commented 5 years ago

I updated the list to include those items. thanks for the reminders

pfpjs commented 5 years ago

Taking de novos into account in the _comphet inheritance model would be nice!

arq5x commented 5 years ago

I think it would be great to add support for de novos when reporting candidate CHs. The challenge is that one cannot be confident that it causes a true CH without read-backed phasing in the common case of trios. As such, this would have to either be part of the relaxed priority CH searches or a specific mode.

pfpjs commented 5 years ago

Yeah, the variant could be on the same allele as the inherited one and thus not be a real comp_het, but perhaps a new lenient mode, or a flag for these specific cases would help. Also, this applies for diploid organisms (human specifically, in my case), for other ploidies, things could get ugly fast.

jxchong commented 5 years ago

Agree it'd be nice. I've used this hack to get at the results:

Run comp_hets --max-priority 2 which will give you all potential CH pairs including those that are de_novo + 2nd allele. This yields a lot of false positives, but you can winnow down the results by cross-referencing the gene list with the results of de_novo and looking for overlapping genes.

pfpjs commented 5 years ago

@jxchong, thank you for that hint!

brentp commented 5 years ago

I'll see what I can do about comp-het + DN. Though it can't be phased, it could be a special category.

bgruening commented 5 years ago

@brentp https://github.com/arq5x/gemini/issues/911 and https://github.com/arq5x/gemini/issues/910 are two regressions from the 0.20 version.

wm75 commented 5 years ago

Looks like dgidb-related functionality only requires a URL update to get it working again.

brentp commented 5 years ago

@wm75 can you make a PR or let me know the URL? I'm not familiar with dgidb.

wm75 commented 5 years ago

Sure, just need to get my laptop to look it up.

wm75 commented 5 years ago

Ok, so you'd use:

dgidb_url = 'http://www.dgidb.org/api/v2/interactions.json?genes='

to re-enable things, at least, that's the minimal change.

What I'd prefer though would be if URLs (the dgidb, the install-data url, others?) became part of the config file. This would give, e.g., Galaxy Admins a single place to locate these URLs and update them if they break at some point. I realize that is a slightly bigger change, but if you have a little bit of time before the release, I think it would help a lot with long-term support of any gemini install.

wm75 commented 5 years ago

@bgruening we've been talking about this before. The above line is the patch to https://github.com/arq5x/gemini/blob/8a3e7571b15cec31f701161f6efe38bc624be028/gemini/dgidb.py#L28 that you'd have to apply to gemini on usegalaxy.eu to make the actionable mutations and the --dgidb option of the query tool work again.

brentp commented 5 years ago

great. thank you! so the dgidb stuff is getting used? If so, I can keep it as long as it's this simple URL change to fix.

bgruening commented 5 years ago

@wm75 fixed on our server.

wm75 commented 5 years ago

@bgruening it's working :smile:

@brentp yeah, we have an ongoing project with clinicians, for whom being able to get at gene-drug interactions is a major selling point. So I'd rather like to see more such functionality than less.

brentp commented 5 years ago

When you update gnomad, will you include the subcohorts (i.e. frequencies from the controls-only and non-neuro cohorts)?

@jxchong are these the only additions from gnomad? I'll put a list of what I plan to add for final verification, but an initial set would be helpful.

pfpjs commented 5 years ago

Any possibility of updating CADD to the new v1.4?

brentp commented 5 years ago

I have added the CADD update to the list.

brentp commented 5 years ago

All, I have implemented the comp-het + denovo (CHDN) in a specific way and I'm looking for problems I might not have foreseen. Currently, the highest priority is 1. I have made a "good" CHDN candidate have a priority of 1.5 so one can filter with that (I will change the gemini parameter to accept floats).

A good candidate for kid, mom, dad would be:

# het inherited from mom:
A/T, A/T, A/A
# DN
C/T, C/C, C/C

A somewhat arbitrary requirement that I have added is that the DN HET can not occur in any unaffected sample in the family. So, if we add an unaffected sib to the above trio so it's kid, mom, dad, sib and the sib also has the de novo:

# het inherited from mom only in proband:
A/T, A/T, A/A, A/A
# DN in both kids
C/T, C/C, C/C, C/T 

this is not reported as a candidate since an unaffected (sib) shares the DN. This could miss cases where it was a germ-line mosaic, but I think it will also remove a lot of false-positives.

Any thoughts on this approach? I want to avoid extra flags and arguments and just provide a 95% solution here.

oleraj commented 5 years ago

Another possible addition is gnomad exomes update to 2.1: http://gnomad.broadinstitute.org/downloads

Update -- never mind, just saw you already mentioned gnomad exomes in your first list.

brentp commented 5 years ago

@oleraj that's part of the gnomad item above. I'm testing that now. The file is much larger so I'm trying to cull it a bit.

oleraj commented 5 years ago

Will this be a separate genetic model, or will it be part of the existing comp_hets? The approach seems fine to me for now (others should weigh in too).

brentp commented 5 years ago

it will be part of comp-het and not require any change (other than bumping the max-priority to > 1.5

jxchong commented 5 years ago

@brentp Approach seems fine to me as well. If you are adding CHDN, it'd be nice to add simple multi-gen support to the de_novo model as requested in #885 (i.e. de novo in generation 1, and passed down as autosomal dominant in child in generation 2) as de novo->AD is more common than CHDN. I think it'd be relatively simple to add without requiring a ton of flags because I think you could simply change the requirements such that in a given family, the variant must be de novo at least once (two unaffected parents who are not carriers and their carrier child) while allowing for affected offspring of that child to be het.

brentp commented 5 years ago

@jxchong would you have a look at: https://github.com/arq5x/gemini/commit/513f53fe2a2e07c25af6e7dd27db74ac25388339 ? and verify it meets your needs for the non-neuro and controls AFs? I just took those 2 fields from the VCF and add them to the database.

jxchong commented 5 years ago

@brentp commit 513f53f looks good. Do you think others might have a use for the non-cancer control set too? Perhaps yes, especially when analyzing mosaic variants? @oleraj

brentp commented 5 years ago

Is anyone against removing ESP? Given it's size relative to newer resources, it would drop a few columns...

jxchong commented 5 years ago

Hm, we'd like to see ESP still be included in max_aaf_all (not all ESP samples can be incorporated into gnomAD and if something is relatively common in ESP because of those samples, that's useful info) but don't need to actually output the database-specific frequency.

oleraj commented 5 years ago

I wasn't aware of the cancer samples in gnomad. Just read about this in a previous faq here. Yes, probably would be good to include the non-cancer control set.

brentp commented 5 years ago

ok. I have added that. I am building (and re-building) dbsnp and CADD. Once those are done, my planned development is done. I will not update the gene_detailed and gene_summary stuff. If there are any other high-priority issues, please let me know in the next couple of days.

pfpjs commented 5 years ago

I may be able to tackle the gene_summary and gene_detailed table updates, but I will only be able to do it during next weekend.

brentp commented 5 years ago

@pfpjs that would be great! I will see how things look and may make a release before then and make a point release to incorporate your updates and any changes needed after everyone gets to try the new version.

brentp commented 5 years ago

I have updated the inheritance module so anyone interested can test the comp-het + de novo and the transmitted de novo stuff by doing: pip install -U inheritance to get v0.1.5 before we make the gemini release.

oleraj commented 5 years ago

Some issues from my wish list for potential inclusion:

As you're re-building CADD, is there any way you could include the CADD indels too, preferably merged with the SNPs? #833

Related to dbNSFP, would you be able to address #886 ? Basically, I think we can either modify entries in dbNSFP to convert = to something else while re-building the database for gemini, or just remove this line from geneimpacts/effect.py

assert not "=" in effect_string

We ran into a few errors in running tests last time we tried to install from master, documented in #901, in case there are any quick fixes you might notice. Some of them seem pretty trivial, e.g., gemini/test/unicode.vcf.gz was missing -- probably inadvertently deleted from the repository at some point.

We also had a discussion in #663 around prioritizing canonical transcripts, which seems to be implemented in geneimpacts>=0.3.4. If there is a simple way to incorporate the latest version of geneimpacts and maybe add a --prioritize_canonical flag to gemini load, that would be helpful.

brentp commented 5 years ago

@oleraj I have responded on a few of those issues and #901 is fixed. I will incorporate the latest version of geneimpacts when making the release. For others, I'd prefer those that need that functionality to contribute as there are relatively simple work-arounds for those.

oleraj commented 5 years ago

Great, thanks @brentp.

jxchong commented 5 years ago

Oh! What about including CCRs as an annotation? :)

arq5x commented 5 years ago

Agree; CCRs are in the larger plan, but will leave it to Brent if it will make it into this release.

pfpjs commented 5 years ago

I've managed to fix and update the generation of detailed_gene_table and summary_gene_table, can anyone else take a second look? Thanks!

brentp commented 5 years ago

Awesome! Paulo, thanks so much. I will have a look on Tuesday, then I think we are ready for the next release.

I think we'll let the dust settle from this release and then add CCRs if there is interest. It's simple to add CCRs via gemini annotate (or using the new loader) but I get that it would be convenient to have it as one of the install datasets.

Phillip-a-richmond commented 5 years ago

Late to the party but I look forward to the new release!

Will some aspects of the update (non database-related) also remain functional with the VCFAnno + VCF2DB pipeline? Specifically thinking about gene_summary and gene_detailed, and if there's a way to leverage these improvements within my current BCFTools (soft-filter), SNPeff, VCFAnno, VCF2DB pipeline.

Or do you now suggest switching back to the gemini load + gemini annotate to get extra features (e.g. in-house database).

Thanks! Phil

brentp commented 5 years ago

@Phillip-a-richmond you should continue using the new loader. this will be the last major release using the old loader. If there are features missing when using the new loader, then PR's would be great.

brentp commented 5 years ago

dgidb seems to be broken again. perhaps it is an intermittent server error but I don't plan to fix it. I will rely on those that use dgidb to keep that functioning.