NCBI-Hackathons / NCBI_August_Hackathon_Push_Button_Genomics_Solution

Creative Commons Zero v1.0 Universal

Write script to filter VCF #8

Open eweitz opened 9 years ago

eweitz commented 9 years ago

Write a script to apply the user's filter selections to an annotated VCF.

Input parameters:

Output:

@nicovbing, this is the issue we discussed.

eweitz commented 9 years ago

@vlaufer, for a complete list of filters, see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/blob/86be0185a497490e01997051d93281dfba7d4a5b/django/browser/templates/browser/upload.html#L34. The facets are in the 'legend' elements, e.g. "Clinical significance"; filters are in the input checkbox elements, e.g. "Pathogenic".
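As a rough illustration of how the facet and filter names could be pulled out of such a template, here is a minimal Python sketch using only the standard library. The sample markup is hypothetical (it is not copied from upload.html), and the class name `FacetParser` is made up for this example; it assumes each facet is a `legend` followed by its checkbox `input` elements.

```python
from html.parser import HTMLParser


class FacetParser(HTMLParser):
    """Collect {facet: [filters]} from legend and checkbox input elements."""

    def __init__(self):
        super().__init__()
        self.facets = {}          # facet name -> list of filter values
        self._in_legend = False   # True while inside a <legend> element
        self._legend_text = []    # text fragments of the current legend
        self._current = None      # facet that following checkboxes belong to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "legend":
            self._in_legend = True
            self._legend_text = []
        elif (tag == "input" and attrs.get("type") == "checkbox"
              and self._current is not None):
            self.facets[self._current].append(attrs.get("value"))

    def handle_data(self, data):
        if self._in_legend:
            self._legend_text.append(data)

    def handle_endtag(self, tag):
        if tag == "legend":
            self._in_legend = False
            self._current = "".join(self._legend_text).strip()
            self.facets.setdefault(self._current, [])


# Hypothetical markup, shaped like the description above.
SAMPLE_HTML = """
<fieldset>
  <legend>Clinical significance</legend>
  <input type="checkbox" name="clinsig" value="Pathogenic"> Pathogenic
  <input type="checkbox" name="clinsig" value="Benign"> Benign
</fieldset>
"""

parser = FacetParser()
parser.feed(SAMPLE_HTML)
print(parser.facets)
# {'Clinical significance': ['Pathogenic', 'Benign']}
```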

eweitz commented 9 years ago

@vlaufer, following up on https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/pull/15#issuecomment-129463757, please refer to my Gitter comments from last night beginning at https://gitter.im/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution?at=55c8206b21801cd866ca7c22.

For convenience, I've copied them below:

Regarding "am i correct in assuming that currently the front-end only supports AND statements for variant filtering?" -- not quite.

Filters within a facet get combined with a Boolean OR, whereas filters across different facets get combined with a Boolean AND.

Consider the facets and filters at left in http://www.ncbi.nlm.nih.gov/variation/view/#data. If one were to select "Variant type: single nucleotide variant" and "Variant type: insertion", then the result set would be variants that are of "Variant type: single nucleotide variant" OR "Variant type: insertion" -- e.g. http://www.ncbi.nlm.nih.gov/variation/view/?filters=vartype:single-nucleotide-variant,insertion#data

(Here "Variant type" is a facet, and "single nucleotide variant" and "insertion" are filters within that facet.)

Now consider the filter selection "Variant type: single nucleotide variant", "Variant type: insertion", "Clinical significance: Pathogenic"

In Boolean terms, that selects variants that have ("Variant type: single nucleotide variant" OR "Variant type: insertion") AND "Clinical significance: Pathogenic". For example, http://www.ncbi.nlm.nih.gov/variation/view/?filters=vartype:single-nucleotide-variant,insertion+clinsig:pathogenic#data
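The OR-within-a-facet, AND-across-facets rule can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name `matches`, the dictionary layout of a variant, and the sample records are all assumptions made for the example.

```python
def matches(variant, selections):
    """Return True if the variant satisfies every selected facet.

    variant:    dict mapping facet key -> the variant's value for that facet
    selections: dict mapping facet key -> set of selected filter values
    """
    return all(
        variant.get(facet) in allowed             # OR within a facet
        for facet, allowed in selections.items()  # AND across facets
    )


# ("single nucleotide variant" OR "insertion") AND "pathogenic"
selections = {
    "vartype": {"single-nucleotide-variant", "insertion"},
    "clinsig": {"pathogenic"},
}

# Hypothetical records, one value per facet.
variants = [
    {"id": "v1", "vartype": "insertion", "clinsig": "pathogenic"},
    {"id": "v2", "vartype": "deletion", "clinsig": "pathogenic"},
    {"id": "v3", "vartype": "single-nucleotide-variant", "clinsig": "benign"},
]

kept = [v["id"] for v in variants if matches(v, selections)]
print(kept)
# ['v1']
```

With no facets selected, `matches` returns True for every variant, which corresponds to an unfiltered result set.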

My recommendation: get the simple case of applying filters within a facet working (e.g. "Molecular consequence: missense" OR "Molecular consequence: nonsense"), then enhance that to work with filters in other facets (e.g. add on "Clinical significance: Pathogenic" to the molcons filter selections). Example of how filters could be passed into your script: "molcons:missense,nonsense+clinsig:pathogenic".
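One possible way to read such a string, sketched in Python under the assumption that "+" separates facets, ":" separates a facet from its filters, and "," separates filters. The function name `parse_filters` is hypothetical.

```python
def parse_filters(spec):
    """Parse 'facet:f1,f2+facet2:f3' into {facet: [filters]}."""
    selections = {}
    for group in spec.split("+"):
        if not group:
            continue  # tolerate an empty string or a stray '+'
        facet, sep, values = group.partition(":")
        if not sep or not facet or not values:
            raise ValueError(f"malformed facet group: {group!r}")
        # Filters of the same facet accumulate; they are later OR-ed together.
        selections.setdefault(facet, []).extend(
            v for v in values.split(",") if v
        )
    return selections


print(parse_filters("molcons:missense,nonsense+clinsig:pathogenic"))
# {'molcons': ['missense', 'nonsense'], 'clinsig': ['pathogenic']}
```

Starting with a single facet (for example "molcons:missense,nonsense") exercises only the OR case, and adding a second facet group then exercises the AND case without changing the parser.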

Regarding the complete list of filters, please see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/8#issuecomment-129294038

In addition to the example linked in my previous comment, see also the mockup of the Upload page -- https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/blob/master/docs/mockup_Upload%20page.png.

Being able to view the filters in an actual UI from a web browser would be ideal, but we will be using a different instance (a copy of the machine as of tomorrow at 10 AM) shortly. I'll email the group about how to work with the new instance this week -- hopefully it will be as simple as changing the machine we SSH into.

In brief, if you can get your script working with input to a "filters" argument like "molcons:missense,nonsense+clinsig:pathogenic", using the logic described in my quoted comment above, then I think that will largely get this feature implemented.

eweitz commented 9 years ago

I've reassigned this from Nicolas (@nicovbing) to Vincent (@vlaufer).

Nicolas began developing a general VCF filtering feature in an awk script, but switched to working on the "Predicted impact" script (https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/13). The commit linked from there shows a script to filter specifically for "Predicted impact".

Vincent is developing a generic VCF filtering module that will be called from Django.

dauss75 commented 9 years ago

Just to follow up on the current progress -- other than the filtering feature, are all the functions in the backend complete and linked to the frontend?

eweitz commented 9 years ago

@dauss75, no, I think we still need to integrate the various back-end pipeline components, and I know there's still work to do in Django and the UI.

I'll try to comment here later today with a more detailed overview of what remains to be done. I should be in our Gitter chat room from 7:00 PM - 10:00 PM Eastern Time tonight if anyone wants to discuss that way.

ohjuarez commented 9 years ago

Eric, Please let me know if you need help. Regards,

Octavio Juarez-Espinosa, PhD Contractor –MSC, Inc. Scripting Developer Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH



eweitz commented 9 years ago

Thanks Octavio! I did some work using the Solr output last night, and will indeed likely need help making some adjustments. I will open a separate issue for that soon, and ping you from there.

eweitz commented 9 years ago

Just to follow up on the current progress -- other than the filtering feature, are all the functions in the backend complete and linked to the frontend?

@dauss75, @vlaufer, please see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/16, which describes what I think is the largest remaining chunk of integration work needed to link the Django front-end and your pipeline backend.

ghost commented 9 years ago

Current syntax to use this script is:

python vcf_filter_v0.0.py vcf_name arg1 arg2 ... argn

where vcf_name is the base name of an input VCF file found in /home/ubuntu/segun/snakemake.testrun/results and having the suffix .annotated.vcf

for instance,

"dummy_1" would correspond to /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.vcf and the output file would be: /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.filtered.vcf

Thereafter the arguments are the same as the checkbox inputs, and should be space-delimited.
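The naming convention described above can be sketched in Python as follows. This is an illustration only, not the contents of vcf_filter_v0.0.py; the helper names `resolve_paths` and `split_args` are hypothetical, and the results directory is the one quoted in this comment.

```python
import posixpath

# Directory quoted in the comment above; adjust for another machine.
RESULTS_DIR = "/home/ubuntu/segun/snakemake.testrun/results"


def resolve_paths(base_name, results_dir=RESULTS_DIR):
    """Map a base name such as 'dummy_1' to its input and output VCF paths."""
    input_vcf = posixpath.join(results_dir, base_name + ".annotated.vcf")
    output_vcf = posixpath.join(results_dir, base_name + ".annotated.filtered.vcf")
    return input_vcf, output_vcf


def split_args(argv):
    """Split argv into (base name, list of space-delimited filter arguments)."""
    if len(argv) < 2:
        raise SystemExit("usage: vcf_name arg1 arg2 ... argn")
    return argv[1], argv[2:]


base_name, filters = split_args(["vcf_filter", "dummy_1", "SNV", "missense"])
input_vcf, output_vcf = resolve_paths(base_name)
print(input_vcf)
print(output_vcf)
print(filters)
```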

eweitz commented 9 years ago

Thanks @vlaufer.

You noted problems in getting SnpEff / Python-based VCF filtering to work in https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/16#issuecomment-129720510. (N.B.: Unanticipated difficulty is normal in software development!) Depending on what you prefer, we do have the option of bypassing the "1.0" filtering being implemented here and shifting our focus to filtering directly in Solr.

This issue -- and our basic UI design -- were created without knowing that we'd get a Solr developer added to our team and a Solr API working. When @ohjuarez serendipitously came on board late Tuesday morning, we remained on the path set late Monday afternoon due to time constraints in an effort to develop something demo-able by Wednesday. We ended up getting basic hooks into the Solr API from the view layer by mid Wednesday, but we're not leveraging Solr to its full capacity.

The risk in bypassing this filtering and going directly to Solr is that we don't get a basic application with filtering working end-to-end -- i.e. select a few filters, upload a file, get sent to a page showing filtered results for that upload. Continuing on our current SnpEff / Python approach to filtering (even if it isn't technically ideal) would likely get us to that basic state sooner, and could also help develop knowledge of SnpEff and Python.

If we continue this filtering approach, then when we finish it we could tag a release in GitHub and perhaps image an AMI that contains the working application. People could use the application by spinning up the AMI. We could then begin the Solr approach.

Or we could bypass the working-application intermediate, and pivot to the Solr approach now. The benefit here would be that we could focus on incorporating liftover / remap from GRCh37 to GRCh38 and "predicted impact" annotation enhancements.

Given that, would you prefer to continue this approach to filtering, or shift our focus to filtering directly in Solr?

ghost commented 9 years ago

@eweitz - I think as the more experienced developer, that is your call.

My python script essentially just designed and then issued commands to bash. So, I decided just to try to rewrite it in bash, which I am doing now. There is a good possibility I will have a script working that does what we want fairly soon, but I think that we should make the decision based on what is best, not what my humble skills allow.

As such, thanks for the offer but I think I'll defer to you to decide. In the meantime, I am going to keep hacking away at this and try to get something that does what we want in either bash or python.

Vincent

eweitz commented 9 years ago

@vlaufer, OK, let's spend another day working on this non-Solr approach to filtering. If that doesn't work by the end of tomorrow, let's pivot to the Solr approach.

ghost commented 9 years ago

Understood. If I can't get to it by then, I'll start on other things.

ghost commented 9 years ago

Hi Eric. I have to work on a paper I'm writing today and will not be able to get back to work on the project. I am sorry for the hold-up. I think that ultimately the novelty of the Solr approach is substantial and interesting anyway, so maybe it's for the best.

ghost commented 9 years ago

@eweitz I realize this is a bit late, but I uploaded a working script that incorporates all the features discussed. It is vcf_filter_v0.0.sh

Example usage:

bash vcf_filter_v0.0.sh dummy_1 SNV missense MODERATE nonsense

will construct then execute the command

java -jar /home/ubuntu/vlaufer/snpeff/snpEff/NCBI_August_Hackathon_Push_Button_Genomics_Solution/SnpSift.jar filter ( VC = 'SNV' ) & ( ANN[0].EFFECT has 'missense_variant' | ANN[0].EFFECT has 'nonsense_variant' ) & ( ANN[0].IMPACT has 'MODERATE' ) /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.vcf > /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.filtered.vcf

The first argument must be the file name without its suffix (e.g. dummy_1), and the file itself should have the suffix .annotated.vcf (but that should be so after VCF annotation anyway).

The remainder of the arguments can be in any order. Because this script uses SnpEff, it can handle any number of transcripts.
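For readers who want to see how such an expression could be assembled, here is a minimal Python sketch. It is not the contents of vcf_filter_v0.0.sh: the lookup table, the facet order, and the function name `build_expression` are assumptions reverse-engineered from the single example above, and only the four arguments shown there are covered.

```python
# Hypothetical lookup table: checkbox value -> (SnpSift field, operator, value).
FILTER_TABLE = {
    "SNV": ("VC", "=", "SNV"),
    "missense": ("ANN[0].EFFECT", "has", "missense_variant"),
    "nonsense": ("ANN[0].EFFECT", "has", "nonsense_variant"),
    "MODERATE": ("ANN[0].IMPACT", "has", "MODERATE"),
}

# Assumed fixed order of facet groups, matching the example command.
FACET_ORDER = ["VC", "ANN[0].EFFECT", "ANN[0].IMPACT"]


def build_expression(args):
    """Group arguments by field: OR within a field, AND across fields."""
    groups = {}
    for arg in args:
        if arg not in FILTER_TABLE:
            raise ValueError(f"unknown filter argument: {arg!r}")
        field, op, value = FILTER_TABLE[arg]
        groups.setdefault(field, []).append(f"{field} {op} '{value}'")
    clauses = [
        "( " + " | ".join(groups[field]) + " )"
        for field in FACET_ORDER
        if field in groups
    ]
    return " & ".join(clauses)


expr = build_expression(["SNV", "missense", "MODERATE", "nonsense"])
print(expr)
# ( VC = 'SNV' ) & ( ANN[0].EFFECT has 'missense_variant' | ANN[0].EFFECT has 'nonsense_variant' ) & ( ANN[0].IMPACT has 'MODERATE' )

# Argument list for a SnpSift call; "SnpSift.jar" and the VCF name are
# placeholders. Passing the expression as one list element (for example to
# subprocess.run) avoids shell quoting problems with '&', '|' and parentheses.
command = ["java", "-jar", "SnpSift.jar", "filter", expr, "dummy_1.annotated.vcf"]
```

Because arguments are grouped by field before joining, their order on the command line does not affect which facets are AND-ed together.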