Open eweitz opened 9 years ago
@vlaufer, for a complete list of filters, see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/blob/86be0185a497490e01997051d93281dfba7d4a5b/django/browser/templates/browser/upload.html#L34. The facets are in the 'legend' elements, e.g. "Clinical significance"; filters are in the input checkbox elements, e.g. "Pathogenic".
@vlaufer, following up on https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/pull/15#issuecomment-129463757, please refer to my Gitter comments from last night beginning at https://gitter.im/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution?at=55c8206b21801cd866ca7c22.
For convenience, I've copied them below:
Regarding "am i correct in assuming that currently the front-end only supports AND statements for variant filtering?" -- not quite.
Filters within a facet get combined with a Boolean OR, whereas filters across different facets get combined with a Boolean AND.
Consider the facets and filters at left in http://www.ncbi.nlm.nih.gov/variation/view/#data. If one were to select "Variant type: single nucleotide variant" and "Variant type: insertion", then the result set would be variants that are of "Variant type: single nucleotide variant" OR "Variant type: insertion" -- e.g. http://www.ncbi.nlm.nih.gov/variation/view/?filters=vartype:single-nucleotide-variant,insertion#data
(Here "Variant type" is a facet, and "single nucleotide variant" and "insertion" are filters within that facet.)
Now consider the filter selection "Variant type: single nucleotide variant", "Variant type: insertion", "Clinical significance: Pathogenic".
In Boolean terms, that selects variants that have ("Variant type: single nucleotide variant" OR "Variant type: insertion") AND "Clinical significance: Pathogenic". For example, http://www.ncbi.nlm.nih.gov/variation/view/?filters=vartype:single-nucleotide-variant,insertion+clinsig:pathogenic#data
My recommendation: get the simple case of applying filters within a facet working (e.g. "Molecular consequence: missense" OR "Molecular consequence: nonsense"), then enhance that to work with filters in other facets (e.g. add on "Clinical significance: Pathogenic" to the molcons filter selections). Example of how filters could be passed into your script: "molcons:missense,nonsense+clinsig:pathogenic".
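As a rough illustration of the OR-within-facet, AND-across-facets logic described above, here is a minimal Python sketch. The function names `parse_filters` and `matches`, and the idea of representing a variant as a dict of facet name to a set of values, are assumptions made for this example; they are not part of any script in the repository.

```python
def parse_filters(spec):
    """Parse a string like 'molcons:missense,nonsense+clinsig:pathogenic'
    into {facet: {filter, ...}}. Facets are separated by '+', filters by ','."""
    selections = {}
    for group in spec.split("+"):
        if not group:
            continue
        facet, _, values = group.partition(":")
        selections.setdefault(facet, set()).update(
            value for value in values.split(",") if value
        )
    return selections


def matches(variant, selections):
    """Return True if the variant passes the selections.

    `variant` is a hypothetical dict mapping facet name -> set of values.
    Within a facet any selected filter may match (OR); every selected
    facet must have at least one match (AND).
    """
    return all(
        bool(variant.get(facet, set()) & wanted)
        for facet, wanted in selections.items()
    )


if __name__ == "__main__":
    selections = parse_filters("molcons:missense,nonsense+clinsig:pathogenic")
    variant = {"molcons": {"nonsense"}, "clinsig": {"pathogenic"}}
    print(matches(variant, selections))
```

A variant annotated as nonsense and pathogenic passes; one that is missense but benign does not, because the clinsig facet has no matching filter.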
Regarding the complete list of filters, please see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/8#issuecomment-129294038
In addition to the example linked in my previous comment, see also the mockup of the Upload page -- https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/blob/master/docs/mockup_Upload%20page.png.
Being able to view the filters in an actual UI from a web browser would be ideal, but we will be using a different instance (a copy of the machine as of tomorrow at 10 AM) shortly. I'll email the group about how to work with the new instance this week -- hopefully it will be as simple as changing the machine we SSH into.
In brief, if you can get your script working with input to a "filters" argument like "molcons:missense,nonsense+clinsig:pathogenic", using the logic described in my quoted comment above, then I think that will largely get this feature implemented.
I've reassigned this from Nicolas (@nicovbing) to Vincent (@vlaufer).
Nicolas began developing a general VCF filtering feature in an awk script, but switched to working on the "Predicted impact" script (https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/13). The commit linked from there shows a script to filter specifically for "Predicted impact".
Vincent is developing a generic VCF filtering module that will be called from Django.
Just to follow up on the current progress -- other than the filtering feature, are all the functions in the backend complete and linked to the frontend?
@dauss75, no, I think we still need to integrate the various back-end pipeline components, and I know there's still work to do in Django and the UI.
I'll try to comment here later today with a more detailed overview of what remains to be done. I should be in our Gitter chat room from 7:00 PM - 10:00 PM Eastern Time tonight if anyone wants to discuss that way.
Eric, please let me know if you need help. Regards,
Octavio Juarez-Espinosa, PhD Contractor –MSC, Inc. Scripting Developer Computational Biology Section Bioinformatics and Computational Biosciences Branch (BCBB) OCICB/OSMO/OD/NIAID/NIH
Thanks Octavio! I did some work using the Solr output last night, and will indeed likely need help making some adjustments. I will open a separate issue for that soon, and ping you from there.
@dauss75, @vlaufer, please see https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/16, which describes what I think is the largest remaining chunk of integration work needed to link the Django front-end and your pipeline backend.
Current syntax to use this script is:
python vcf_filter_v0.0.py vcf_name arg1 arg2 ... argn
where vcf_name is the base name of an input VCF file found in /home/ubuntu/segun/snakemake.testrun/results that has the suffix .annotated.vcf.
For instance,
"dummy_1" would correspond to /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.vcf and the output file would be /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.filtered.vcf
Thereafter, the arguments are the same as the checkbox inputs, and should be space-delimited.
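The argument convention above could be sketched roughly as follows. This is not the code of vcf_filter_v0.0.py; the helper names `split_args` and `vcf_paths` are invented for illustration, and only the results directory and file suffixes come from the comment above.

```python
# Directory and suffixes as stated in the comment above.
RESULTS_DIR = "/home/ubuntu/segun/snakemake.testrun/results"


def split_args(argv):
    """Split ['dummy_1', 'SNV', 'missense'] into the VCF base name
    and the list of space-delimited filter arguments."""
    if not argv:
        raise ValueError("expected: vcf_name arg1 arg2 ... argn")
    base_name, *filters = argv
    return base_name, filters


def vcf_paths(base_name, results_dir=RESULTS_DIR):
    """Derive the input and output VCF paths from the base name."""
    input_vcf = f"{results_dir}/{base_name}.annotated.vcf"
    output_vcf = f"{results_dir}/{base_name}.annotated.filtered.vcf"
    return input_vcf, output_vcf


if __name__ == "__main__":
    import sys

    base_name, filters = split_args(sys.argv[1:])
    print(vcf_paths(base_name), filters)
```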
Thanks @vlaufer.
You noted problems in getting SnpEff / Python-based VCF filtering to work in https://github.com/DCGenomics/NCBI_August_Hackathon_Push_Button_Genomics_Solution/issues/16#issuecomment-129720510. (N.B.: Unanticipated difficulty is normal in software development!) Depending on what you prefer, we do have the option of bypassing the "1.0" filtering being implemented here and shifting our focus to filtering directly in Solr.
This issue -- and our basic UI design -- were created without knowing that we'd get a Solr developer added to our team and a Solr API working. When @ohjuarez serendipitously came on board late Tuesday morning, we remained on the path set late Monday afternoon due to time constraints in an effort to develop something demo-able by Wednesday. We ended up getting basic hooks into the Solr API from the view layer by mid Wednesday, but we're not leveraging Solr to its full capacity.
The risk in bypassing this filtering and going directly to Solr is that we don't get a basic application with filtering working end-to-end -- i.e. select a few filters, upload a file, get sent to a page showing filtered results for that upload. Continuing on our current SnpEff / Python approach to filtering (even if it isn't technically ideal) would likely get us to that basic state sooner, and could also help develop knowledge of SnpEff and Python.
If we continue this filtering approach, then when we finish it we could tag a release in GitHub and perhaps image an AMI that contains the working application. People could use the application by spinning up the AMI. We could then begin the Solr approach.
Or we could bypass the working-application intermediate, and pivot to the Solr approach now. The benefit here would be that we could focus on incorporating liftover / remap from GRCh37 to GRCh38 and "predicted impact" annotation enhancements.
Given that, would you prefer to continue this approach to filtering, or shift our focus to filtering directly in Solr?
@eweitz - I think as the more experienced developer, that is your call.
My python script essentially just designed and then issued commands to bash. So, I decided just to try to rewrite it in bash, which I am doing now. There is a good possibility I will have a script working that does what we want fairly soon, but I think that we should make the decision based on what is best, not what my humble skills allow.
As such, thanks for the offer but I think I'll defer to you to decide. In the meantime, I am going to keep hacking away at this and try to get something that does what we want in either bash or python.
Vincent
@vlaufer, OK, let's spend another day working on this non-Solr approach to filtering. If that doesn't work by the end of tomorrow, let's pivot to the Solr approach.
Understood. If I can't get to it by then, I'll start on other things.
Hi Eric. I am having to work on a paper I'm writing today and will not be able to get back to work on the project. I am sorry for the hold-up. I think that ultimately the novelty of the Solr approach is substantial and interesting anyway, so maybe it's for the best.
@eweitz I realize this is a bit late, but I uploaded a working script that incorporates all the features discussed. It is vcf_filter_v0.0.sh.
Example usage:
bash vcf_filter_v0.0.sh dummy_1 SNV missense MODERATE nonsense
will construct and then execute the command:
java -jar /home/ubuntu/vlaufer/snpeff/snpEff/NCBI_August_Hackathon_Push_Button_Genomics_Solution/SnpSift.jar filter ( VC = 'SNV' ) & ( ANN[0].EFFECT has 'missense_variant' | ANN[0].EFFECT has 'nonsense_variant' ) & ( ANN[0].IMPACT has 'MODERATE' ) /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.vcf > /home/ubuntu/segun/snakemake.testrun/results/dummy_1.annotated.filtered.vcf
The first argument must be the file name, and the file name should have the suffix annotated.vcf (but that should be so after VCF annotation anyway).
The remainder of the arguments can be in any order. Because this script uses SnpEff, it can handle any number of transcripts.
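For readers who want to see how such an expression might be assembled, here is a small Python sketch; it is not the contents of vcf_filter_v0.0.sh. The `CLAUSES` table, the facet names, and the function names are assumptions for illustration, and the table covers only the four checkbox values from the example above. Clauses are grouped by facet, so the argument order does not matter.

```python
# Hypothetical lookup table: checkbox value -> (facet, SnpSift clause).
# Only the four values from the example above are included.
CLAUSES = {
    "SNV": ("vartype", "VC = 'SNV'"),
    "missense": ("molcons", "ANN[0].EFFECT has 'missense_variant'"),
    "nonsense": ("molcons", "ANN[0].EFFECT has 'nonsense_variant'"),
    "MODERATE": ("impact", "ANN[0].IMPACT has 'MODERATE'"),
}
# Assumed fixed facet order, chosen to match the example expression.
FACET_ORDER = ["vartype", "molcons", "impact"]


def build_expression(args):
    """OR clauses within a facet with '|', AND facets together with '&'."""
    grouped = {facet: [] for facet in FACET_ORDER}
    for arg in args:
        if arg not in CLAUSES:
            raise ValueError(f"unknown filter: {arg}")
        facet, clause = CLAUSES[arg]
        if clause not in grouped[facet]:
            grouped[facet].append(clause)
    groups = [
        "( " + " | ".join(clauses) + " )"
        for clauses in grouped.values()
        if clauses
    ]
    return " & ".join(groups)


def build_command(jar_path, expression, input_vcf):
    """Return an argument list; the expression stays one argument,
    so '&' and '|' are never interpreted by a shell."""
    return ["java", "-jar", jar_path, "filter", expression, input_vcf]


if __name__ == "__main__":
    print(build_expression(["SNV", "missense", "MODERATE", "nonsense"]))
```

Passing the command as a list (for example to `subprocess.run` with `stdout` redirected to the output file) avoids having to quote the expression for the shell.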
Write a script to apply the user's filter selections to an annotated VCF.
Input parameters:
Output:
@nicovbing, this is the issue we discussed.