asoltis / MutEnricher

Somatic coding and non-coding mutation enrichment analysis for tumor WGS data

Job won't run to completion #7

Open Samir-A opened 1 year ago

Samir-A commented 1 year ago

Hello,

I really like your tool, but I was wondering why my jobs won't run to completion. They always stop at "Performing weighted average proximity (WAP) hotspot enrichments...". Could you help me figure out why this is the case?

The commands I run are:

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_Nmasked.txt -o Cd1_Nmasked_Intron_noncoding --prefix test_global -p 5

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_snps_filtered.txt -o Cd1_snps_filtered_final_Intron_noncoding --prefix test_global -p 5

My BED file is here (I changed it to .txt form so I could upload it); the BED file was sorted and merged: Introns_merged_mm10.bed.txt

head:

chr1 3216968 3421701 1_Intron
chr1 3421901 3670551 2_Intron
chr1 4120073 4142611 3_Intron
chr1 4142766 4147811 4_Intron
chr1 4147963 4148611 5_Intron
chr1 4148744 4163854 6_Intron
chr1 4163941 4170204 7_Intron
chr1 4170404 4197533 8_Intron
chr1 4197641 4206659 9_Intron
chr1 4206837 4226610 10_Intron

My vcf.txt files are here:

/data2/samir/SNP_Hotspot/vcf/Cd1_Nmasked_final_sorted.mm10.vcf.gz Cd1_Nmasked
/data2/samir/SNP_Hotspot/vcf/Cd1_snps_filtered_final_sorted.mm10.vcf.gz Cd1_snps_filtered

I bgzipped, sorted, and indexed them with bgzip and bcftools.
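For reference, the preparation steps were roughly the following (the file name is a placeholder, not my actual file; I used bcftools for indexing, but `tabix -p vcf` produces an equivalent .tbi index):

```shell
# Placeholder input name; substitute your own VCF.
VCF=sample.vcf

# 1. Sort by chromosome, then position, keeping header lines first.
(grep '^#' "$VCF"; grep -v '^#' "$VCF" | sort -k1,1 -k2,2n) > sample.sorted.vcf

# 2. Compress with bgzip (plain gzip will NOT work with tabix indexing).
bgzip -f sample.sorted.vcf

# 3. Index in VCF preset mode (bcftools index -t does the same).
tabix -p vcf sample.sorted.vcf.gz
```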

run.txt

run2.txt

asoltis commented 1 year ago

Are you seeing an error or is the procedure just taking a "long" time?

If you disable the WAP procedure (with the --no-wap flag in the noncoding module), does it complete?

I took a look at the regions in your BED file and noticed there are some extremely long regions (e.g. > 100kb; see attached histograms). The WAP procedure is likely taking a long time on these, as it loops over the length of each region. I would suggest either a) disabling the procedure (as mentioned above) or b) running on a reduced set of regions that excludes these very long ones.

Introns_merged_mm10_lengths_hist.pdf

Introns_merged_mm10_lengths_hist_ylim1000.pdf
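For option b), a one-liner along these lines would work (the 100kb threshold and file names are just illustrative; BED columns 2 and 3 are the 0-based start and end, so their difference is the region length):

```shell
# Keep only BED regions shorter than 100 kb.
awk '($3 - $2) < 100000' Introns_merged_mm10.bed > Introns_merged_mm10.le100kb.bed
```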

asoltis commented 1 year ago

Also, how long a run time are we talking about? With the WAP procedure, it may not be the region lengths per se; it may simply be taking time given that there are >185K total regions in your BED file. If you are able, dedicating more processors to the run can also speed things up.

Samir-A commented 1 year ago

I have been running it for 2 days now, but I have some other runs with different annotation files that finished, with no hotspots, in around an hour. Thank you for the help; I will try your suggestions!

Samir-A commented 1 year ago

I couldn't find --no-wap in the documentation. Can you please explain what we might be losing if we use --no-wap? Because these are introns, after merging all the regions together I don't know another way to shorten the regions without losing a substantial number of introns.

Thank you

asoltis commented 1 year ago

You can find a brief explanation for that option in the code's help page (python mutEnricher.py noncoding), which is simply:

"Select flag to skip weighted average proximity (WAP) procedure."

You can still perform the main burden and hotspot analyses without this procedure; it is just another method for assessing clustering of mutations (the [prefix]_region_WAP_enrichments.txt output will only report the overall burden p-values, and the [prefix]_region_WAP_hotspot_Fisher_enrichments.txt output will only combine the burden and hotspot p-values).

The WAP procedure may considerably extend runtime with many long regions, as it is a permutation procedure that can run up to 1e6 iterations; it runs for a minimum of 1,000 iterations but can exit early if a stopping criterion is met. Given your run times, it seems advisable to disable this procedure. For particular regions of interest, you could run the code again with a limited BED file and the WAP procedure enabled to generate these statistics.

Samir-A commented 1 year ago

Thank you for the explanation! But I believe we found the issue to be with our VCF files: we ran a job with your example VCFs and our annotation file, and the job ran to completion. Is there a specific way the VCF files should be prepared?

asoltis commented 1 year ago

The sample VCFs were prepared from human genome alignments, so it is possible that few variants in your VCFs line up with your BED regions (you can check in the outputs); this could explain the quicker run time. How many VCFs in total are you using? If you run without the WAP procedure, do you see many more variants corresponding to the regions in your BED files?

There is no special processing needed for noncoding analyses, other than sorting and indexing with tabix.
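If helpful, one quick way to sanity-check how many of your variants fall inside your BED regions (the file names are placeholders; this is a naive scan over every region per variant, so for large files bedtools or bcftools would be much faster):

```shell
# Count VCF variants whose positions fall inside any BED region.
# BED is 0-based half-open; VCF POS is 1-based, hence the > / <= comparisons.
awk 'NR==FNR { chrom[NR]=$1; start[NR]=$2; end[NR]=$3; n=NR; next }
     /^#/ { next }
     { for (i = 1; i <= n; i++)
           if ($1 == chrom[i] && $2 > start[i] && $2 <= end[i]) { hits++; break } }
     END { print hits + 0 }' regions.bed variants.vcf
```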