brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0

smoove duphold error during merge #183

Closed jjfarrell closed 2 years ago

jjfarrell commented 2 years ago

When running the following command on 22,847 crams, I am getting a "could not load index" error from bcftools. See the output below.

cat cram_22847.list | xargs -s 2080845 smoove duphold \
    --processes $NSLOTS \
    --fasta /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --vcf pVCF.ont/ont.graphtyper.chr$N.vcf.gz \
    --outvcf pVCF.ont/ont.graphtyper.chr$N.duphold.vcf.gz

This is the beginning of the output:

[smoove] 2022/01/09 00:08:14 starting with version 0.2.6
[smoove] 2022/01/09 00:08:14 running duphold on 22847 files in 32 processes
[smoove] 2022/01/09 00:09:24 [duphold] finished
[smoove] 2022/01/09 00:09:25 [duphold] finished

This is the error message at the end:

tail smoove_duphold_ont.sh.o1742754

[smoove] 2022/01/09 05:09:00 [duphold] finished
[smoove] 2022/01/09 05:09:01 [duphold] finished
[smoove] 2022/01/09 05:09:02 [duphold] finished
[smoove] 2022/01/09 05:09:02 [duphold] finished
[smoove] 2022/01/09 05:09:03 [duphold] finished
[smoove] 2022/01/09 05:09:04 [duphold] finished
[smoove] 2022/01/09 05:09:04 starting bcftools merge
[smoove] 2022/01/09 11:37:48 [E::hts_idx_load3] Could not load local index file 'tmp.22/tempclean-606560922/dh009936222smoove-duphold.bcf.csi'
Failed to open tmp.22/tempclean-606560922/dh009936222smoove-duphold.bcf: could not load index
2022/01/09 11:38:53 exit status 255

The temp files are cleaned up on exit, so I am not able to look at the bcf files to see what may be happening.

Any suggestions for this?

brentp commented 2 years ago

can you run duphold on just that sample and see what happens? with this many samples, it might be better to do the parallelization yourself rather than use smoove duphold which relies on a single machine (maybe that's what you're doing, in part, with xargs, but I can't tell).
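for reference, a per-sample sketch along those lines (this assumes duphold's -v/-b/-f/-o flags as documented in its README; the chr1 VCF, the duphold/ output directory, and the job count are placeholders):

mkdir -p duphold
# one duphold job per cram, 8 at a time; each job writes and indexes its own BCF
< cram_22847.list xargs -P 8 -I{} bash -c '
  sample=$(basename "$1" .cram)
  duphold -v pVCF.ont/ont.graphtyper.chr1.vcf.gz \
          -b "$1" \
          -f /restricted/projectnb/casa/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa \
          -o duphold/"$sample".bcf &&
  bcftools index duphold/"$sample".bcf
' _ {}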

jjfarrell commented 2 years ago

I am not sure how to determine which cram/sample corresponds to the bcf file, dh009936222smoove-duphold.bcf, since it gets erased on exit. I am running smoove duphold on one compute node with 196 GB and 32 cores for each chr. It runs surprisingly fast. I am guessing it is running out of memory on the merge. If the files were not deleted, I could play with getting the merge going. Is there a debug option to not delete the temp bcf files?

jjfarrell commented 2 years ago

I did find a snapshot of the temp bcf directory that contained the bcf that failed to load the csi. The bcf looked fine and was indexed. I was able to merge it with another bcf. So I think this is likely a bcftools merge issue.
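Something like the following (exact commands reconstructed from memory; other-sample.bcf is a placeholder for the second temp BCF):

# re-index the recovered temp BCF, then try a pairwise merge by hand
bcftools index -f dh009936222smoove-duphold.bcf
bcftools index -f other-sample.bcf
bcftools merge -Ob -o pair.bcf dh009936222smoove-duphold.bcf other-sample.bcf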

brentp commented 2 years ago

yeah. bcftools merge has problems with many SV VCFs. I've seen that in smoove square as well. I think it has trouble when a single huge (likely spurious) event spans the entire chromosome; I guess it keeps that event in memory, along with all the variants it contains.

jjfarrell commented 2 years ago

The issue is the number of open files allowed on Linux. The default on our cluster is 1024:

ulimit -n
1024

This can be increased to 2048, but on our system we cannot raise it any higher:

 ulimit -n 30000
-bash: ulimit: open files: cannot modify limit: Operation not permitted
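For context: a non-root shell can raise its soft limit only up to the hard limit, so it is worth checking both before concluding the ceiling is 2048:

ulimit -Sn                 # current soft limit (what processes actually get)
ulimit -Hn                 # hard limit: the ceiling a non-root user may raise to
ulimit -n "$(ulimit -Hn)"  # raise the soft limit up to that ceiling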

A workaround posted on Biostars uses the following approach:

# split the list of VCFs into chunks of 500, well under the open-file limit
ls *.vcf.gz | split -l 500 - subset_vcfs

# merge each chunk into an intermediate VCF and index it
for i in subset_vcfs*; do
    bcftools merge -0 -l $i -Oz -o merge.$i.vcf.gz
    tabix -p vcf merge.$i.vcf.gz
done

# merge the intermediates into the final VCF
ls merge.*.vcf.gz > merge.txt
bcftools merge -l merge.txt -0 -Oz -o all_merged.vcf.gz
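(The -0 in these merges is bcftools merge's --missing-to-ref option: sites absent from one chunk come through as 0/0 rather than as missing genotypes, since the second-level merge cannot tell an untyped sample from a homozygous-reference one.)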

The list of files to merge is split into subsets smaller than the ulimit -n value, each subset is merged, and the intermediate merges are then merged into the final VCF. Is this approach something that could be implemented in smoove duphold for the merge step?

jjfarrell commented 2 years ago

I have put in a request to our IT support to increase the maximum soft limit for open files. Hopefully that will solve the issue at our site.

If the hierarchical merge above is not feasible, it would be great if the number of files passed to smoove duphold could be checked against ulimit -n on startup. If the number of sample files is too large, the job could stop with an error message telling the user to increase ulimit -n to at least N files. Right now the tool will run duphold on 20,000+ samples successfully and only then terminate at the merge step.
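In the meantime, a minimal pre-flight sketch for the submission script (the 100-file headroom is a guess, to cover indexes, the reference, and temp files):

nfiles=$(wc -l < cram_22847.list)
limit=$(ulimit -n)
# the merge opens every per-sample BCF at once, so the soft limit must
# cover the sample count plus some headroom for indexes and temp files
if [ "$limit" -lt $((nfiles + 100)) ]; then
    echo "ERROR: ulimit -n is $limit but the merge needs >$nfiles open files;" >&2
    echo "       raise the soft limit or merge in batches" >&2
    exit 1
fi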

brentp commented 2 years ago

I am surprised that you haven't hit this limit before! The best way would be to run duphold for each sample, then use svtools merge to merge the VCFs.

jjfarrell commented 2 years ago

Actually, I found the max was set to 20K for batch queue jobs, which is why I hit the error this time (22,847 samples needs more than 20K open files). It was set to the lower number on the login node and when sshing into a node. The higher ulimit max is now working fine with smoove. Still, an earlier check and a better error message would be helpful.

brentp commented 2 years ago

thanks for following up. I plan to revisit this merging step at some point.