Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using an ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/

Job runs out of memory in cluster #32

Closed: iagooteroc closed this issue 2 years ago

iagooteroc commented 3 years ago

Hello. I used this tool with Illumina data without problems, but in my attempts to use Nanopore sequencing data I keep running out of memory, even when I request large amounts. Could it be that the program tries to use the full memory of the cluster node? For example, with this command:

${VERIFY_BAM_ID_HOME}/bin/VerifyBamID \
  --Epsilon 1e-12 \
  --SVDPrefix ${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.10k.b38.vcf.gz.dat \
  --Reference ${REF} \
  --BamFile ${BAM} \
  --Output ${OUT}

And with a BAM file of 142G, I requested 350G of memory. The job ran out of memory, so I also tried splitting the BAM and using only chr19 (2.2G). I requested 35G, but that ran out too, and I am now waiting for the results of the same job with 130G. Is this normal?
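
In case it's relevant, the chr19 subset was made roughly like this (a sketch; it assumes the BAM is coordinate-sorted and indexed, and that the contig is named chr19 in the reference):

samtools view -b ${BAM} chr19 > chr19.bam
samtools index chr19.bam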

Thank you.

Griffan commented 3 years ago

@iagooteroc No, this is abnormal. Could you give me more details about your dataset? Could you try specifying --OutputPileup to write a pileup-format file for debugging, and also post your log?

iagooteroc commented 3 years ago

@Griffan It is from the NA12878 cell line. We got the fast5 files, basecalled them, and aligned with minimap2 against hg38. I added --OutputPileup to the last job I submitted; I'll let you know when it finishes. Thank you.

iagooteroc commented 3 years ago

Hi again. I specified the --PileupFile filename and --OutputPileup options, but I don't see the pileup file anywhere. Am I doing something wrong? Here is the error log from the SLURM job: Scr_verifyBamID2.7329314.err.txt Thank you.

Griffan commented 3 years ago

@iagooteroc What is the sequencing depth of your BAM? Could you test on a very small BAM file, e.g. a handful of reads, so that we can tell whether the problem is the read length or the depth?

"--PileupFile filename" gives your the option to start from pileup format files.

iagooteroc commented 3 years ago

Hello again. I tried with a different ONT file of 20GB, which I guess is about 6X coverage, and it again ran out of memory. I subsampled the file to 10% of reads (2GB) and it still ran out of memory. Lastly, I subsampled to 0.1% of reads (200MB, ~13k reads) and it finished, but with "insufficient available markers". This is the pileup file: LB771-PBL_subsample0.001.Pileup.txt Do you want me to share any of the .bam files so you can test it yourself? Thank you.
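
For reference, the subsampling was done roughly as follows (a sketch; the integer before the dot is just a random seed):

# seed.fraction syntax: seed 42, keep 10% of the reads
samtools view -b -s 42.1 ${BAM} > LB771-PBL_subsample0.1.bam
samtools index LB771-PBL_subsample0.1.bam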

Griffan commented 3 years ago

@iagooteroc Yes, please. An actual BAM file would help me debug. Thanks.

iagooteroc commented 3 years ago

@Griffan This is the BAM subsampled to 10% of reads (2GB), aligned to hg19: LB771-PBL_subsample0.1.bam LB771-PBL_subsample0.1.bam.bai

Griffan commented 2 years ago

@iagooteroc I have updated the related code, and the fix passed my local tests. Please let me know if anything looks abnormal on your side.

iagooteroc commented 2 years ago

@Griffan Hi again. Sadly it keeps failing on the cluster I'm using. These are the last lines of the output:

NOTICE - Process 1:117326904-117326904...
NOTICE - Process 1:117541054-117541054...
NOTICE - Process 1:117818497-117818497...
/var/log/slurm/spool_slurmd/job7449270/slurm_script: line 34: 21561 Killed ${VERIFY_BAM_ID_HOME}/bin/VerifyBamID --Epsilon 1e-12 --SVDPrefix ${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.10k.b37.vcf.gz.dat --Reference ${REF} --BamFile ${BAM} --Output ${OUT} --OutputPileup
slurmstepd: error: Exceeded step memory limit at some point.

This is the script I'm using to launch the SLURM job with the subsampled BAM I shared with you: Scr_verifyBamID2.sh.txt Thank you for your work.
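
For context, the script is roughly of this shape (a hypothetical reconstruction from the log above; the actual paths and environment setup differ):

#!/bin/bash
#SBATCH --mem=35G
# ${VERIFY_BAM_ID_HOME}, ${REF}, ${BAM} and ${OUT} are set elsewhere in the real script
${VERIFY_BAM_ID_HOME}/bin/VerifyBamID \
  --Epsilon 1e-12 \
  --SVDPrefix ${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.10k.b37.vcf.gz.dat \
  --Reference ${REF} \
  --BamFile ${BAM} \
  --Output ${OUT} \
  --OutputPileup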

Griffan commented 2 years ago

@iagooteroc I noticed that in your original script you had both "--PileupFile ${OUT}.pileup" and "--OutputPileup", which shouldn't be present at the same time, because --PileupFile ${OUT}.pileup acts as an input argument for the case where the BAM file is not available. For your case there is a simple workaround: replace --BamFile ${BAM} with --PileupFile ${BAM}.pileup, where ${BAM}.pileup can be generated with samtools and a named Unix pipe.
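
For illustration, a minimal sketch of that named-pipe workaround (the mpileup options are assumptions; adjust them to whatever VB2 expects):

mkfifo ${BAM}.pileup
samtools mpileup -f ${REF} ${BAM} > ${BAM}.pileup &
${VERIFY_BAM_ID_HOME}/bin/VerifyBamID \
  --SVDPrefix ${VERIFY_BAM_ID_HOME}/resource/1000g.phase3.10k.b37.vcf.gz.dat \
  --Reference ${REF} \
  --PileupFile ${BAM}.pileup \
  --Output ${OUT}
rm ${BAM}.pileup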

Meanwhile, could you share the maximum memory usage you observed when testing the LB771-PBL_subsample0.1.bam file? I ran this file to completion on my laptop, so it shouldn't crash on a cluster.

iagooteroc commented 2 years ago

@Griffan Thank you. I noticed that and removed the --PileupFile option, keeping --OutputPileup. The process is killed for using more memory than reserved, before it saves the pileup. I am requesting 35GB for the job, and the cluster node has ~122GB in total.

iagooteroc commented 2 years ago

@Griffan Hi again. Any news on your end? Could you include your last commit in the Docker image? I'm trying it, but I get [Warning::SimplePileup] initialization fail to parse region with Nanopore files. Thank you.

Griffan commented 2 years ago

@iagooteroc I wonder whether the master branch has resolved this issue.

iagooteroc commented 2 years ago

@Griffan Do you mean the one from statgen? It hasn't been updated since this repo started. Anyway, I had no luck with Docker either; it also runs out of memory. I will try to run it on one of our own computers next week.

Griffan commented 2 years ago

@iagooteroc No, I mean this repo's master branch. Could you also check the version of your htslib?


Sorry, I found the htslib version in your script.


Another twist: since samtools 1.10, htslib has been shipped as a separate package, so it still makes sense to check the htslib on your machine. Thanks!

iagooteroc commented 2 years ago

@Griffan I tried your suggestion of generating the pileup file with samtools and found out that it is known to use excessive CPU and memory with Nanopore data. I read that this can be avoided by disabling BAQ computation with the -B option. With that I was able to obtain the pileup file and run VerifyBamID correctly. Does VerifyBamID need the BAQ data? If not, this issue is resolved. Thank you.
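
For the record, the pileup generation that finally worked looked roughly like this (paths are placeholders):

# -B disables BAQ recomputation, which is what exhausts CPU/memory on long Nanopore reads
samtools mpileup -B -f ${REF} ${BAM} > ${BAM}.pileup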

Griffan commented 2 years ago

@iagooteroc I'm glad the workaround worked.

Yes, I also noticed that BAQ was the cause of this behavior.
After this PR: https://github.com/samtools/bcftools/pull/1474, there are actually three states for the -B option, and VB2 follows that default setting, calculating the BAQ value when it is not present in the BAM file (perhaps another workaround here). Back to the Nanopore dataset: this also seems to be an ongoing open issue on samtools: https://github.com/samtools/bcftools/issues/1584. I'd suggest following the recommendations there for the moment.

I will close this issue for now, and please feel free to reopen it if you still have questions.