bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0

large files #533

Open alaraints opened 3 years ago

alaraints commented 3 years ago

Hello, when analyzing large input files (20-60 GB), Diamond invariably crashes at the end, I suppose when it is time to write the results. I am running with 60 GB of memory and block size 6, so there is plenty of memory most of the time. Is there a way to make Diamond write output sequentially? Or is there some other problem, such as a maximum input file size? Best regards,

Alar Aints

bbuchfink commented 3 years ago

How many sequences do your files contain?

alaraints commented 3 years ago

100 Million - 200 Million. Trying to run BlastX against Refseq

Greetings,

Alar


bbuchfink commented 3 years ago

You may want to try a smaller block size. Otherwise, I'm not sure why this might crash and need to run some tests, but this may take time.

alaraints commented 2 years ago

Hello Dr Buchfink,

the error persists. I have upgraded Conda and Diamond, reduced the block size to 5 and input file size to 10 million reads, but still the program crashes. The last two lines of the report file are always:

Opening temporary output file... [0.101s]
Computing alignments... /var/spool/slurm/slurmd/job24517243/slurm_script: line 516: 174051 Bus error

Any suggestions?

Best regards,

Alar Aints


alaraints commented 2 years ago

Hello again - I ran the script again with the same Diamond parameters, but with a small file of 60 k reads. This time it worked and took 21 GB of memory. It appears to me that the program is trying to allocate memory based on the file size, not the block size, for computing alignments. (Should it even compute alignments when --outfmt 6 is specified?) Best regards,

Alar Aints.


bbuchfink commented 2 years ago

The memory use should only depend on block size, not file size. You can try to further reduce the block size. Another option to reduce memory use is --bin, for example you can try --bin 64.
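A minimal sketch of how that suggestion might be applied, assuming a blastx run with tabular output as described earlier in the thread. The file names (`reads.fasta`, `refseq`, `matches.tsv`) and the block size of 2.0 are placeholders chosen for illustration; only `-b` and `--bin 64` come from the advice above. The script prints the command for review and runs it only when `RUN=1` is set.

```shell
#!/bin/sh
set -eu

# Placeholder values (assumptions for illustration, not taken from the thread).
QUERY="${QUERY:-reads.fasta}"
DB="${DB:-refseq}"
OUT="${OUT:-matches.tsv}"
BLOCK_SIZE="${BLOCK_SIZE:-2.0}"   # assumed value, lower than the 5 and 6 already tried
BINS="${BINS:-64}"                # value suggested in the comment above

# Assemble the argument list in the positional parameters (POSIX-safe).
set -- diamond blastx \
    --query "$QUERY" --db "$DB" --out "$OUT" \
    -b "$BLOCK_SIZE" --bin "$BINS" \
    --outfmt 6

# Print the command for review; execute it only when RUN=1.
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```

Overriding `BLOCK_SIZE` or `BINS` in the environment makes it easy to try several settings while watching the job's peak memory.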

alaraints commented 2 years ago

Hello, thank you for the reply - however, it is quite evident that the memory use depends on the file size. I have now successfully processed some files and monitored the SLURM performance using Kibana. The first set of peaks corresponds to the first file (654 MB, 3 393 711 contigs); the last three peaks correspond to the second file (557 MB, 2 854 680 contigs). Memory use is shown as % of 60 GB. The peaks correspond to computing alignments. For the first file, the peak use is 74% (44.4 GB) and the plateau is 36% (21.6 GB). For the second file, the peak use is 66% (40 GB) and the plateau is 35% (21 GB). Script:

#SBATCH --cpus-per-task=6
#SBATCH --mem=60000

Command line: diamond blastx --query contigs.fasta --db refseq --out DX_Match.txt --unal 0 --min-orf 1 --un LO.txt -b 5.0 --sensitive -k 1 --threads 6 --evalue 0.01 --max-hsps 2 --outfmt 6

Best regards,

Alar Aints.


bbuchfink commented 2 years ago

Running blastx on contigs is a different story. Unfortunately, the current implementation can't handle very long queries well. Using the frameshift mode (-F 15) should work better in these cases.
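As a sketch only, the batch script reported earlier in this thread could be rerun with `-F 15` appended. Every other option is copied from that command line; whether each of them interacts well with frameshift mode is not confirmed in this thread and is an assumption here. The script prints the command and runs it only when `RUN=1` is set.

```shell
#!/bin/sh
#SBATCH --cpus-per-task=6
#SBATCH --mem=60000
set -eu

# Options copied from the command line reported in this thread;
# -F 15 (frameshift mode) is the only addition.
set -- diamond blastx \
    --query contigs.fasta --db refseq --out DX_Match.txt \
    --unal 0 --min-orf 1 --un LO.txt \
    -b 5.0 --sensitive -k 1 --threads 6 \
    --evalue 0.01 --max-hsps 2 --outfmt 6 \
    -F 15

# Print the command for review; execute it only when RUN=1.
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```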