FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0

Please return an error when running out of memory #15

Closed: patrickvdb closed this issue 6 years ago

patrickvdb commented 6 years ago

In SNPsplit_genome_preparation, the program stopped after the line "Now reading in and storing sequence information of the genome specified in: /data/data/ref_genome/". It said "complete", but I did not obtain the files I wanted. After I increased the available memory, the program performed as intended. Please return an error if the genome can't be loaded into memory.

FelixKrueger commented 6 years ago

Hi @patrickvdb,

I'm not quite sure what the error here was, to be honest. Reading the genome into memory doesn't actually finish with a message called complete. Also, if you run the script without having enough system resources in the first place, the process will get killed by the operating system, in which case nothing can be captured or reported. If you specified insufficient memory, e.g. for a qsub process, then the process will get killed by the queuing system, but you should then be able to see this in the error log of the GridEngine job. Does that make sense?
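For reference, a quick way to tell whether the operating system killed the process is to look at the exit code and the kernel log right after the run dies. This is only a sketch, assuming a Linux host where the kernel log is readable; on a cluster node you may only have access to the scheduler's own error log.

echo $?                                            # 137 = 128 + SIGKILL(9), typical of an out-of-memory kill
dmesg | grep -i -E 'killed process|out of memory'  # kernel OOM-killer messages, if any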

patrickvdb commented 6 years ago

I was running it locally using Docker. After increasing the memory, the program got further than it did initially, but it still does not finish entirely, so I will try it soon without Docker and see if that solves it.
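As a side note, when running inside Docker the per-container memory ceiling can be raised explicitly on the command line; this is only a sketch with a hypothetical image name, and on Docker Desktop for Mac the VM memory configured in the application's preferences is an additional hard upper bound for all containers.

# hypothetical image tag; --memory raises the per-container limit to 32 GB
docker run --rm -it --memory=32g -v /data/ref_genome:/data/ref_genome snpsplit:latest bash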

FelixKrueger commented 6 years ago

Alright, let me know how you get on. Felix

patrickvdb commented 6 years ago

My apologies. It has now finished successfully after running directly on the Mac. I had been using Docker because support for Mac was not explicitly mentioned, but running it natively worked fine.

FelixKrueger commented 6 years ago

Excellent, I'm glad it worked! All the best for the subsequent steps!

patrickvdb commented 6 years ago

Initially, the attempt I mentioned looked like it had worked because I got FASTA files; the all_SNPs file, however, was empty. I switched environments again and ran it on a cluster with 16 GB of memory (the same amount as my MacBook has), and the job was killed with the following message:

...
Processing chromosome 5 (for strain 129S1_SvImJ)
Reading SNPs from file /data/scratch/166748.gb-ce-lumc.lumc.nl/private/out/SNPs_129S1_SvImJ/chr5.txt
Writing modified chromosome (N-masking)
Writing N-masked output to: /data/scratch/166748.gb-ce-lumc.lumc.nl/private/out/129S1_SvImJ_N-masked/chr5.N-masked.fa
Writing modified chromosome (incorporating SNPs)
Writing full sequence output to: /data/scratch/166748.gb-ce-lumc.lumc.nl/private/out/129S1_SvImJ_full_sequence/chr5.SNPs_introduced.fa
250387 SNPs total for chromosome 5
250387 positions on chromosome 5 were changed to 'N'
250387 reference positions on chromosome 5 were changed to the SNP alternative base

Summary
5173413 Ns were newly introduced into the N-masked genome for strain 2 [129S1_SvImJ] in total
5173413 SNPs were newly introduced into the full sequence genome version for strain 2 [129S1_SvImJ] in total

Determining new Ref [CAST_EiJ] and SNP [129S1_SvImJ] annotations
============================================================

Writing CAST_EiJ specific SNPs (relative to the GRCm38 reference) to >>CAST_EiJ_specific_SNPs.GRCm38.txt<<
Writing 129S1_SvImJ specific SNPs (relative to the GRCm38 reference) to >>129S1_SvImJ_specific_SNPs.GRCm38.txt<<
Writing SNPs in common between CAST_EiJ and 129S1_SvImJ (relative to the GRCm38 reference) to >>CAST_EiJ_129S1_SvImJ_SNPs_in_common.GRCm38.txt<<
Writing all new SNPs >>CAST_EiJ/129S1_SvImJ to >>all_129S1_SvImJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt<<

Storing SNP positions for strain CAST_EiJ provided in 'all_SNPs_CAST_EiJ_GRCm38.txt.gz'
=>> PBS: job killed: mem job total 17700684 kb exceeded limit 16777216 kb
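For anyone hitting the same limit: the ceiling comes from the job's resource request, not from SNPsplit itself, so the fix is to ask the scheduler for more memory. A sketch of a PBS/Torque-style request; the exact directive depends on the local scheduler setup.

#PBS -l mem=35gb    # request 35 GB for the job; the 16 GB limit above was exceeded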

I then increased the memory to 35 GB and it finished (almost) flawlessly:

Processing chr10 (to create new genome for CAST_EiJ/129S1_SvImJ)
Writing modified chromosome (N-masking)
Writing N-masked output to: /scratch/166800.gb-ce-lumc.lumc.nl/private/out/CAST_EiJ_129S1_SvImJ_dual_hybrid.based_on_GRCm38_N-masked/chr10.N-masked.fa
Writing modified chromosome (incorporating SNPs)
Writing full sequence output to: /scratch/166800.gb-ce-lumc.lumc.nl/private/out/CAST_EiJ_129S1_SvImJ_dual_hybrid.based_on_GRCm38_full_sequence/chr10.SNPs_introduced.fa
1146467 SNPs total for chromosome 10
1146467 positions on chromosome 10 were changed to 'N'
1146467 reference positions on chromosome 10 were changed to the SNP alternative base

Summary
20563466 Ns were newly introduced into the N-masked genome for strain/strain 2 [CAST_EiJ/129S1_SvImJ] in total
20563466 SNPs were newly introduced into the full sequence genome version for strainstrain 2 [CAST_EiJ/129S1_SvImJ] in total

All done. Genome(s) are now ready to be indexed with your favourite aligner!
FYI, aligners shown to work with SNPsplit are Bowtie2, Tophat, STAR, Hisat2, HiCUP and Bismark (STAR and Hisat2 require disabling soft-clipping, please check the SNPsplit manual for details)

gzip: stdout: Broken pipe

gzip: stdout: Broken pipe

The file all_129S1_SvImJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt was not compressed, but otherwise seems to be correct.
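If the contents look intact, compressing the file after the fact should give the same result the script was attempting before the broken pipe; a simple manual workaround rather than an official fix.

gzip all_129S1_SvImJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt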

Why did the job run out of memory? Is the software really using 16 GB? Maybe the manual should have a system requirements section.

FelixKrueger commented 6 years ago

Hi @patrickvdb,

Yes, the job may indeed take quite a bit of memory, especially if you choose strains with lots of SNPs relative to the Black6 reference, such as CAST, and even more so in dual hybrid mode, when it holds several copies of the genome in memory. I am sure the memory handling could be improved; I am probably a little spoiled, since anything we work on comes with at least 64 GB of RAM...
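To put a number on the peak usage for a given genome/strain combination, the run can be wrapped in GNU time, which reports the maximum resident set size. This is a sketch assuming Linux with GNU time installed; the VCF name and genome path are placeholders, and the options follow the dual hybrid invocation described in the SNPsplit manual.

/usr/bin/time -v SNPsplit_genome_preparation --vcf_file mgp_SNPs.vcf.gz --reference_genome /path/to/GRCm38/ --strain CAST_EiJ --strain2 129S1_SvImJ --dual_hybrid
# look for 'Maximum resident set size (kbytes)' in the report printed at the end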