Estimated resource requirement

mbhall88 commented 2 weeks ago

As part of https://github.com/openjournals/joss-reviews/issues/6850 I have been trying to run primerForge using MTB as the ingroup and M. smegmatis as the outgroup:

primerForge -i h37rv.fa -u MSmeg.fa -f fasta -n8

So far I have been unable to get this to complete because it keeps hitting my job's memory limit. The last run I tried was allocated 160GB and the job failed with an out-of-memory error. This seems very high. Is this expected? If so, it should probably be documented somewhere. Or am I doing something wrong here? There isn't an example usage (see #2), so I am just basing my execution on the CLI help menu.

dr-joe-wirth commented 2 weeks ago

Can you share the input files you are using? That is unexpected behavior.

mbhall88 commented 2 weeks ago

This is the MTB genome https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3 and this is the M. smegmatis genome https://www.ncbi.nlm.nih.gov/nuccore/NC_008596.1

I am using the bioconda installation of primerForge

dr-joe-wirth commented 2 weeks ago

Can you please rerun using the --debug flag and then share the .log file with me? Also, how do you determine how much memory was being allocated?

mbhall88 commented 2 weeks ago

Okay, running now. Will share the log when it's done.

Well, I requested 160GB of memory from Slurm and the job then failed with 'out of memory'.
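For anyone who wants to measure peak memory without relying on the scheduler, here is a minimal sketch using Python's standard-library resource module. The helper name peak_child_rss_kib is hypothetical and is not part of primerForge; the sketch assumes a Unix-like system (the resource module is not available on Windows).

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(cmd):
    """Run cmd to completion and return the peak RSS (in KiB) of child processes.

    Caveats: RUSAGE_CHILDREN covers every child this interpreter has waited
    for so far, and ru_maxrss is the RSS of the largest single child, not the
    sum over a process tree, so multi-process tools may be underestimated.
    """
    subprocess.run(cmd, check=False)  # blocks until the child has exited
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 if sys.platform == "darwin" else 1
    return usage.ru_maxrss // divisor


if __name__ == "__main__":
    # Harmless demonstration: measure a trivial Python child process.
    print(peak_child_rss_kib([sys.executable, "-c", "pass"]), "KiB")
```

To wrap the command from this thread, the argument list would be ["primerForge", "-i", "h37rv.fa", "-u", "MSmeg.fa", "-f", "fasta", "-n8"]. Because of the largest-single-child caveat, the scheduler's own accounting remains the better figure for a multi-threaded job.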

mbhall88 commented 2 weeks ago

Just failed with 200GB (max RSS 208658340KB)
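To put the max RSS figure above next to the 200GB request, a short unit conversion helps. This sketch assumes that Slurm's "KB" and "GB" are binary units (multiples of 1024), which is an assumption about the scheduler's reporting and is not stated in this thread.

```python
KIB_PER_GIB = 1024 ** 2

max_rss_kib = 208_658_340   # max RSS reported above
requested_gib = 200         # memory requested for the job

# Convert the reported peak to GiB and the request to KiB for comparison.
max_rss_gib = max_rss_kib / KIB_PER_GIB
requested_kib = requested_gib * KIB_PER_GIB
fraction_used = max_rss_kib / requested_kib

print(f"peak RSS: {max_rss_gib:.2f} GiB")
print(f"fraction of request: {fraction_used:.1%}")
```

Under that assumption the peak is roughly 199 GiB, i.e. the job ran right up to the limit before being killed, so the true requirement of that version was likely higher still.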

INFO:bin.main:version:                          1.1.1
INFO:bin.main:ingroup:                          /data/scratch/projects/punim1703/WGA/data/references/h37rv.fa
INFO:bin.main:outgroup:                         /data/scratch/projects/punim1703/WGA/data/references/MSmeg.fa
INFO:bin.main:results filename:                 /data/scratch/projects/punim1703/tbvcf/tmp/primerForge/results.tsv
INFO:bin.main:file format:                      fasta
INFO:bin.main:min kmer len:                     16
INFO:bin.main:max kmer len:                     20
INFO:bin.main:min % G+C                         40.0
INFO:bin.main:max % G+C                         60.0
INFO:bin.main:min Tm:                           55.0
INFO:bin.main:max Tm:                           68.0
INFO:bin.main:max Tm difference:                5.0
INFO:bin.main:min PCR size:                     120
INFO:bin.main:max PCR size:                     2400
INFO:bin.main:disallowed outgroup PCR sizes:    120-2400
INFO:bin.main:num threads:                      8
INFO:__getCandidates:identifying kmers suitable for use as primers in all 1 ingroup genome sequences
INFO:_getAllCandidateKmers:    getting shared ingroup kmers that appear once in each genome
DEBUG:__getSharedKmers:        18539592 shared kmers after processing h37rv.fa
INFO:_getAllCandidateKmers:    done 00:06:54.69
INFO:_getAllCandidateKmers:dumping shared kmers to '_pickles/sharedKmers.p'
INFO:_getAllCandidateKmers:done 00:01:09.11
INFO:_getAllCandidateKmers:    evaluating kmers
INFO:_getAllCandidateKmers:    done 00:02:14.37
INFO:_getAllCandidateKmers:    identified 843387 candidate kmers
INFO:__getCandidates:done 00:10:20.55
INFO:__getCandidates:dumping candidate kmers to '_pickles/candidates.p'
INFO:__getCandidates:done 00:00:04.04
INFO:__getUnfilteredPairs:identifying pairs of primers found in all ingroup sequences
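The log stops at the pair-identification stage. As a rough back-of-envelope illustration of why that stage could be large, the sketch below estimates how many candidate pairs fall within the allowed PCR size range. It is not primerForge's algorithm: it assumes candidates are spread uniformly along the genome and ignores strand, Tm-difference, and any other filters. The genome length is taken from the NCBI record for NC_000962.3, not from this thread, and the function name estimate_pairs is hypothetical.

```python
# Figures taken from the log above.
candidate_kmers = 843_387
min_product, max_product = 120, 2400

# Assumption: length of NC_000962.3 (bp) from its NCBI record.
genome_length = 4_411_532


def estimate_pairs(n_candidates, genome_len, min_size, max_size):
    """Crude estimate of candidate pairs within the product-size window.

    Assumes uniformly spaced candidates; each candidate is paired only with
    candidates downstream of it, so every pair is counted once.
    """
    density = n_candidates / genome_len      # candidates per bp
    window = max_size - min_size + 1         # allowed partner offsets
    partners_per_candidate = density * window
    return n_candidates * partners_per_candidate


estimate = estimate_pairs(candidate_kmers, genome_length, min_product, max_product)
print(f"~{estimate:.1e} candidate pairs before filtering")
```

Under these assumptions the estimate is a few hundred million pairs, which may help explain why this stage is memory-hungry; the thread itself does not confirm the cause.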

dr-joe-wirth commented 8 hours ago

I believe I have resolved this issue. I have improved performance in many places. When I ran it with your inputs, I recorded it using <15GB of RAM. I am going to close this issue; please feel free to report back if I need to reopen it.

dr-joe-wirth commented 7 hours ago

Note: the conda installation and Docker image are not yet available for the updated version. Please use the manual installation or the pip installation for the time being.

mbhall88 commented 4 hours ago

Great. I can confirm this completed with 8 threads in 13 mins and ~23GB of memory.

dr-joe-wirth / primerForge

Estimated resource requirement #23