issues running gtotree on cluster

adamsorbie commented 4 years ago

Hey,

I'm trying to run gtotree on a Linux cluster but I'm getting some strange errors. For the majority of genomes, when searching for the target genes, the operation is aborted and I get the following error:

Fatal exception (source file esl_threads.c, line 128):
thread creation failed
/dss/dsshome1/lxc00/ga92yuh2/.conda/envs/gtotree/bin/gtt-ncbi-parallel.sh: line 235: 83395 Aborted                 (core dumped) hmmsearch --cut_ga --cpu $num_cpus --tblout ${tmp_dir}/${assembly}_curr_hmm_hits.tmp $hmm_file ${tmp_dir}/${assembly}_genes.tmp > /dev/null

The above is from the job output; the log file of gtotree looks mostly fine. I do get this message though, which I suppose is actually due to the above error rather than these genomes actually having too few hits:

8030 genome(s) removed from analysis due to having too few hits.

        Reported in "ohyA_bacterial_tree/run_files/Genomes_removed_for_too_few_hits.tsv".

I'm not 100% sure what could be causing this error, but my first guess would be that it's running out of memory for genomes which are a bit larger. Right now I've requested 100 GB of RAM. On this particular cluster I have up to 2.5 TB available, but before requesting more I wanted to see if anyone had come across a similar thing before.

Thanks,

Adam

AstrobioMike commented 4 years ago

Hi there, Adam!

I unfortunately still have little experience working on clusters – and even less experience designing a program to make sure it won’t cause any problems on a cluster :/

But seeing the message about "thread creation failed", I wonder if the problem could be the combination of how GToTree is trying to run things in parallel with how the job manager is allocating resources. I know it's not ideal, but I'd be curious if you get the same type of result when running with the default of -j 1 (really not ideal with the number of genomes you are running, sorry :/). GToTree is really light on memory use (outside of what the alignment and tree steps require, which can vary a lot based on the total being done), so I'd be surprised if that were the problem at this stage of downloading and searching the genomes, unless again some crazy number of them were being done concurrently.
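A rough way to test that guess is to compare the worst-case thread count against the per-user process limit of the shell the job runs in. This is only a sketch: the helper names and the example values for -j and --cpu are hypothetical, it assumes bash on Linux (where threads count toward the `ulimit -u` limit), and a cluster may enforce other caps through the scheduler that this does not see.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of GToTree or HMMER.

# Rough worst case of hmmsearch worker threads alive at once:
# parallel jobs (-j) times threads per hmmsearch call (--cpu).
threads_needed() {
    local num_jobs=$1 num_cpus=$2
    echo $(( num_jobs * num_cpus ))
}

# Succeeds (exit status 0) if the needed threads fit under the limit.
# A limit reported as "unlimited" always fits.
fits_limit() {
    local needed=$1 limit=$2
    [ "$limit" = "unlimited" ] || [ "$needed" -le "$limit" ]
}

num_jobs=50          # assumed value passed to -j
num_cpus=4           # assumed value hmmsearch receives via --cpu
limit=$(ulimit -u)   # per-user process/thread limit in this shell

needed=$(threads_needed "$num_jobs" "$num_cpus")
if fits_limit "$needed" "$limit"; then
    echo "ok: about $needed threads requested, limit is $limit"
else
    echo "too many: about $needed threads requested, limit is $limit"
fi
```

The limit is shared with everything else the same user is running, so the real headroom is likely smaller than the number printed.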

I wish I had more to say to try to help, sorry :/ I don't even have access to a cluster I could use for testing my own things like this currently. Please keep me posted if you are able to figure any more out!

adamsorbie commented 4 years ago

I played around with it a bit on the interactive segment of our cluster and it seems you were right. Reducing the -j parameter seemed to solve the problems I was having. Luckily, it still works with -j up to 50, but any higher and I start running into problems.
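For anyone scripting this, a small sketch of clamping the requested -j to a ceiling found by trial and error. The `cap_jobs` helper is hypothetical, and the ceiling of 50 is just what worked on this cluster; it will probably differ elsewhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not part of GToTree.
# Clamp a requested -j value to an empirically found ceiling.
cap_jobs() {
    local requested=$1 ceiling=$2
    if [ "$requested" -gt "$ceiling" ]; then
        echo "$ceiling"
    else
        echo "$requested"
    fi
}

ceiling=50                            # highest -j that worked here
num_jobs=$(cap_jobs 128 "$ceiling")   # 128 is an arbitrary example request
echo "would run GToTree with -j $num_jobs"
```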

Thanks for your help Mike!

AstrobioMike / GToTree, issue #19