DavidBSauer / OGT_prediction

Scripts for calculating features and regression of prokaryote OGT
GNU General Public License v3.0

barrnap and tRNAscan processes killed #8

Closed by morgansobol 1 year ago

morgansobol commented 1 year ago

Hi again David,

I am having some issues with memory allocation... I guess? I allocated 20 GB of memory when I submitted the job, and I have ~630 genomes. It only seems to have issues with tRNAscan and barrnap, and right now it's hard to tell whether this happens for every genome or just some. Do you have any experience with these issues?

error with barrnap bacterial for GCF_002240205.1.fa with a message of sh: line 1: 1282914 Killed
error with tRNAscan for GCF_001747405.1.fa with a message of sh: line 1: 1279739 Killed
error with tRNAscan for GCF_000019165.1.fa with a message of sh: line 1: 295420 Bus error (core dumped)
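To tell whether the failures affect every genome or only some, the error lines can be tallied per genome. The following is a minimal sketch that assumes each error sits on its own line in prediction.log and follows the "error with TOOL for GENOME with a message of MESSAGE" pattern shown above; the function name `failed_genomes` is made up for illustration and is not part of OGT_prediction.

```python
import re
from collections import defaultdict

# Assumed log pattern, taken from the three error lines quoted in this issue.
ERROR_LINE = re.compile(
    r"error with (?P<tool>.+?) for (?P<genome>\S+) with a message of (?P<message>.*)"
)


def failed_genomes(log_lines):
    """Map each genome file name to a list of (tool, message) failures."""
    failures = defaultdict(list)
    for line in log_lines:
        match = ERROR_LINE.search(line)
        if match:
            failures[match["genome"]].append(
                (match["tool"], match["message"].strip())
            )
    return dict(failures)


# The three error lines reported above, one per line.
sample_log = [
    "error with barrnap bacterial for GCF_002240205.1.fa with a message of sh: line 1: 1282914 Killed",
    "error with tRNAscan for GCF_001747405.1.fa with a message of sh: line 1: 1279739 Killed",
    "error with tRNAscan for GCF_000019165.1.fa with a message of sh: line 1: 295420 Bus error (core dumped)",
]

failures = failed_genomes(sample_log)
for genome, problems in sorted(failures.items()):
    for tool, message in problems:
        print(f"{genome}\t{tool}\t{message}")
```

For a real run, the lines could come from `open("prediction.log")` instead of `sample_log`; comparing `len(failures)` with the total number of genomes shows how widespread the problem is.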

Here is the full version of the prediction.log file as it is now; it's still running after 15.5 hrs. https://www.dropbox.com/s/dxybinnbcrui3ke/prediction.log?dl=0

Thx!

DavidBSauer commented 1 year ago

Hmm... that is curious. I ran 36k genomes, so the absolute number is likely not the issue. (Though the size of your individual genomes may be larger.)

A few thoughts:

  1. Check the individual tRNAscan and barrnap result files from failed genomes in /output/genomes for anything informative in the log/output files.
  2. Explore if memory is the issue by analyzing the genomes one-at-a-time (editing feature_calculation_pipeline.py to comment out lines 120-123, and uncommenting line 126). This will obviously run slower, but would allow each genome to access all the allocated RAM.
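The idea behind point 2 can be illustrated with a small sketch. It is not the code in feature_calculation_pipeline.py (the lines referenced above are not reproduced here); `analyze_genome` and `run_all` are hypothetical names, and the worker is a placeholder for the per-genome barrnap/tRNAscan calls. The only point is the switch between running several genomes at once, where they share the allocated RAM, and running them one at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_genome(genome_file):
    """Hypothetical stand-in for the per-genome work (e.g. barrnap, tRNAscan)."""
    return genome_file, len(genome_file)


def run_all(genome_files, worker, max_parallel=1):
    """Run worker on every genome, serially or with up to max_parallel at once."""
    if max_parallel <= 1:
        # Serial mode: only one genome is in flight, so it can use all the RAM.
        return [worker(genome) for genome in genome_files]
    # Parallel mode: up to max_parallel genomes compete for the same allocation.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, genome_files))


genomes = ["GCF_002240205.1.fa", "GCF_001747405.1.fa", "GCF_000019165.1.fa"]
serial_results = run_all(genomes, analyze_genome, max_parallel=1)
parallel_results = run_all(genomes, analyze_genome, max_parallel=4)
```

Both modes return the same results in the same order; only the peak memory demand differs, since in the parallel case the concurrent jobs split whatever was allocated to the run.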

Let me know what you find and if I can be of any help!

Best, David

morgansobol commented 1 year ago

Hi David,

I could not find the log output files in output/genomes/ because the run was getting hung up and then killed entirely. So I went ahead and tried point 2, which worked for me. The prediction finished in about one hour. Yay!

But I have ~100 genomes that could not be predicted because no rRNA could be found. This is probably for the same reason as in issue #2, since some of my genomes are MAGs. How do I change the script so that it does not rely solely on 16S? Can I simply re-run it on only those genomes that did not work the first time?

DavidBSauer commented 1 year ago

Okay, so I have re-arranged the regression models so those which specifically exclude certain feature classes are neatly organized into their own subdirectory. So when doing your predictions, pick the appropriate subdirectory from data/calculations/prediction/regression_models/
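To see which model sets are available before choosing one, the subdirectories can simply be listed. This is a minimal sketch that assumes it is run from the repository root; the helper name `list_model_sets` is hypothetical, and the subdirectory names themselves are whatever the repository currently contains.

```python
from pathlib import Path


def list_model_sets(models_root):
    """Return the sorted names of the subdirectories under models_root."""
    root = Path(models_root)
    if not root.is_dir():
        # Path does not exist from the current working directory.
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


# Path as given above, relative to the repository root.
MODELS_ROOT = "data/calculations/prediction/regression_models"

for name in list_model_sets(MODELS_ROOT):
    print(name)
```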

Hopefully this helps!