RosettaCommons / trRosetta2

Repository for publicly available deep learning models developed in the Rosetta community
MIT License

pick_final_models.div.py references Robetta #5

Open martinpacesa opened 3 years ago

martinpacesa commented 3 years ago

Hello,

Fantastic work and thank you so much for releasing this to the public! I have encountered several issues with running trRosetta2 on Ubuntu:

1. Some dependencies, like scikit-learn, had to be installed manually.
2. The DB variable is not passed properly to all SStor_pred scripts; I had to change the paths in the main*.py files from /projects/ml/ manually myself.
3. On a single Nvidia GeForce RTX 2080 Ti I quickly run out of memory when the TensorFlow library is loaded during trRefine. I fixed this by setting the conda environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true` and limiting the CPUs to 4 for this single step (for the other steps I use 20 CPUs); see the sketch at the end of this comment.
4. During the DeepAccNet-msa step I get the following error (the `lddt: not found` line is repeated many times, interleaved across worker processes):

```
/bin/sh: 1: /home/robetta/rosetta_server_beta/bin/lddt: not found
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/patchy/anaconda3/envs/casp14-baker/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/patchy/anaconda3/envs/casp14-baker/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/patchy/trRosetta2/trRefine/pick_final_models.div.py", line 70, in calc_lddt_dist
    lddt_1 = float(os.popen("/home/robetta/rosetta_server_beta/bin/lddt -c %s %s | grep Glob"%(pose_i, pose_j)).readlines()[-1].split()[-1])
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/patchy/trRosetta2/trRefine/pick_final_models.div.py", line 113, in <module>
    raw_dist = pool.map(calc_lddt_dist, args)
  File "/home/patchy/anaconda3/envs/casp14-baker/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/patchy/anaconda3/envs/casp14-baker/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
IndexError: list index out of range
```

The problem is that trRosetta2/trRefine/pick_final_models.div.py has the following lines 70-71 referencing a Robetta binary:

```python
lddt_1 = float(os.popen("/home/robetta/rosetta_server_beta/bin/lddt -c %s %s | grep Glob"%(pose_i, pose_j)).readlines()[-1].split()[-1])
lddt_2 = float(os.popen("/home/robetta/rosetta_server_beta/bin/lddt -c %s %s | grep Glob"%(pose_j, pose_i)).readlines()[-1].split()[-1])
```

Any idea how to circumvent this? Thank you!
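For reference, a minimal sketch of the workaround from point 3 above, assuming the conda environment is named casp14-baker as in the tracebacks:

```bash
# Let TensorFlow allocate GPU memory on demand instead of grabbing the whole
# card at startup, and persist the variable in the conda environment.
conda activate casp14-baker
conda env config vars set TF_FORCE_GPU_ALLOW_GROWTH=true
conda activate casp14-baker   # re-activate so the new variable takes effect
```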

martinpacesa commented 3 years ago

Ah nevermind, I just figured out that lddt is already part of the trRosetta2 package! I changed the path to reference trRosetta2/lddt/lddt and now everything works!
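For anyone hitting the same error, a hedged one-liner that applies this fix, assuming the repository was cloned to /home/patchy/trRosetta2 as in the tracebacks above:

```bash
# Repoint the hardcoded Robetta lddt path at the copy bundled with trRosetta2.
sed -i 's|/home/robetta/rosetta_server_beta/bin/lddt|/home/patchy/trRosetta2/lddt/lddt|g' \
    /home/patchy/trRosetta2/trRefine/pick_final_models.div.py
```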

martinpacesa commented 3 years ago

Also, I just noticed that crop="dicont" should be crop="discont" in run_pipeline.sh.
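A minimal sketch of that fix, assuming run_pipeline.sh sits in the current directory:

```bash
# Correct the crop mode typo reported above.
sed -i 's/crop="dicont"/crop="discont"/' run_pipeline.sh
```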

partrita commented 3 years ago

@martinpacesa could you share the code that you fixed? Thanks.

martinpacesa commented 3 years ago

What error are you getting? It really is just a question of adjusting the paths, because in some cases they are hardcoded.

gjoni commented 3 years ago

I have now changed the paths to relative ones. Some other bugs were fixed too. You may pull the updated code and try once again.

martinpacesa commented 3 years ago

line 68 in run_pipeline.sh still says crop="dicont"

gjoni commented 3 years ago

fixed this one too. thanks for catching all these bugs!

martinpacesa commented 3 years ago

Glad to help, thanks for sharing the code!!

I had one question: do you think there is a protein size limit for what I can fold on a single 11 GB GPU? I just noticed that a 900 AA protein ran fine, but with 1000 AA TensorFlow gave a memory error. Just wondering if it's related or if I have some other problem there.

gjoni commented 3 years ago

GPU memory should not be a problem for trRosetta if crop="discont" is used in step 5 of the pipeline. trRefine (step 7) may fail though on bigger proteins because of insufficient GPU memory. Which step is causing errors in your case?

martinpacesa commented 3 years ago

I had it during trRefine; like I said, anything below 900 AA ran fine. I am now running a 1400 AA protein to see if I get the same error. This is the error I got during the folding of the 1000 AA protein (from trRefine.stderr): trRefine.txt

martinpacesa commented 3 years ago

From what I found, this error might occur when TensorFlow tries to allocate bigger memory chunks. Is it possible to make it allocate smaller chunks?

martinpacesa commented 3 years ago

I have managed to solve the issue by reducing the length limit for cropping to 500:

```bash
if [ $LEN -gt 500 ]
then
    crop="discont"
else
    crop="cont"
fi
```

Is there a downside to reducing it from 700?

gjoni commented 3 years ago

crop="discont" works better for big proteins and is a bit more accurate at recapitulating domain-domain interactions compared to crop="cont". The latter on the other hand gives a bit more accurate predictions for individual domains. We haven't benchmarked this thoroughly but I feel that reducing $LEN down to 500 should be fine.

martinpacesa commented 3 years ago

Thank you for the answer! I will keep the 500 value for now and report back if I see any issues.

martinpacesa commented 3 years ago

Okay, so after pulling the newest version of the code from GitHub, I get the following error right after running hhsearch:

```
/home/patchy/trRosetta2/run_pipeline.sh: line 53: 14657 Segmentation fault (core dumped) $HH -i $WDIR/t000_.msa0.ss2.a3m -o $WDIR/t000_.hhr -v 0 > $WDIR/log/hhsearch.stdout 2> $WDIR/log/hhsearch.stderr
```

HHblits and PSIPRED seem to run fine, but the hhsearch error log is empty.

martinpacesa commented 3 years ago

I have 128 GB of RAM and don't see it running out of memory. I did not have this issue with the previous version of the code, where hhsuite was called locally.

martinpacesa commented 3 years ago

I can confirm the segmentation fault only happens during the hhsearch step and only in the newly built conda environment; in the environment built for the previous version of the code it works fine.

martinpacesa commented 3 years ago

Removing hhsuite from the conda environment fixes the segmentation fault, though I could not figure out why it occurs.
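For the record, a sketch of that workaround; the package name hhsuite and the environment name casp14-baker are assumptions, so check `conda list` first:

```bash
# Drop the conda-packaged hh-suite so the pipeline falls back to a locally
# built copy (package name assumed; verify it with `conda list`).
conda activate casp14-baker
conda remove hhsuite
```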

martinpacesa commented 3 years ago

I am still having trouble folding proteins above 800 AA on the RTX 2080 Ti with 11 GB of VRAM. During the trRefine step, TensorFlow runs out of memory. Ideally I would get more VRAM, but currently it's hard to get your hands on a set of two graphics cards around here (thanks, crypto). Is there a way to run trRefine on the CPU?

gjoni commented 3 years ago

Yeah, we are also having the same issues with long proteins. An easy workaround is to run trRefine on the CPU by modifying line #131 of the run_pipeline.sh script to:

```bash
CUDA_VISIBLE_DEVICES="" python $PIPEDIR/trRefine/run_trRefine_DAN.py -msa_npz $WDIR/t000_.msa.npz \
```

In my tests it takes around 10 minutes to process a 1300 AA protein using 4 CPU cores and 32 GB of RAM.

martinpacesa commented 3 years ago

Thanks a lot for the tip! I will give it a go and let you know. I will probably just make it a conditional statement: run anything smaller than 800 AA on the GPU and anything bigger on the CPU.
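A rough sketch of that conditional, reusing the $LEN, $PIPEDIR, and $WDIR variables already defined in run_pipeline.sh; the 800-residue cutoff is just the empirical limit of an 11 GB card reported in this thread, not a benchmarked value:

```bash
# Hide the GPUs for long targets so TensorFlow falls back to the CPU;
# the remaining run_trRefine_DAN.py arguments stay as on line 131 of run_pipeline.sh.
if [ $LEN -gt 800 ]
then
    DEVICES=""    # no visible GPUs -> CPU-only trRefine
else
    DEVICES="0"   # use the first GPU
fi
CUDA_VISIBLE_DEVICES="$DEVICES" python $PIPEDIR/trRefine/run_trRefine_DAN.py -msa_npz $WDIR/t000_.msa.npz \
```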

martinpacesa commented 3 years ago

> Yeah, we are also having the same issues with long proteins. An easy workaround is to run trRefine on the CPU by modifying line #131 of the run_pipeline.sh script to:
>
> ```bash
> CUDA_VISIBLE_DEVICES="" python $PIPEDIR/trRefine/run_trRefine_DAN.py -msa_npz $WDIR/t000_.msa.npz \
> ```
>
> In my tests it takes around 10 minutes to process a 1300 AA protein using 4 CPU cores and 32 GB of RAM.

This did not seem to work for me; I get the following error. As far as I understand, it's due to my CPU architecture?

trRefine.txt