MrOlm / inStrain


inStrain compare sometimes hanging up #148

Open cusoiv opened 1 year ago

cusoiv commented 1 year ago

Hi! Thanks for developing this wonderful tool and especially the very detailed documentation =)

I seem to have run into a very weird problem when trying to run inStrain compare. When I try to run it on groups of .IS objects, the program sometimes gets hung up on the first step of Step 2 and seems to stop running completely: if I type "top" on the node where it is running, it tells me nothing is running at all. There are no specific nodes where the program always hangs, and it also doesn't seem to correlate with the number of .IS objects in the group, since I've seen the problem in groups ranging from 10 to 50 .IS objects. I also checked whether it was a memory issue, but the maximum memory used when the same group does run successfully never exceeded the amount of memory I requested for the runs that hung.

I have attached below the end of the log.log file, where things always seem to stop:

23-05-02 11:52:57 DEBUG Loading LD-6.IS
23-05-02 11:52:57 DEBUG Loading LD-60.IS
23-05-02 11:52:57 DEBUG Loading LD-64.IS
23-05-02 11:52:57 DEBUG Loading LD-67.IS
23-05-02 11:52:57 INFO 1 of 1 scaffolds are in at least 2 samples
23-05-02 11:52:57 DEBUG Checkpoint Compare CreateScaffoldComparisonObjects end 164802560
23-05-02 11:52:57 INFO *** ..:: inStrain compare Step 2. Run comparisons ::..
23-05-02 11:52:57 DEBUG Checkpoint Compare multiprocessing start 164802560
23-05-02 11:52:57 INFO Running group 1 of 1

Many thanks, Annie

MrOlm commented 1 year ago

Hi @cusoiv - thanks for the kind words and the bug report. A couple of questions-

1) Are you using the latest version of inStrain?

2) Does the hang-up always happen for some group of IS files, or does it work if you re-run?

3) Have you tried with -p 1? That disables the multiprocessing part of inStrain and could fix the issue if multiprocessing is the problem.
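For example, a compare call with -p 1 might look something like this (the .IS folders, .stb file, and output name here are placeholders, not your actual paths):

inStrain compare -i sample_A.IS sample_B.IS sample_C.IS -s genomes.stb -o compare_p1_test -p 1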

Thanks, Matt

cusoiv commented 1 year ago

Hi @MrOlm,

Thanks for the fast reply! So for the questions -

  1. I am using inStrain 1.7.1, which I think is the latest version.
  2. The hang-up often disappears when I re-run the same group of .IS files, but not always. I started with 80 different groups of .IS objects, and 22 hung up on the first run. I was able to get 18 of the 22 to eventually run (some took more than one rerun, though), and 4 hung up every single time after 5-6 tries.
  3. I have been using -p 1 from the very beginning, because when I initially tested the code it did not seem to use the number of cores I requested with -p and only used 1 core regardless, so I decided to use -p 1 for everything.

I am attaching the log.log file I have for one of the persistently hung up groups here: log.log

Thanks a lot again! Annie

MrOlm commented 1 year ago

Hi Annie,

How frustrating. OK I have two ideas to try:

1) Upgrading to Python 3.9 or above might fix this. I think this is a problem with the underlying Python multiprocessing code, and there was a huge revision of the multiprocessing internals in Python 3.8.

2) If this doesn't work, it is possible to have Python generate a report on where exactly it's spending all this time, using a command like the following:

python -m cProfile -o optimize.cprof /home/mattolm/miniconda3/bin/inStrain profile 8004_dereplicated_v1.fasta-vs-Pilot_8004_1_1_F4.subsampled.bam 8004_dereplicated_v1.fasta -o test -g 8004_dereplicated_v1.fasta.fna -s 8004_dereplicated_v1.stb -p 1

The main point here is to start the command with python -m cProfile -o optimize.cprof, then point to the actual location of inStrain instead of just typing inStrain (in your current case that would be /home/apps/conda/miniconda3/envs/instrain-1.7.1/bin/inStrain), and then type the rest of the command as normal. This will generate the file optimize.cprof, which, if you send it to me, I can load on my computer to see where the problem is. If it hangs, just let it hang for a few hours or so and then kill the run; it should still generate the optimize.cprof file.
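For anyone following along, here is a minimal sketch of how a dump like optimize.cprof can be inspected, assuming only Python's standard-library pstats module (the file name matches the -o argument above; nothing here is inStrain-specific):

```python
# Minimal sketch: load a cProfile dump and print the most expensive calls.
# Assumes only the Python standard library; adjust the file name as needed.
import pstats

stats = pstats.Stats("optimize.cprof")
stats.strip_dirs()              # shorten long file paths for readability
stats.sort_stats("cumulative")  # rank by cumulative time spent in each call
stats.print_stats(20)           # show the top 20 entries
```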

I would start with (1) and only proceed to (2) if it doesn't work.

Thanks for your reporting of the bug and sorry it's happening!

Matt

fbeghini commented 1 year ago

Hi @MrOlm, I've been experiencing the same issue reported by @cusoiv. I tried upgrading Python from 3.8 to 3.10 and running compare with -p 1, but the outcome is always the same: the profiles are all loaded into RAM and then it hangs during the second step. For some runs, the comparison process starts and then hangs during the "Comparing scaffold" step at random percentages, while for others it never goes beyond the "Running group 1 of 1" step.

This happened for around 20 genomes, while for the remaining genomes, the process completes without any problem. I'm attaching here the profile log for one of the runs. cProfile_MGYG000002545.log

Thank you in advance, Francesco

cusoiv commented 1 year ago

Hi @MrOlm,

Thanks a lot for the suggestions! Sorry, I got distracted by other work and didn't start exploring this problem until late this week. I asked our system admins to upgrade the Python version for inStrain 1.7.1 and 1.7.5, but they found that the highest Python version they could upgrade to (without breaking dependencies) was 3.8.16, for inStrain 1.7.5. I tried afterwards with 1.7.5, but the hanging continues.

I am currently trying out the second route, but I am a bit confused about what to do here. It seems like the example command is for the profile step; given that I have no problems with the profile step, should I still run the command for a random metagenome from the set of *.IS files that has the compare problem?

I also tried doing python -m cProfile -o optimize.cprof /home/apps/conda/miniconda3/envs/instrain-1.7.5/bin/inStrain compare -i *.IS -s ${basename}.stb -p 1 --genome ${GENOMENAME}_cf_consensus.fasta

but I didn't seem to get any output. So I am a bit confused and wondering if I had misunderstood something, and would really appreciate it if you could clarify a bit.

Many thanks again, Annie

MrOlm commented 1 year ago

Hi @fbeghini and @cusoiv - thank you both for your help tracking down this super frustrating problem. The "cProfile_MGYG000002545.log" that @fbeghini attached just showed the program spending a lot of time loading files with pandas, so there isn't much actionable information there.

@fbeghini - am I understanding correctly that this problem only happens for some genomes, and that with those genomes it fails every time? If so, would you mind attaching the files needed for me to reconstitute a failed run on my end? That would be the 2 IS profile folders, the genome file, and the command that's failing. If I can reconstitute the problem on my end, I should be able to fix it.

Thanks again and sorry this is happening

-Matt

fbeghini commented 1 year ago

Hi @MrOlm, I have 492 IS profiles, but in the log file I see that "1 of 1 scaffolds are in at least 2 samples". Since you mentioned sending you the 2 IS folders, is there a file I can check to find out which 2 samples those are?

MrOlm commented 1 year ago

Ahhhh, if you're using 492 IS profiles, that could be the problem as well. Because it uses pairwise comparisons, running more than 80 profiles or so at a time can take a long time.

What I was suggesting is to just try running 2 profiles and see if that fixes the problem. If it does not, let me know. If it does, the issue could just be that you're running too many profiles at once.

There are 2 ways of addressing this. You can either divide the profiles into random groups (see the "Strain sharing analysis" methods here - https://www.biorxiv.org/content/10.1101/2022.03.30.486478v2.full) and/or run genomes 1 at a time (see here - https://github.com/MrOlm/inStrain/issues/133).
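As a rough sketch of the random-groups idea, something like the following could be used to split the profiles before launching separate compare runs (the group size of 40 and the glob pattern are illustrative assumptions, not prescriptions from the paper):

```python
# Sketch: shuffle all .IS profile folders and split them into groups of <= 40,
# so that each `inStrain compare` run stays a manageable size.
import random
from glob import glob

profiles = sorted(glob("*.IS"))   # all .IS profile folders in the working directory
random.shuffle(profiles)

group_size = 40                   # assumed cap; adjust for your data
groups = [profiles[i:i + group_size] for i in range(0, len(profiles), group_size)]

for n, group in enumerate(groups, start=1):
    # Each group would become its own `inStrain compare -i ... -s ... -o ... -p ...` run.
    print(f"group {n}: {len(group)} profiles -> {' '.join(group)}")
```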

Best, Matt

cusoiv commented 1 year ago

Hi @MrOlm,

I am quite sure that I have certain sets of IS profiles that constantly fail. One of the smallest sets contains only 14 IS profiles, and I noticed that when I cancelled the job it printed this line in the error message:

/home/apps/conda/miniconda3/envs/instrain-1.7.5/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown

So it seems to have something to do with the multiprocessing (and memory leaks?) even though I am using -p 1. I also did not encounter the same problem when I ran all 14 IS profiles in a pairwise manner, which I assume I can use as a backup plan if this problem cannot be resolved. I also found that I don't encounter the problem if I run the 14 IS profiles on a node with a much higher memory cap (most nodes on our cluster have a 160G memory cap, and the program hangs on every one of those nodes), even though the memory actually used was only around 30G.
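(For reference, the pairwise fallback I mention could be scripted with a short loop like the sketch below; the .stb file name and output pattern are placeholders, not my actual files.)

```python
# Sketch: print one `inStrain compare` command per pair of .IS profiles,
# as a fallback when running the full group hangs. Names are placeholders.
from itertools import combinations
from glob import glob

profiles = sorted(glob("*.IS"))

for a, b in combinations(profiles, 2):
    out = f"compare_{a}_vs_{b}".replace(".IS", "")
    print(f"inStrain compare -i {a} {b} -s genomes.stb -o {out} -p 1")
```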

I am not sure if this is useful, but I am happy to share the 14 IS profiles if that would help in solving the problem.

Thanks again, Annie

fbeghini commented 1 year ago

> There are 2 ways of addressing this. You can either divide the profiles into random groups (see the "Strain sharing analysis" methods here - https://www.biorxiv.org/content/10.1101/2022.03.30.486478v2.full) and/or run genomes 1 at a time (see here - #133).
>
> Best, Matt

Right now I'm running one genome at a time (the command line I use is: inStrain compare -i ${IS} -s genomes.stb -p 40 -o IS.COMPARE.${genome} --database_mode --breadth 0.3 --genome ${genome}). Other genomes with more IS profiles have run successfully, but I'll try the random-groups method!

Thanks, Francesco

MrOlm commented 1 year ago

Hi Annie (@cusoiv): Ahhh, OK - this is almost certainly a RAM issue. I'm used to RAM issues cropping up in a multi-processed setting, but if some samples have very deep sequencing, that could definitely do it. Unfortunately, I have already tried to reduce inStrain's RAM usage as much as possible, so there's not much more I can do. Sorry I didn't think of this as the likely problem sooner.

Hi @fbeghini - Let me know if you get a failure with <80 samples or so. It also depends on the depth of sequencing and things like that; I've also had cases where more samples don't necessarily take more time.

-MO

cusoiv commented 1 year ago

Thanks @MrOlm, I think I'll try running anything that constantly gets stuck on nodes with very high memory limits... or go the pairwise route if that still doesn't work...

MrOlm commented 1 year ago

That sounds like a good plan @cusoiv - let me know if there are continual problems that this doesn't solve.

-MO