ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Clarify output if spades is killed: error code -9 #76

Open fungs opened 6 years ago

fungs commented 6 years ago

Hi,

I ran metaspades.py (v3.11.1, Linux x86_64 pre-compiled) on a regular Illumina paired-end library, but the assembly does not run through. Judging from the error message, it does not seem to be an issue with memory; if it is, that needs to be clarified!

metaspades.py -o assembly_metaspades -1 r1.clean.fq.gz -2 r2.clean.fq.gz -s s.clean.fq.gz --phred-offset 33 -t 34 -m 250

This is the shortened output:

  7:14:10.214   171G / 173G  INFO   StageManager             (stage.cpp                 : 126)   STAGE == Preliminary Distance Estimation
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 224)   Processing library #0
  7:14:10.214   171G / 173G  INFO    General                 (distance_estimation.cpp   : 130)   Weight Filter Done
  7:14:10.214   171G / 173G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SIMPLE distance estimator
  9:17:49.559   185G / 193G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
  9:17:49.559   185G / 193G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 579977625
  9:22:57.444   200G / 209G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
  9:22:57.944   184G / 209G  INFO    General                 (distance_estimation.cpp   : 172)   Refining clustered pair information
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 174)   The refining of clustered pair information has been finished
  9:24:50.102   184G / 209G  INFO    General                 (distance_estimation.cpp   : 176)   Improving paired information
  9:41:24.351   187G / 219G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 38388342; contradictional = 0
 10:00:26.002   188G / 225G  INFO   PairInfoImprover         (pair_info_improver.hpp    : 103)   Paired info stats: missing = 6099975; contradictional = 0
 10:00:26.002   188G / 225G  INFO    General                 (distance_estimation.cpp   : 183)   Filling scaffolding index
 10:00:26.002   188G / 225G  INFO   DistanceEstimator        (distance_estimation.hpp   : 116)   Using SMOOTHING distance estimator
 12:21:36.894   192G / 225G  INFO    General                 (distance_estimation.cpp   :  35)   Filtering info
 12:21:36.905   192G / 225G  INFO    General                 (pair_info_filters.hpp     : 242)   Start filtering; index size: 97970244
 12:23:04.656   196G / 225G  INFO    General                 (pair_info_filters.hpp     : 263)   Done filtering
 12:23:05.117   192G / 225G  INFO    General                 (distance_estimation.cpp   : 228)   Clearing raw paired index

== Error ==  system call for: "['spades', 'K55/configs/config.info', 'K55/configs/mda_mode.info', 'K55/configs/meta_mode.info']" finished abnormally, err code: -9

So it seems to fail at the large k-mer size of 55. What is the meaning of error code -9?

Thanks!

asl commented 6 years ago

That is an odd place for it to fail. Does the problem reproduce when you re-run the SPAdes job?

fungs commented 6 years ago

Yes, I did; same thing.

asl commented 6 years ago

Ok. Most probably we'll need the data to reproduce and fix the issue. Will it be possible for you to share it with us?

fungs commented 6 years ago

Sorry, I don't think I can share these data at this point. I will try to run it on another machine and see if it results in the same error. It must have to do with data size, as I assembled splits of one quarter of the data successfully.

Let me see if I can get a stripped-down version that reproduces the error. Thanks for your prompt responses, you are clearly very motivated! You don't see that very often in academic open-source projects.

fungs commented 6 years ago

I couldn't find out what the exit code means by digging through the source code; maybe you can!

asl commented 6 years ago

Here the process was killed (-9 is the SIGKILL signal). The biggest question is the reason. Do you happen to have enough free RAM available?
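To make that number less mysterious: the Python wrapper sees the child's exit status, and a negative value is simply the number of the signal that terminated the child. A minimal sketch in plain Python (illustrative only, not the actual spades.py code):

import signal
import subprocess

# Start a stand-in child process and kill it the same way the kernel's
# OOM killer would (SIGKILL cannot be caught or handled by the child).
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)
proc.wait()

print(proc.returncode)                        # -9
print(signal.Signals(-proc.returncode).name)  # SIGKILL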

Momosandy commented 6 years ago

Same issue with hammer.

fungs commented 6 years ago

I went through the (user-readable) system log but couldn't find anything about the process being killed. Overall, this is what I suspected first. I will try to run it on different hardware to see if that runs through.

Momosandy commented 6 years ago

Thank you Johannes,

What is the limit I am allowed in terms of number of threads and CPUs on Abel? How much RAM am I allocated on Abel?

How could I get some more memory in order to run this analysis?


fungs commented 6 years ago

@asl: I ran again on this hardware and this time found an out-of-memory kill in the log, and it stopped in a different step. I guess the only thing remaining on the development side is to better report such issues and communicate them to the user.

@Momosandy: Hi, I don't know the limits for your specific system; you need to talk to your tech support. On cluster systems, there are usually different queues and options for allocating resources like RAM. If it crashes on your system, try running the following to verify:

dmesg -T

It should give you a long list of kernel messages, among them should be an entry such as in my case:

[Sa Feb  3 00:21:41 2018] Out of memory: Kill process 136340 (spades) score 270 or sacrifice child
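If the dmesg output is very long, a quick filter like this pulls out the relevant lines (just a sketch; reading the kernel log may require root privileges on some systems):

import subprocess

# Print only the OOM-killer-related lines from the kernel log.
log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Out of memory" in line or "oom" in line.lower():
        print(line)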

Momosandy commented 6 years ago

Hi Johannes, thank you for testing it on your side.

@Anton: Any idea?


asl commented 6 years ago

@Momosandy I have no idea about your cluster configuration; you'd better ask your system administrator.

fungs commented 6 years ago

I changed the title; please consider changing the handling of signals so the user knows what is going on. In my case, the process was killed by the system due to high memory usage.

asl commented 6 years ago

@fungs Unfortunately, there is no way to tell that the process was killed by the OOM killer; all we see is the SIGKILL signal. When we do know the reason (e.g. the system failed to allocate RAM for us and let us know), we report it.

fungs commented 6 years ago

Yes, report that the process was killed!

fungs commented 6 years ago

Maybe you know what error code -9 means, but since error codes are not standardized and not all users on a system have access to system logs, it would be much better to point the user to the source of the error as precisely as possible (also to avoid unnecessary issues on GitHub).
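For example, something along these lines in the wrapper's error handling would already help (just a sketch of the idea, not a patch against the actual spades.py code; the helper name is made up):

import signal

def describe_exit(returncode):
    # Translate a raw child return code into a human-readable message.
    if returncode < 0:
        sig = signal.Signals(-returncode)
        msg = "the child process was killed by signal %s (%d)" % (sig.name, -returncode)
        if sig is signal.SIGKILL:
            msg += "; this is often the kernel OOM killer, check 'dmesg -T' or ask your administrator"
        return msg
    return "the child process exited with code %d" % returncode

print(describe_exit(-9))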

mmokrejs commented 6 years ago

I agree SPAdes should collect dmesg -T in such cases. In a clustered environment, it is hard to get access to the same computational node after the job has exited. I wouldn't mind if the output of the free command were captured too.

fungs commented 6 years ago

@mmokrejs: that would probably go beyond what most programs do (and it is quite system-specific), but it would be good to give a clear message that the process was killed and by which signal. That is often enough to figure out what might have gone wrong.

mmokrejs commented 6 years ago

It does not matter whether any of these fails on an obscure platform; this is exactly the right moment to capture some system info. Many users are not aware of syslog files and dmesg, and as I said, it is quite an effort to get to them after the job has finished.

asl commented 6 years ago

We cannot do this reliably: on many systems, reading the kernel log (aka dmesg) would require root privileges. We certainly do not want to handle errors in the error path (double errors are a nightmare to handle); instead, we would like the error path to be as straightforward as possible. Users can check the result code and do whatever state capturing they want, e.g. run dmesg, free, df, squeue, etc., i.e. whatever diagnostic tools are available on the system.
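For example, a small wrapper in the job script can do that capturing right on the node where the job ran (a sketch only; the metaspades.py arguments and the list of tools are placeholders to adapt for your system):

import subprocess

# Run the assembler; if it exits abnormally, capture whatever diagnostic
# tools are available on this node while we still have access to it.
result = subprocess.run(["metaspades.py", "-o", "assembly", "-1", "r1.fq.gz", "-2", "r2.fq.gz"])
if result.returncode != 0:
    for tool in (["dmesg", "-T"], ["free", "-h"], ["df", "-h"], ["squeue"]):
        try:
            out = subprocess.run(tool, capture_output=True, text=True).stdout
            print("=== %s ===" % " ".join(tool))
            print(out)
        except FileNotFoundError:
            pass  # tool not installed on this system; skip it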

mmokrejs commented 6 years ago

As I said, it is hard to get access back to the computational node; it could even have been rebooted in the meantime. I do not mind if execution of e.g. squeue fails because it is not available on the system. My point is: try to get any info which is available.