Inconsistent pipeline performance when disk space is exceeded

databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing

http://pepatac.databio.org

BSD 2-Clause "Simplified" License

54 stars 15 forks source link

Inconsistent pipeline performance when disk space is exceeded #30

Closed rcorces closed 6 years ago

rcorces commented 7 years ago

This seems like it could be an important issue. TL;DR - I ran the pipeline twice on certain samples and got vastly different file sizes for the aligned bams (differences of 2 - 3 GB) despite both giving a completed flag. I think this happened because I ran out of disk space.

I was looking at the total aligned reads for certain samples and some were alarmingly low (on the scale of 10,000 aligned reads) but the sample had an ATACseq_completed.flag. I re-ran the pipeline on the same fastqs and got a very different result the second time. This is not an issue of non-deterministic vs deterministic. This looks like something failed the first time but wasnt caught and the pipeline forged onwards to completion.

My best guess is that this all happened during a single looper run where I ran out of disk space.

I'm attaching two example logs (from the good run I just did and the truncated run that happened previously). You can see the spots where I ran out of disk space.

Good_STAD-91C7155F-8187-4688-BEAB-C4179C786667-X029-S08-L015-B1-T1-P081_ATACseq_log.md.txt Good_THCA-ABF4154C-FACF-4711-AAC7-35C560F39BC2-X024-S10-L022-B1-T2-P064_ATACseq_log.md.txt Truncated_STAD-91C7155F-8187-4688-BEAB-C4179C786667-X029-S08-L015-B1-T1-P081_ATACseq_log.md.txt Truncated_THCA-ABF4154C-FACF-4711-AAC7-35C560F39BC2-X024-S10-L022-B1-T2-P064_ATACseq_log.md.txt

nsheff commented 7 years ago

I'm looking into it. For the truncated THCA example -- the pipeline did fail, but then it looks like it was restarted. Did you restart any of these processes? They should not restart from looper as it should have a failed flag that would tell you to manually investigate.

Does the truncated STAD run have a failed flag as well as a completed flag?

rcorces commented 7 years ago

I definitely manually restarted it with looper. And yes, the STAD has a failed.flag and a completed.flag. But I cant say for sure why that is. Because I ran out of disk space, I had to rsync to long term storage, delete the completed entries, and restart the failed entries. So the failed flag would have rsync'd and that could be why it is present in my storage folders.

rcorces commented 6 years ago

Just to give you insight into my psyche with how I dealt with looper in the past - Since I havent bothered to figure out why the pipeline is freezing on sherlock at the TSS_enrichment / fragLdistribution steps, I end up with a lot of failed processes. Often times, I will find all of the ATACseq_failed.flag files and delete them without looking one by one at why the pipeline failed. I would assume what got me into trouble here is that I also used to delete any lock. or recover. files prior to restarting the pipeline. I dont do this anymore because you introduced me to the --recover option.

This hasnt been a problem in the past. But presumably the combination of this manual restarting and the disk quota exceeded issue caused the problem. I'm ashamed to be such a lazy and erratic end-user but I guess that is who I am...

nsheff commented 6 years ago

Well my first thought was that the thing that's supposed to prevent this is that you shouldn't restart a failed process without knowing why you're restarting it, and making sure you clean it up manually, which is why it requires a manual restart. So I was really confused about that, because if those processes didn't finish because they failed, then the file lock should still be there, which would prevent this from happening either way.

So if you're manually deleting the lock files and the fail flags, and then restarting looper, then this explains exactly how this could happen.

It's OK to delete a lock file, but then you better delete the file that was locked as well, because if it exists, it will be incomplete. that's exactly what's happening here. The next run just picks up the incomplete version, since you deleted the lock.

In this situation, what you should do is: when you see a bunch of stuff failed, figure out why it failed (disk use quota), solve that problem, and then resubmit from looper with --ignore-flags and --recover. That's it; no manual deleting lock files, recover files, or flags. So really, the intended use case is even easier than you're doing, you're making more work for yourself (aside from breaking it :smile:). You definitely want to use --recover in those situations for exactly this reason. That will restart that process from the beginning instead of using truncated file. If you do it this way, you should never have any issues with incomplete files. This is actually one of the primary reasons I made pypiper in the first place.

Generally speaking, I wouldn't delete lock files manually, ever, except in very specific situations that I knew exactly why I was doing that...that is what recover is intended for.

So it seems like this is just a problem in the way you were using it. I'm really glad you bring these issues up, though -- any way I can make the docs for recover, or maybe lock files, more clear for future users?

rcorces commented 6 years ago

All of that makes sense. And I now use --recover so it wont be a problem in the future.

My current understanding of --recover is that it is a pyPiper command and must be specified in the looper config.yaml file in the pipeline_args section. The way you talk about it, it kind of sounds like it could also be specified in the looper run command like looper run --compute all --recover. I would say that is not very clear in the current readthedocs for pyPiper.

My understanding of --ignore-flags is that it will ignore ALL flags (including completed.flag which is a problem if you dont keep intermediate files because this will use compute time for nothing). Even if that interpretation is incorrect, the docs still say that it will ignore running.flag. From my end-user perspective, this makes it less straight forward to use --ignore-flags. I'll try to explain why below:

Most of the time, I submit ~50-100 samples to looper. I typically am only able to actively run about 15 of those. So, upon submission, lets say 15 jobs get started. Those jobs then create flags etc and start generating files in the results_pipeline folder. Eventually, some of them finish, others get preempted, others fail, and others haven't yet been started (and thus have no flags).

If I were to try to restart the failed jobs, I feel like I cant use looper with the same config file because (A) it will try to resubmit the jobs that havent been started yet and (B) if I use --ignore-flags it will resubmit jobs with a running.flag. If it resubmits those files, doesnt that create a second instance of that job?

Some of this is problematic because I dont want to wait for all of my jobs to finish to assess which jobs need to be restarted etc. Or maybe Im just interpreting all of that incorrectly.

If my interpretation is correct, it could be useful to have an --ignore-failed (just failed) option in addition to --ignore-flags (all flags).

nsheff commented 6 years ago

My current understanding of --recover is that it is a pyPiper command and must be specified in the looper config.yaml file in the pipeline_args section. The way you talk about it, it kind of sounds like it could also be specified in the looper run command like looper run --compute all --recover. I would say that is not very clear in the current readthedocs for pyPiper.

Yes, you can actually do that. You're right, that's nowhere in the docs. I will add it. Any command-line arguments passed to looper run that are not consumed by looper will simply be handed off untouched to all the pipelines. This gives you a handy way to pass-through command-line arguments that you want passed to every job in a given looper run. For example, you may run looper run config.yaml -R -- since looperdoes not understand -R, this will be passed to every pipeline.

If my interpretation is correct, it could be useful to have an --ignore-failed (just failed) option in addition to --ignore-flags (all flags).

Your interpretation is correct. Yes, I have thought about that as well, actually. It would be useful. Or, what about a looper rerun command, that would only run jobs where it found a failed flag? That differs because it wouldn't create new jobs for ones that are queued, but not yet running... I think that may be even better.

rcorces commented 6 years ago

Any command-line arguments passed to looper run that are not consumed by looper will simply be handed off untouched to all the pipelines. This gives you a handy way to pass-through command-line arguments that you want passed to every job in a given looper run. For example, you may run looper run config.yaml -R -- since looperdoes not understand -R, this will be passed to every pipeline.

Does the order/placement of -R matter? I would have naturally placed it as looper run -R config.yaml. If it matters, maybe specify an example like what you said in your comment when you edit the docs?

what about a looper rerun command, that would only run jobs where it found a failed flag? That differs because it wouldn't create new jobs for ones that are queued, but not yet running

That is a GREAT idea. I agree - better than --ignore-failed. I guess looper rerun would by default use --recover in that case?

nsheff commented 6 years ago

Does the order/placement of -R matter? I would have naturally placed it as looper run -R config.yaml. If it matters, maybe specify an example like what you said in your comment when you edit the docs?

Uhh I'm not sure...I think as long as it comes after run it's OK, but this is a python argparser thing and it gets complicated. I think you just have to try it (try it first with dry-run -d so you don't submit, and see if it gets appended). I am pretty sure it will work however you do it.

That is a GREAT idea. I agree - better than --ignore-failed. I guess looper rerun would by default use --recover in that case?

Well, no... because -R is pypiper-specific, so you'd have to do looper rerun -R... for us you would always want to do that (rerun without -R would get stuck in the endless waiting state), but looper could potentially be used with non-pypiper pipelines, so I'm trying to avoid hard-coding pypiper-specific arguments into it... otherwise rerun would be unusable for any command that didn't understand -R

rcorces commented 6 years ago

makes sense! I think we've beaten this "issue" to death. Feel free to close.