HenrikBengtsson / CBI-software

A Scientific Software Stack for HPC (CentOS oriented)
https://wynton.ucsf.edu/hpc/software/software-repositories.html

SRA Toolkit: Workaround to avoid crashing host due to BeeGFS overload #17

Closed HenrikBengtsson closed 3 years ago

HenrikBengtsson commented 3 years ago

Issue

It turns out that fasterq-dump from the SRA Toolkit can bring a machine down to the point where w and ps are unresponsive, although commands such as ls still work. I've done some troubleshooting of this over at UCSF Wynton.

Troubleshooting

Turns out that others have reported similar problems, e.g.

  1. https://github.com/ncbi/sra-tools/issues/463, and
  2. https://github.com/ncbi/sra-tools/issues/161

The 2nd is interesting because there's a comment, https://github.com/ncbi/sra-tools/issues/161#issuecomment-808294889, pointing toward BeeGFS:

We have seen issues before with fasterq-dump and beegfs; fasterq-dump does I/O from multiple threads, I suspect (I have no access to an installation of beegfs to verify) the file system driver doesn't like that and deadlocks. (A process can't respond to a signal when it is in kernel space, that why Ctrl+C doesn't work.)

which might also explain why w and ps don't work (or?)

Now, the above comment also mentions:

..., fasterq-dump uses a lot of temporary files, it is probably best to create these on a locally attached device. You can set the location with -t|--temp but the default is the cwd. ...

So, the fact that fasterq-dump uses $PWD for temp files could certainly explain why BeeGFS gets hit too hard, causing the above hang. I think the reason they're using $PWD is that they're worried about small, possibly RAM-mounted /tmp folders being filled up by their big files.
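To illustrate the trade-off above (local temp space vs. the risk of filling a small /tmp), here is a minimal sketch of a helper that picks a temp directory. The function name pick_temp_dir and the free-space threshold are hypothetical and not part of the SRA Toolkit or the CBI module; the sketch also assumes that $TMPDIR (or /tmp) is on node-local storage rather than on BeeGFS.

```shell
# Hypothetical helper (not part of SRA Toolkit or the CBI module):
# print a directory to pass to 'fasterq-dump --temp'.
# Prefers ${TMPDIR:-/tmp} when it has at least $1 KiB available;
# otherwise falls back to $PWD, which is fasterq-dump's own default.
pick_temp_dir() {
    local min_kib=$1
    local candidate=${TMPDIR:-/tmp}
    local avail_kib
    # 'df -Pk' prints one header line and one data line; field 4 is the
    # available space in 1024-byte blocks.
    avail_kib=$(df -Pk "$candidate" 2>/dev/null | awk 'NR == 2 { print $4 }')
    if [ "${avail_kib:-0}" -ge "$min_kib" ]; then
        printf '%s\n' "$candidate"
    else
        printf '%s\n' "$PWD"
    fi
}

# Example use (the 50 GiB threshold is an arbitrary assumption):
#   fasterq-dump --temp "$(pick_temp_dir 52428800)" SRR000001
```

Whether falling back to $PWD is acceptable depends on the site; on a BeeGFS-backed working directory it would reintroduce the load that the issue describes.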

BTW, I've posted some follow-up comments there, which are mostly feature requests to make it possible to change the defaults via environment variables.

Patch

I've updated module load CBI sratoolkit to inject --temp "$(mktemp -d)" so that fasterq-dump will default to $TMPDIR instead of $PWD. I do this by defining a Bash function:

$ type fasterq-dump
fasterq-dump is a function
fasterq-dump () 
{ 
    command fasterq-dump --temp "$(mktemp -d)" "$@"
}
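One thing the function above does not do is remove the directory created by mktemp -d afterwards. Below is a sketch of a variant that cleans up after itself while preserving the exit status of fasterq-dump. The name fasterq_dump_clean is hypothetical, and this is not what the CBI module deploys; it only shows one way the idea could be extended.

```shell
# Hypothetical variant of the wrapper (not the deployed CBI function):
# same --temp injection, plus removal of the temp directory afterwards.
fasterq_dump_clean() {
    local tmpdir status
    # With no template, 'mktemp -d' creates the directory under $TMPDIR
    # if set, otherwise under /tmp.
    tmpdir=$(mktemp -d) || return 1
    # 'command' bypasses shell functions and runs the executable on PATH.
    command fasterq-dump --temp "$tmpdir" "$@"
    status=$?
    rm -rf -- "$tmpdir"
    return "$status"
}
```

It would be called like the original, e.g. fasterq_dump_clean SRR000001. The cleanup does not run if the shell itself is killed, so stale directories may still remain after a hang.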

I've verified (on my local Ubuntu machine) that fasterq-dump SRR000001 works, and that even fasterq-dump --version works (despite the injected --temp ... option).

HenrikBengtsson commented 3 years ago

@hgputnam, I've just deployed these updates on C4, e.g.

$ module load CBI sratoolkit

$ fasterq-dump --version

"fasterq-dump" version 2.11.0

$ type fasterq-dump
fasterq-dump is a function
fasterq-dump () 
{ 
    command fasterq-dump --temp "$(mktemp -d)" "$@"
}

hgputnam commented 3 years ago

Thanks. Do you think I should make listserv/slack announcements for stuff like this?

HenrikBengtsson commented 3 years ago

Nah. I've done some basic testing and it seems to work. Let's keep our fingers crossed that there aren't any corner cases where this causes problems. If that happens, we'll hear about it. Not having this patch in place is probably worse; Wynton dev nodes have gone down over the last year because of this bug.