brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Erroneous "index not found" errors #4

Closed. jblachly closed this issue 5 years ago

jblachly commented 5 years ago

Brent: Great looking tool. I was able to get it to work occasionally.

When passing many samples via *.bam, and notably only in multi-threaded mode, one or more of somalier's threads will report:

index not found for:<bamfile.bam>

When run in single-threaded mode it works (as far as I can tell; it is too slow on cohorts of hundreds for me to have waited for full results yet). When run directly on only two samples for which prior runs had given the error, it works fine, even in multi-threaded mode (with only limited testing, admittedly).

Could there be a data race somewhere?

brentp commented 5 years ago

I haven't seen this happen before. You're certain there's an index for every BAM file you specified? Any chance you could narrow it down to a single bam? Does the error message give the name of the problematic bam+index? I'll do more testing in January as I intend to work more on somalier and I appreciate any feedback.
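If it helps to rule that out quickly, a small pre-flight check along these lines (a Python/pysam sketch, not part of somalier; the glob pattern is just an example) will list any BAM that is missing an index:

```python
# Hypothetical pre-flight check (Python + pysam), not part of somalier:
# list any BAM matched by a glob that has no index file alongside it.
import glob
import sys

import pysam


def has_index(path: str) -> bool:
    # has_index() reports whether pysam could find an index for this BAM
    with pysam.AlignmentFile(path, "rb") as bam:
        return bam.has_index()


pattern = sys.argv[1] if len(sys.argv) > 1 else "*.bam"
missing = [p for p in sorted(glob.glob(pattern)) if not has_index(p)]

if missing:
    print("BAMs without an index:")
    for p in missing:
        print("  " + p)
else:
    print("all BAMs appear to have an index")
```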

jblachly commented 5 years ago

@brentp It can happen with any BAM when passing a large cohort; running on that same BAM with just one other sample always succeeds.

Someone in my group repeated this with a separate cohort and experienced the same thing.

We have not had problems when running in single-threaded mode, which leads me to suspect a race condition. Sorry I don't have time to dig into the code itself.

brentp commented 5 years ago

thanks for following up. I think I may have found a cause and solution for this. Would you try the attached binary and verify that it works for you? I tried it on 45 samples and the parallelization is not very good (I'll try to improve that for the next release), but it does not hit the problem you describe. I am removing the cluster plot as it's (in our experience) not that informative; as it happens, it's also the cause of the bug. somalier_dev.gz

If you have any additional feedback, it would be much appreciated.

charlesgregory commented 5 years ago

I am the someone from @jblachly's group. Using the dev binary you posted, I get a lot of error messages like the ones below:

[E::hts_open_format] Failed to open file sample1.bam
[hts-nim] could open 'sample1.bam'
... (repeats for all samples)
[E::hts_open_format] Failed to open file sample20.bam
[hts-nim] could open 'sample20.bam'
must open index before querying

or sometimes this:

[E::hts_open_format] Failed to open file sample1.bam
[hts-nim] could open 'sample1.bam'
... (repeats for all samples)
[E::hts_open_format] Failed to open file sample20.bam
[hts-nim] could open 'sample20.bam'
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

This seems to happen when using too many threads with too many samples. With 40 threads, the maximum number of samples I can run without issue seems to be about 20; with 20 threads, it is about 50. We are trying to run a cohort of about 300 samples.

brentp commented 5 years ago

hi, thanks for following up.

is it possible your file system (presumably NFS or some other network system) limits the number of threads accessing a particular directory or file? Or that you have a low-ish limit (like 1024) on the number of open files?
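for a rough sanity check on that, something like the following (a Python sketch; the thread and sample counts are assumed numbers, and the factor of 2 for BAM + index is an approximation) compares the soft open-file limit against what a run might need if every thread opens every file:

```python
# Rough estimate (assumed numbers): compare the soft open-file limit
# against threads * BAMs * 2 (each handle may need the BAM plus its index).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
threads, n_bams = 40, 300        # hypothetical run parameters
needed = threads * n_bams * 2    # approximate handles if every thread opens every BAM

print(f"soft limit: {soft}, hard limit: {hard}, approx. needed: {needed}")
if needed > soft:
    print("likely to hit open-file errors; use fewer threads or raise `ulimit -n`")
```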

I'll see if I can extract a more informative message from htslib and propagate that to somalier so you'll have more information.

A minor thing, but I just fixed the error message so that: [hts-nim] could open 'sample20.bam' will now read: [hts-nim] could not open 'sample20.bam'

brentp commented 5 years ago

one way you could get around this is to limit the sites file to the first ~1000 sites (instead of the full 20K) and then run with fewer threads. that will give you a good estimate of relatedness but not take nearly as long to run.
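for example, something like this (a Python sketch; it assumes the sites file is a plain-text or gzipped VCF, and the file names are placeholders) keeps all header lines and writes out just the first 1000 sites:

```python
# Trim a sites VCF to its first N records (sketch; file names are placeholders).
import gzip

src, dst, n_sites = "sites.vcf.gz", "sites.first1000.vcf", 1000
opener = gzip.open if src.endswith(".gz") else open

kept = 0
with opener(src, "rt") as fin, open(dst, "w") as fout:
    for line in fin:
        if line.startswith("#"):      # keep every VCF header line
            fout.write(line)
        elif kept < n_sites:          # then the first n_sites records
            fout.write(line)
            kept += 1
        else:
            break
```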

jblachly commented 5 years ago

It looks like your suspicion about the limit on open files may be correct. @charlesgregory can provide more details. I am surprised to learn that every thread opens every file (N_open_files = N_threads * N_BAMs), but this seems to be the problem.

Somalier is really useful for screening very large cohorts, but when running with 200-500 samples, I would really love to be able to use more than 2-4 threads (we can bump limits up on our own systems, but when running on shared clusters users may not have control over this).

We agree that the lower-triangular matrix plot was not helpful, whereas the point cloud is very helpful.

brentp commented 5 years ago

I may be able to change the parallelization to get around the requirement of threads * files and still have reasonable performance, but most systems now allow hundreds of thousands of open files, and often millions. I think your best bet is to use a smaller sites file so your job can finish in a reasonable time. I'll see about (getting and) improving the error message for the problem you are encountering.

brentp commented 5 years ago

I had quite a long look at this today and implemented an alternative parallelization (by sample). That would avoid the problems you are having, but it comes at a cost when there are few samples. Because of this, and the code complexity that would be required to parallelize over both samples and sites, I am going to leave it as it is. I hope this will not be a problem on most systems.

I'm still looking into how to get the error message.

brentp commented 5 years ago

well, actually, I parallelized by sample and it's fast (and simple) enough that I'll probably make that the default in the next release.
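to illustrate the by-sample idea (a Python sketch, not somalier's actual Nim code; the site list and file names are placeholders): when the work is split by sample, each worker opens only its own BAM, so the number of open files scales with the worker count rather than threads * samples.

```python
# Sketch of "parallelize by sample": each worker opens only its own BAM.
from concurrent.futures import ProcessPoolExecutor

import pysam

# placeholder site list; somalier's sites file has ~20K sites
SITES = [("chr1", 1_000_000), ("chr2", 2_000_000)]


def depths_for_sample(bam_path: str) -> dict:
    """Open one BAM and return the read count at each site."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return {(chrom, pos): bam.count(chrom, pos, pos + 1) for chrom, pos in SITES}


if __name__ == "__main__":
    bams = ["sample1.bam", "sample2.bam"]  # hypothetical inputs
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = dict(zip(bams, pool.map(depths_for_sample, bams)))
    print(results)
```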

brentp commented 5 years ago

Hi, I am attaching a binary that gets around the problem you encountered and has better parallelization, in case you'd like to try it out before the next release.

somalier.gz

brentp commented 5 years ago

I have published a release that resolves this. Thanks again for reporting.

https://github.com/brentp/somalier/releases/tag/v0.1.0