dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

Build on CentOS7 #267

Open james-vincent opened 2 years ago

james-vincent commented 2 years ago

Has anyone successfully built for CentOS 7 with glibc 2.17?

We are unable to build manually or to extract a glibc 2.17-based executable from the Docker build.

mlin commented 2 years ago

I haven't tried, but devtoolset-8 is the most likely path to success. It would go something like the following (on top of the other dependencies):

yum install -y -q centos-release-scl
yum install -y -q devtoolset-8-gcc devtoolset-8-gcc-c++ devtoolset-8-make
scl enable devtoolset-8 "cmake -Dtest=ON . && make -j$(nproc) && ctest -V"

jcm6t commented 1 year ago

Belated, and for others reading this -- yes, we have built 1.4.5 on CentOS 7. We did not have success with devtoolsets and had to build it one library at a time. Our initial tests suggest it is stable and working correctly. It is a pretty involved process requiring non-standard installations of gcc, gmake, glibc, and custom static libraries, plus much frustration and many hours/days; CMake is especially finicky. It is not a flip-a-switch kind of process.
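
A very rough sketch of the general shape of the process (illustrative only, not our exact recipe; the versions, URLs, and /opt prefixes below are placeholders):

# Build a modern GCC from source, since CentOS 7 ships gcc 4.8
curl -LO https://ftp.gnu.org/gnu/gcc/gcc-9.5.0/gcc-9.5.0.tar.gz
tar xf gcc-9.5.0.tar.gz && cd gcc-9.5.0
./contrib/download_prerequisites
mkdir build && cd build
../configure --prefix=/opt/gcc-9 --disable-multilib --enable-languages=c,c++
make -j$(nproc) && make install

# Then build each dependency as a static library against the new toolchain
# (hypothetical install prefix /opt/deps), and point GLnexus' CMake at it
export CC=/opt/gcc-9/bin/gcc CXX=/opt/gcc-9/bin/g++
export CMAKE_PREFIX_PATH=/opt/deps
cmake -Dtest=ON . && make -j$(nproc)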

If there were a place for community contributions to this project we could upload the executable (caveat emptor)

shaze commented 1 year ago

@jcm6t If you would be prepared to share your binary that would be great. I've spent many hours failing with this.

jcm6t commented 1 year ago

> @jcm6t If you would be prepared to share your binary that would be great. I've spent many hours failing with this.

@shaze My advice is to give up on dnanexus. We built it, and we thought we had it working with a smaller subset of samples, but once we ran it with 1300 whole genomes we hit bad_alloc errors that we could never fix, with or without jemalloc. It also failed running the Docker container under Singularity. Hours? We wasted at least a month on this. It appears to be minimally supported now, at least in the open-source space: there has been no formal release in two years after a spate of work from around 2020 to mid-2021, and AFAIK there is no publicly stated commitment to support it. It looked really promising, but it is a codebase we felt uncomfortable building our pipelines on, so we switched to bcftools.

We worked with Petr to fix a bug in bcftools - the thread, including the pipeline we used, is here: https://github.com/samtools/bcftools/issues/1891

There has been a subsequent fix in bcftools 1.17; see the thread. The only issue is that bcftools isn't multi-threaded (and it really should be), so you'll have to split up your chromosomes to use a parallel cluster. I wish we had spent the time we wasted on dnanexus writing a more flexible genome shatter-and-ligate results pipeline instead.
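
Roughly, the shatter-and-ligate pattern with bcftools looks like this (a minimal sketch under my assumptions, not the exact pipeline from the linked thread; gvcfs.txt and the region list are placeholders):

# Merge per-sample gVCFs and call per region, then concatenate the pieces.
# gvcfs.txt lists one indexed per-sample gVCF path per line.
for region in chr1 chr2 chr3; do    # in practice, each iteration is a separate cluster job
  bcftools merge --gvcf ref.fa -r "$region" -l gvcfs.txt -Ou \
    | bcftools call -m -Oz -o merged.$region.vcf.gz
  bcftools index -t merged.$region.vcf.gz
done
bcftools concat -Oz -o merged.all.vcf.gz merged.chr*.vcf.gz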

shaze commented 1 year ago

@jcm6t

Thank you for your comments -- I appreciate the feedback. This is disappointing.

Regarding the bcftools pipeline: I had understood from the paper that GLnexus does more than just merge the files (I don't mean "just" in a derogatory sense, since I understand the complexity of that) but does some sort of joint calling using the probabilities of calls across the samples. Do you have experience of this and an idea of the loss of accuracy, if any? Or have I misunderstood?

I presume that using the bcftools range options we don't have to physically shatter the files, since that would create a big computational and I/O strain across many samples.
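
(Something like the following is what I have in mind, assuming the per-sample gVCFs are indexed; the region string and file-list name are just examples:)

bcftools merge --gvcf ref.fa -r chr1:1-50000000 -l gvcfs.txt -Oz -o chr1.part1.vcf.gz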

We've had reasonable performance with the latest binary release (1.4.1) up to ~1460 samples. The only problem is that it is extremely memory intensive -- chromosome 1 takes ~850 GB of RAM, and memory requirements scale linearly with chromosome size. I can't see why this should be the case. I was hoping to avoid splitting chromosomes, but we will be recalling ~1560 samples.
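
(For reference, the sort of invocation I've been running is below. If I recall correctly, glnexus_cli accepts --mem-gbytes and --threads to cap resource usage, but the flag names should be checked against glnexus_cli --help, and the config preset and BED file depend on your upstream caller and region of interest:)

glnexus_cli --config DeepVariantWGS --bed chr1.bed --mem-gbytes 900 --threads 32 \
  gvcfs/*.g.vcf.gz > chr1.bcf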

jcm6t commented 1 year ago

@shaze It sounds like you are also seeing memory-management issues, and that was one of our problems.

No, we explicitly did not want an 'intelligent' population genotyper; we wanted a plain-vanilla gVCF merge, but your mileage may differ. IMHO it depends on the individual-sample mapper and genotyper. With GATK you are signing up for the whole pipeline, but once you start mixing and matching, the effect of each step is complex and could result in poorer-quality results if the population caller overrides individual sample calls, depending on the parameters. Under these circumstances I would want to be very sure and clear about what the population caller is doing under the covers. If this is genome discovery, meh, maybe okay, but if you are using it for clinical genetics you'd need to be very careful.

You can apply population-level filtering metrics in bcftools after the merge, but it won't use joint sample likelihoods to call a genotype in an individual sample that didn't have sufficient evidence - unless you write your own custom population caller. The advantage is that you can tweak exactly what you want to drop or change, and set the parameters yourself.
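
For example, something along these lines (a sketch; the tags and thresholds are arbitrary placeholders, not our production settings):

# Recompute cohort-level tags after the merge, then filter on them.
bcftools +fill-tags merged.vcf.gz -Ou -- -t AC,AN,AF,F_MISSING \
  | bcftools view -e 'F_MISSING>0.1 || AC==0' -Oz -o merged.filtered.vcf.gz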

As an example, Illumina have released a white paper showing that their population genotyper doesn't do any better than the individual-sample genotype quality from DRAGEN. Of course, DRAGEN was developed to be ultra-accurate for rapid, one-sample-at-a-time rare-disease diagnosis. And by the way, the Broad are adopting the DRAGEN mapper into GATK.

Splitting the chromosome gVCF input for bcftools is not that hard. We simply split chr 1 to 5 into thirds and chr 6 to 12 into halves, and ran the rest of the chromosomes in one go, but that was a function of our cluster queue limits. There is a setting you can use to ensure that the variants are partitioned between fragment gVCFs exactly, without any overlap, even allowing for messy VNTRs, so we could have split the chromosomes into tens or hundreds of pieces in an embarrassingly parallel mode.

Good luck.

shaze commented 1 year ago

Thanks -- I was able to call 1530 30-40x datasets, squeaking just inside the 1 TB RAM of our biggest machine.

We are exploring the use of DRAGEN too.

Thanks for all your help