HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling

Inefficient in high-throughput, targeted amplicon use case #306

Closed Permafacture closed 3 months ago

Permafacture commented 6 months ago

Our use case is variant calling on the output of a targeted amplicon assay across many samples in parallel. The way clair3_c_impl.sh parallelizes the pileup calling step is inefficient for this case. A single process that accumulates all of the candidates across all of the contigs and then calls variants on them works great, but the best the current script can do is a separate process per contig (if you edit the hard-coded max_chunk_size and set chunk_size to be larger than any single contig).

In the current implementation, each process works on a number of candidates far smaller than the network's batch size. That leaves a number of threads doing nothing or very little, which limits how much real parallelization we can do in our pipeline before the CPU gets heavily overloaded. In the example I have been looking at, the current implementation takes over 12 minutes, while a single process batching all of the pileup candidates takes under 2 minutes, both using a single thread.

We currently have a patch that works for us, but I'd prefer this use case to be officially supported so there is no chance my patch breaks on new versions. Two possible solutions are 1) when a BED file is provided, don't use GNU parallel and instead use a single process that accumulates candidates from the BED file regions, or 2) add a new flag that bypasses GNU parallel and accumulates candidates across the entire genome. In these cases it might be nice for the user-provided number of threads to be used by TensorFlow instead.

In general, since the candidate proposal step is so fast compared to the network prediction or even the process start-up time, it seems to me that accumulating all of the candidates and then using the available threads to process those batches would be more efficient in all cases. It would fully utilize the network batch size, whereas chunking the genome will never give even and consistent batches of candidates. So I wonder whether the current parallelization strategy is beneficial over my proposal in any situation.
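To make the idea concrete, here is a minimal sketch (not Clair3's actual code; propose_candidates and predict are hypothetical stand-ins for the candidate proposal step and the network call) of accumulating candidates from all regions and feeding the network evenly sized batches:

from itertools import islice

def batched(iterable, batch_size):
    # Yield fixed-size batches from any iterator; the last batch may be smaller.
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def call_variants_single_process(regions, batch_size=200):
    # Accumulate candidates across every region, then run the network on
    # full batches instead of one undersized batch per genome chunk.
    all_candidates = (c for region in regions for c in propose_candidates(region))
    for batch in batched(all_candidates, batch_size):
        predict(batch)  # hypothetical network prediction call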

I'd be happy to start a pull request once I know what you are open to having changed.

Thanks!

aquaskyline commented 6 months ago

It isn't easy to decide on the best parallelization strategy for Clair3 because it is used for many purposes. The largest unit per process in Clair3 is a sequence (set through max_chunk_size and chunk_size) for several reasons; for example, the intermediate phasing step, being the most sequence-context-dependent step, works best when it can use the whole context of a single sequence. The candidate proposal step does not always run fast, especially when the coverage is low.

If there is to be a solution that serializes the candidate proposal step, I'd prefer it to be flag-controlled behavior, implemented as additional code with minimal changes to the existing code.

Permafacture commented 6 months ago

I'd be interested to see a BAM where candidate generation is particularly slow so I can understand the situation better. Do you have one you can share?

Here's my proposal

Add a flag no-chunk-genome (or alternatively use --chunk_num=0) which CheckEnvs interprets and outputs a CHUNK_LIST that has None None None in it. CallVariantsFromCffi then chooses between three different candidate generators: if contig is not None, use the current generator; if contig is None and a BED file is provided, generate candidates from the regions in the BED file; if contig is None and no BED file is provided, generate candidates from all contigs represented in the BAM.
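A rough sketch of that dispatch (the helper names are hypothetical, not Clair3's actual functions):

def choose_candidate_generator(contig, bed_file_path, bam_file_path):
    # Dispatch on the CHUNK_LIST entry: a real contig keeps the current
    # per-chunk behavior; None switches to whole-run candidate accumulation.
    if contig is not None:
        return candidates_from_chunk(contig)               # current generator
    if bed_file_path is not None:
        return candidates_from_bed_regions(bed_file_path)  # hypothetical
    return candidates_from_all_contigs(bam_file_path)      # hypothetical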

If I can get a BAM with a significant number of candidates in it, I'd also like to look at letting TensorFlow use the threads GNU parallel isn't using in the case where contig is None. If there's a significant performance improvement from letting Python use multiple threads, I would propose something like this pseudocode for the clair3_c_impl.sh script:

# NO_CHUNK_GENOME would be set from the proposed no-chunk-genome flag
if [ "${NO_CHUNK_GENOME}" = "True" ]; then
    PARALLEL_THREADS=1
    INTERNAL_THREADS=${THREADS}
else
    PARALLEL_THREADS=${THREADS_LOW}
    INTERNAL_THREADS=1
fi

parallel -j ${PARALLEL_THREADS} "${PYTHON} ${CLAIR3} CallVariantsFromCffi --threads=${INTERNAL_THREADS} ..."
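On the Python side, the --threads value could be forwarded to TensorFlow's threading configuration before the model is loaded. A minimal sketch, assuming such a wiring; only the tf.config.threading calls are standard TensorFlow 2.x API:

import tensorflow as tf

def configure_tensorflow_threads(num_threads):
    # Must be called before any TensorFlow op runs; lets a single process use
    # the threads that GNU parallel would otherwise spread across many small
    # processes.
    tf.config.threading.set_intra_op_parallelism_threads(num_threads)
    tf.config.threading.set_inter_op_parallelism_threads(num_threads)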
aquaskyline commented 6 months ago

I'd be interested to see a BAM where candidate generation is particularly slow so I can understand the situation better. Do you have one you can share?

Any whole-genome BAM with average depth between 6-10x. As you can see, that is a situation totally different from the targeted amplicon use case. Both are very common use cases, and Clair3 should optimize for both in the long run.

Here's my proposal

Add a flag no-chunk-genome (or alternatively use --chunk_num=0) which CheckEnvs interprets and outputs a CHUNK_LIST that has None None None in it. CallVariantsFromCffi then chooses between three different candidate generators: if contig is not None, use the current generator; if contig is None and a BED file is provided, generate candidates from the regions in the BED file; if contig is None and no BED file is provided, generate candidates from all contigs represented in the BAM.

--chunk_num=0 sounds like a plan.

If I can get a BAM with a significant number of candidates in it, I'd also like to look at letting TensorFlow use the threads GNU parallel isn't using in the case where contig is None. If there's a significant performance improvement from letting Python use multiple threads, I would propose something like this pseudocode for the clair3_c_impl.sh script:

# NO_CHUNK_GENOME would be set from the proposed no-chunk-genome flag
if [ "${NO_CHUNK_GENOME}" = "True" ]; then
    PARALLEL_THREADS=1
    INTERNAL_THREADS=${THREADS}
else
    PARALLEL_THREADS=${THREADS_LOW}
    INTERNAL_THREADS=1
fi

parallel -j ${PARALLEL_THREADS} "${PYTHON} ${CLAIR3} CallVariantsFromCffi --threads=${INTERNAL_THREADS} ..."

TensorFlow takes up quite some memory, so not setting INTERNAL_THREADS to 1 is meant to improve the usability of Clair3 in some memory-constrained environments. Setting INTERNAL_THREADS to a constant 4 was empirical, also considering that the TensorFlow libraries are huge and the overhead of starting many TensorFlow processes was pretty significant.

Permafacture commented 5 months ago

I don't follow. Currently INTERNAL_THREADS is hard-coded to 1 in the Run method of CallVariantsFromCffi. The current behavior is to start many TensorFlow processes and pay that significant overhead, and I preserved that behavior in my pseudocode above. My proposal is not to start many TensorFlow processes but to have only one that uses multiple cores. It sounds like we agree this would be more memory-efficient and have less overhead.

We don't do whole-genome sequencing, so if you don't have a BAM to link to me, I'm sure I can find one somewhere on the Internet.

I've got enough to get started. Thanks!