PharmGKB / PharmCAT

The Pharmacogenomic Clinical Annotation Tool

Enhance preprocessor performance #141

Closed · anh151 closed this issue 2 months ago

anh151 commented 1 year ago

Hello again,

One of the challenges that we have run into is that as the number of samples grows, the time to split the VCF into single-sample VCFs grows more than linearly. Part of the reason is that bcftools has to parse and read the entire file for each sample it extracts.

As an example, running the preprocessor with ~1,500 samples, the compute time is negligible (<30 seconds). However, with the new release of All of Us data (n=~250k), the compute time to create each individual VCF is about 15 sec/sample. As the size of the data continues to grow, and as we have to rerun to match changes in the data and updates to PharmVar, CPIC, etc., the compute time of the preprocessor becomes an important consideration. I also tried using the multi-sample VCF as input directly into PharmCAT, and the performance wasn't great either, likely for the same reasons.

Has there been any thought to improving its performance? One thought I had was: why use bcftools to split the samples at all? If the VCF file is already cleaned and prepped for PharmCAT, with the only exception being that it's multi-sample, splitting the samples can be achieved with a huge performance improvement using plain Python or bash. Example code that I quickly put together is below; it was able to write a file for each sample (n=250k) in ~10 minutes on a CPU with only 2 cores.

import concurrent.futures
import csv
import os

# read the multi-sample VCF once, keeping header lines, column names, and data rows in memory
headers = []
data = []
with open("vcf_corrected.vcf") as f_in:
    csv_reader = csv.reader(f_in, delimiter='\t')
    for line in csv_reader:
        if line[0].startswith("##"):
            headers.append(line)
        elif line[0] == "#CHROM":
            col_names = line
        else:
            data.append(line)
# simplifies excess headers in All of Us file
headers = headers[:10]

dir_name = 'test'
os.mkdir(dir_name)

def write_output(i):
    # write a single-sample VCF: the 9 fixed VCF columns plus sample column i
    sample_id = col_names[i]
    with open(f"{dir_name}/{sample_id}.vcf", 'w') as f:
        csv_writer = csv.writer(f, delimiter='\t')
        csv_writer.writerows(headers)
        csv_writer.writerow(col_names[:9] + [col_names[i]])
        for line in data:
            csv_writer.writerow(line[:9] + [line[i]])

# sample columns start at index 9 in a VCF
counters = range(9, len(col_names))
with concurrent.futures.ThreadPoolExecutor() as e:
    e.map(write_output, counters)

Let me know your thoughts. -Andrew

markwoon commented 1 year ago

FYI, the latest version of PharmCAT does not require single-sample VCFs, and the latest version of the preprocessor no longer splits them by default. We'd love feedback on this change, although there are some caveats that we're trying to document. It's a work in progress.
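As a rough sketch of the new flow (a minimal sketch only; the preprocessor script name and the -vcf option shown here are assumptions on my part, so check the preprocessor's docs/-h output for the exact usage), you can hand it the multi-sample file directly and it will emit one preprocessed multi-sample VCF instead of per-sample files:

import subprocess

# minimal sketch, assuming the preprocessor is invoked as pharmcat_vcf_preprocessor.py
# with a -vcf option; see its -h output for the authoritative flags
subprocess.run(
    ["python3", "pharmcat_vcf_preprocessor.py", "-vcf", "multi_sample.vcf.gz"],
    check=True,
)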

As for using Python instead of bcftools: as mentioned above, we're discovering there are some unexpected issues with multi-sample VCFs. @BinglanLi has the details and we're trying to figure out the best way forward. So yes, splitting a multi-sample VCF with Python may be faster, but it introduces some problems.

One note on the latest release (2.4.0): we've found a problem with how it calls SLCO1B1, and I'm in the middle of fixing it. I'd suggest it for testing, but not for production use right now.

anh151 commented 1 year ago

I did try creating the multi-sample VCF and using that directly with PharmCAT, but it still felt slow to me. That was a while back, so I can try it again. Are there certain parameters, such as the number of CPUs or the amount of memory, that would optimize a run using this approach? The max CPUs available on a single VM through All of Us is 96, with memory options of 86.4, 360, or 624 GB.

I figured I might be oversimplifying the problem using Python but I was curious to hear your approaches.

Thanks for the heads up on SLCO1B1.

Another alternative approach on my end might be to use bcftools to split the large VCF into smaller multi-sample VCFs and use those as repeated input into the preprocessor.

markwoon commented 1 year ago

We've been working on improving PharmCAT's performance, but haven't had a good opportunity to really focus on it.

I've added BatchPharmCAT in addition to PharmCAT, which allows you to run PharmCAT concurrently. I don't have docs yet because I'm still testing it out, but you're welcome to try it. The -h flag will give you all the options; the new one you'll care about is -cp for the number of concurrent processes.
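As a minimal sketch of an invocation (the jar/class path and the -vcf flag below are from memory and may not be exact; only -h and -cp are confirmed above, so trust the -h output over this):

import subprocess

# minimal sketch, not the documented invocation: the class path and -vcf are
# assumptions; -cp is BatchPharmCAT's concurrent-processes option
subprocess.run(
    ["java", "-cp", "pharmcat.jar", "org.pharmgkb.pharmcat.BatchPharmCAT",
     "-vcf", "preprocessed.vcf.gz",  # preprocessed multi-sample VCF
     "-cp", "16"],                   # number of concurrent processes
    check=True,
)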

The new pharmcat_pipeline script uses it under the hood. The docs there may help, but the arguments it takes are not a 1-to-1 mapping to BatchPharmCAT's.

In our testing, it can speed things up quite a bit. However, we're noticing that with larger multi-sample VCF files things fall off precipitously after about 50K samples. I'm not sure if this has something to do with the number of processes, the amount of memory, or something else. I'd love to find time to investigate and improve this, but since we've got no funding at the moment, I'm prioritizing bugs over features/speed.

anh151 commented 1 year ago

That's very helpful. I'll try some tests on my end and follow up.

Does the SLCO1B1 calling problem exist in previous versions as well? We used v2.2.1 for the All of Us PGx paper. Does that have the same issue?

markwoon commented 1 year ago

The SLCO1B1 problem is only in the HTML report. If you're only using the JSON results you are not affected.

Results from 2.2.1 should not be affected at all.

anh151 commented 1 year ago

Yeah, we only use the JSON, so it won't affect us.

Following up on our previous discussion.

I ended up splitting the multi-sample VCF (n=250k) into smaller subsets of samples (n=~5,000) prior to using the preprocessor. This improved the performance of splitting into single-sample VCFs by about 30x, so it made a big difference. I also tried using the smaller (n=~5,000) multi-sample VCFs directly with pharmcat_pipeline, and that seemed to go really fast as well; compute time was ~4.6 secs/sample. I think splitting into these smaller subsets will probably be the approach long-term. Example code below:

run_command(f"./bin/bcftools query -l {vcf_compressed} > all_sample_ids.txt", shell=True)
all_sample_ids = np.loadtxt("all_sample_ids.txt", dtype=np.int64)
id_chunks = np.array_split(all_sample_ids, all_sample_ids.shape[0]//5_000)
for id_chunk in id_chunks:
    id_fname = "sample_ids_selected.txt"
    np.savetxt(id_fname, id_chunk, fmt="%i") # saves as int
    vcf_fname = os.path.join(cwd, "filtered_samples.vcf.gz")
    run_command(["./bin/bcftools", "view", "-S", id_fname, vcf_compressed, "-Oz", "-o", vcf_fname])
    run_command(["./bin/tabix", "-p", "vcf", vcf_fname])
    # Run preprocessor or PharmCAT
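    # e.g. hand the chunk to pharmcat_pipeline -- a sketch on my end; the exact
    # arguments aren't verified here, so check pharmcat_pipeline -h before using it
    run_command(["pharmcat_pipeline", vcf_fname])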