broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

SV Pipeline Jobs LongRunning : BAFFromGVCFs_ImportGVCFs and GenotypeVCFs #7646

Open spatel-gfb opened 2 years ago

spatel-gfb commented 2 years ago

[GenomicsDB]

As part of our pipeline, we are running the command below. This step reads the VCF files in batches of 50 (so 4 batches in total for 156 samples) and, for each batch, writes the tables/data to the "genomicsdb" folder.

Parent Command : python /gatk/gatk --java-options -Xmx4g -Xms4g GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --reader-threads 5 -ip 500 --gcs-project-for-requester-pays broad-dsde-methods

Child Process : java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Xms4g -jar /gatk/gatk-package-4.1.8.1-local.jar GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --reader-threads 5 -ip 500 --gcs-project-for-requester-pays broad-dsde-methods
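
For reference, the inputs.list file passed to --sample-name-map is a tab-delimited file mapping each sample name to its GVCF path (the sample names and bucket paths below are hypothetical placeholders):

sample_001	gs://my-bucket/gvcfs/sample_001.g.vcf.gz
sample_002	gs://my-bucket/gvcfs/sample_002.g.vcf.gz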

The above command took approx. 3.5 hrs while writing to a local mount on the EC2 instance, i.e. an EBS volume. The same command took 8+ hrs (still running as of this writing) while writing to an FSx for Lustre mount. And surprisingly, when run through AWS Batch on EC2 as part of the complete pipeline, it took 40+ hrs.

The files being read by this process are already cached in FSx, as we have been using this same FSx filesystem for 5+ days now and these jobs have already succeeded with 30-40 hrs of runtime.

While we were running the manual executions below, nothing else was active from a Batch or FSx perspective. Only the 2 manual jobs were running - one writing to local (EBS) and the other to FSx.

The FSx filesystem we are using is the scratch deployment type with 16.8 TB of space, which gives us a total throughput of 3.3 GB/s.

Below is a snapshot of the batch 1 executions.

EBS Mount Run: took a total of 1 hr in batch 1. (screenshot: "EBS Mount Run Batch 1")

FSx Mount Run: took 2 hrs 11 mins in batch 1. (screenshot: "FSX Run Batch 1")

But when we ran the "dd" command to test write speeds on both filesystems, FSx showed much greater speed/performance.

Command : dd if=/dev/zero of=<Local/FSx>/test.img bs=1G count=5 oflag=dsync

Run writing to local (EC2 EBS mount):

root@6ece7fab91ec:/app# dd if=/dev/zero of=/app/test.img bs=1G count=5 oflag=dsync
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 51.5764 s, 104 MB/s

Run writing to the FSx for Lustre mount on EC2:

[root@ip-10-76-63-158 genomicsdb]# dd if=/dev/zero of=/gfb-dev-sv-fsx-results-us-east-2/cromwell-execution/GATKSVPipelineBatch/087bd722-5f51-43eb-a89e-70846a1da89f/call-GATKSVPipelinePhase1/GATKSVPipelinePhase1/38595c13-b874-4753-a554-81c09f6449f8/call-GatherBatchEvidence/GatherBatchEvidence/c8120761-6d9f-4bd3-b450-f528b7be817c/call-BAFFromGVCFs/BAFFromGVCFs/d5032666-9c09-4857-a8d7-41042927cf89/call-ImportGVCFs/shard-389/genomicsdb/test.img bs=1G count=5 oflag=dsync
5+0 records in
5+0 records out
5368709120 bytes (5.4 GB) copied, 23.5143 s, 228 MB/s
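
One caveat with this comparison: dd with bs=1G measures large sequential streaming throughput, whereas GenomicsDBImport issues many small writes. A small-block synchronous probe along these lines (the target path is a placeholder) may be more representative of the import's I/O pattern:

dd if=/dev/zero of=/mnt/fsx/test_small.img bs=4k count=10000 oflag=dsync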

We also ran the jobs with strace enabled and found millions of FUTEX_WAIT_PRIVATE calls in the trace when the job writes to FSx, compared to just 26 when it writes to EBS.
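
A capture along these lines produces the logs grepped below (the log filename is a placeholder; -f follows the forked java child process):

strace -f -e trace=futex -o strace_fsx_writing.log \
  python /gatk/gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --reader-threads 5 -ip 500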

Local EBS writing strace log (Ran around 3.5 hrs)

[root@ip-10-76-62-193 importvcf-job]# egrep "FUTEX_WAIT_PRIVATE, 0, NULL" ./local-write-logs/strace_local_writing.log | wc -l
26

FSx strace logs

--reader-threads 5 (Ran around 7 hours)

[root@ip-10-76-62-193 importvcf-job]# egrep "FUTEX_WAIT_PRIVATE, 0, NULL" ./fsx-write-logs/strace_fsx_writing.log | wc -l
24378265

--reader-threads 1

[root@ip-10-76-62-193 importvcf-job]# egrep "FUTEX_WAIT_PRIVATE, 0, NULL" ./strace_thread_1_fsx.txt | wc -l
8745113

--reader-threads 2

[root@ip-10-76-62-193 importvcf-job]# egrep "FUTEX_WAIT_PRIVATE, 0, NULL" ./strace_thread_2_fsx.txt | wc -l
13946622

--reader-threads 10

[root@ip-10-76-62-193 importvcf-job]# egrep "FUTEX_WAIT_PRIVATE, 0, NULL" ./strace_thread_10_fsx.txt | wc -l
13535883

The last 3 tests (--reader-threads 1, 2, and 10) were each run for only 20 mins, and in those 20 minutes each loaded only around 220 MB into the genomicsdb workspace.

Note that the above executions were done on separate EC2 instances with 4 CPUs and 30 GB of memory (each running only 1 job at a given time). The FSx executions were done one after the other.

Please help us with troubleshooting this performance issue.

droazen commented 2 years ago

@spatel-gfb Can you try running GenomicsDBImport with the --genomicsdb-shared-posixfs-optimizations argument and see if that helps?

@nalinigans @mlathara any other suggestions?

nalinigans commented 2 years ago

@droazen, yes, for Lustre we should pass --genomicsdb-shared-posixfs-optimizations, especially with GenomicsDBImport.
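
For example, taking the import command from this issue, the flag would be appended like so (a sketch; the java options are quoted for clarity):

python /gatk/gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --reader-threads 5 -ip 500 --genomicsdb-shared-posixfs-optimizations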

spatel-gfb commented 2 years ago

Thanks @droazen and @nalinigans, I will try it out and get back with results.

spatel-gfb commented 2 years ago

Hello @droazen and @nalinigans, we tried out the parameter you suggested and it did improve performance, but not to the level of writing the data locally and then moving it to FSx for Lustre.

Using the parameter, the job completed in ~2 hrs 52 mins, whereas when we write to EBS volumes it completes within 1 hr.

nalinigans commented 2 years ago

Thanks @spatel-gfb. The --consolidate argument is what will give you the query performance, as otherwise the consolidation happens in memory when running GenotypeGVCFs or SelectVariants. I am working on the fix for the consolidation issue and hopefully will have it in a branch in a day or so.

Yes, running queries from a local disk will be the fastest, but running from NFS/Lustre/Cloud Storage should be comparable.
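
On the import side, a sketch of what passing --consolidate to GenomicsDBImport would look like, combined with the POSIX optimizations (the other arguments are carried over from the command in this issue):

python /gatk/gatk GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --consolidate --genomicsdb-shared-posixfs-optimizations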

spatel-gfb commented 2 years ago

Thanks @nalinigans, will it also help with importing VCFs into the GenomicsDB workspace?

nalinigans commented 2 years ago

Modifying what I wrote earlier - I got confused with another issue.

I am not familiar with Lustre and Lustre configuration. Did the excessive file locking from Lustre (the FUTEX_WAIT_PRIVATE calls?) go away with --genomicsdb-shared-posixfs-optimizations?

Is there any way to configure Lustre buffer sizes for writing? If not, can you try setting the environment variable TILEDB_UPLOAD_BUFFER_SIZE to something like 5242880 (5 MB) and retry GenomicsDBImport? Does it help with performance? Is the amount of file locking lower than before?

If the performance is still not acceptable: what version of GATK are you using? Can you use the latest GATK and try the --bypass-feature-reader option with GenomicsDBImport? Does that help with performance?
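
Putting both suggestions together, the invocation would look something like this (a sketch; the buffer size is the 5 MB value suggested above):

export TILEDB_UPLOAD_BUFFER_SIZE=5242880
python /gatk/gatk GenomicsDBImport --genomicsdb-workspace-path genomicsdb --batch-size 50 -L chrX:51630606-68003941 --sample-name-map inputs.list --genomicsdb-shared-posixfs-optimizations --bypass-feature-reader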

spatel-gfb commented 2 years ago

Hello @nalinigans,

As part of the gatk-sv pipeline we are using GATK v4.1.8.1, which doesn't have the --bypass-feature-reader option. Also, we didn't capture strace for the run with just "--genomicsdb-shared-posixfs-optimizations", so we won't be able to share the FUTEX call counts for it.

After upgrading to v4.2.4.1 we get the results below.

- Using "--genomicsdb-shared-posixfs-optimizations" & "--bypass-feature-reader" the process took 118 mins.
       "FUTEX_WAIT_PRIVATE, 0, NULL" : 1266

- Using "--genomicsdb-shared-posixfs-optimizations" & "--bypass-feature-reader" and 

TILEDB_UPLOAD_BUFFER_SIZE=5242880 as env variable the process took 113 mins. "FUTEX_WAIT_PRIVATE, 0, NULL" : 3

- Even using 10 MB as buffer size resulted in same execution time of 113 mins.
- Using a buffer size bigger i.e. 50 MBs caused the process to run slower so we aborted it.

Please let us know if we can improve it further.

mlathara commented 2 years ago

It's hard to say exactly what is going on here... but a couple of clarifying questions. You stated that importing to EBS was roughly 2x faster than to Lustre (~1 hr vs ~2 hrs). Is that right? And are the VCFs on Lustre in both cases?

We're not sure exactly what is going on here... generally we do see some slowdown on shared filesystems compared to privately mounted disks. Do you have access to CloudWatch metrics or the like? I'm not entirely sure what to look for there, but looking at IOPS over time, transactions per second, read/write transaction size, etc. might show something interesting.

ldgauthier commented 2 years ago

@spatel-gfb another option that should improve performance is to run the GATK GVCF postprocessing tool ReblockGVCF on all of your input GVCFs. That tool merges reference blocks (which the BAF calculation doesn't need) and throws out uncalled alleles (which the BAF calculation also doesn't need), both of which should speed up your import. For the purposes of GATK-SV you can use the default GQ bands ([0, 20] and (20, 99]).
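
A minimal per-sample invocation would look something like this (a sketch; the reference and file names are placeholders, and per the suggestion above the tool's default GQ bands are already appropriate):

python /gatk/gatk ReblockGVCF -R ref.fasta -V sample_001.g.vcf.gz -O sample_001.reblocked.g.vcf.gz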

spatel-gfb commented 2 years ago

Thanks @ldgauthier, we will try that option out in a few weeks. For now, as a workaround, we are writing to local EBS volumes, which gives the best performance for this job, and then moving the files to the shared storage.