MRCIEU / gwas2vcf

Convert GWAS summary statistics to VCF
MIT License

benchmarks and speedups #56

Open darked89 opened 3 years ago

darked89 commented 3 years ago

Hello,

I have completed a TSV-to-VCF conversion of one FinnGen GWAS summary file as a test case. gwas2vcf was run in a Singularity container.
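
A minimal sketch of the invocation (the image name, bind points, input TSV path, and the /app/main.py entry point are placeholders/assumptions; the flags and the reference/dbSNP paths follow the Gwas2VCF_command headers quoted later in this thread):

singularity exec \
    --bind /data:/data --bind /genome:/genome \
    gwas2vcf.sif \
    python /app/main.py \
    --data /data/finngen_sumstats.tsv \
    --json /data/finngen.json \
    --id amoeb01 \
    --ref /genome/hs38p13.ens_pa.fa \
    --dbsnp /genome/dbSNP_155.GRCh38.names_fixed.vcf.gz \
    --out /data/amoeb01.vcf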

Both the genomic FASTA and the dbSNP VCF had chromosome IDs in the same 1-22,X,Y,MT format, were indexed, etc.

The output VCF looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  amoeb01
1       108391  rs1274919517    A       G       .       PASS    .       ES:SE:LP:ID     0.3653:1.9411:0.0702236:rs1274919517

This was executed in 582.99 mins (real);
user time 577.51 mins, sys time 5.38 mins.

My questions:

  1. Is this in line with the running times you have observed, or is ~10 hrs/file a fluke (i.e. too slow)?
  2. Would it be possible to speed up the processing?

Thank you

Darek Kedra

mcgml commented 3 years ago

Hi @darked89

Thanks for the feedback. 10 hours does seem slow. It typically takes 2-4 hours for densely imputed data (10M variants). How many variants are you mapping? The process is very I/O intensive, so fast storage will vastly improve performance.

Thanks Matt

darked89 commented 3 years ago

Dear Matt,

The TSV input has >16M rows/positions. Compressing it with bgzip and indexing with tabix unfortunately did not improve the time needed to process it:

time ./run_amoeb_finn_whole_genome_dbsnp155_MT_gz-input.sh > error.amoeb.02.log 2>&1
real    588m40.557s
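
For reference, the compression and indexing step was roughly this (the file name and column numbers are illustrative; the chromosome and position are assumed to be the first two columns):

# bgzip-compress the summary stats, then build a tabix index on them
bgzip finngen_sumstats.tsv
# -s/-b/-e name the chromosome/begin/end columns; -S 1 skips the header line
tabix -s 1 -b 2 -e 2 -S 1 finngen_sumstats.tsv.gz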

Profiling with py-spy (substituting the real PID of the python3 process running gwas2vcf):

py-spy top --pid 123456

showed that the program spends >95% of its time executing update_dbsnp (gwas.py:94). No idea whether e.g. a newer pysam (not that it would be an easy change to make) would improve the speed.

Best,

DK

mcgml commented 3 years ago

Thanks @darked89. I will look at implementing cyvcf2 for reading/writing VCF files, which is around 7x faster in the published example.

In the meantime, you could speed up the conversion by splitting your GWAS into chunks and running each chunk concurrently.
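
A minimal sketch of that approach with standard tools (the chunk size, file names, and the gwas2vcf entry point are assumptions; the flags mirror those used elsewhere in this thread):

#!/usr/bin/env bash
set -euo pipefail

# split the body of the TSV into 2M-line chunks, re-attaching the header to each
header=$(head -n 1 sumstats.tsv)
tail -n +2 sumstats.tsv | split -l 2000000 - chunk_
for f in chunk_*; do
    { printf '%s\n' "$header"; cat "$f"; } > "$f.tsv" && rm "$f"
done

# run one gwas2vcf job per chunk in the background, then wait for all of them
for f in chunk_*.tsv; do
    python /app/main.py --data "$f" --json finngen.json --id test01 \
        --ref hs38p13.ens_pa.fa --dbsnp dbSNP_155.GRCh38.names_fixed.vcf.gz \
        --out "$f.vcf" &
done
wait

The per-chunk VCFs can then be combined with bcftools concat.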

darked89 commented 3 years ago

Dear Matt,

For a large set of summary stats, e.g. from FinnGen, a quick hack to try is to reduce the size of the dbSNP VCF: create a customized mini-dbSNP VCF by selecting only the dbSNP entries at positions present in the given biobank's data. I still have to recheck that the results obtained this way are identical to the ones obtained using the whole dbSNP.
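
A sketch of how such a subset could be built with awk and bcftools (the column numbers and file names are illustrative; the dbSNP VCF must be tabix-indexed, as it already is here):

# collect the unique (CHROM, POS) pairs from the summary stats, skipping the header
tail -n +2 finngen_sumstats.tsv | awk -v OFS='\t' '{print $1, $2}' | sort -u \
    > finngen_positions.txt

# keep only the dbSNP records at those positions, then re-index the result
bcftools view -R finngen_positions.txt -Oz \
    -o dbsnp155_finngen_only.vcf.gz dbSNP_155.GRCh38.names_fixed.vcf.gz
tabix -p vcf dbsnp155_finngen_only.vcf.gz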

Can you think of any reason the two results might differ?

Best,

DK

mcgml commented 3 years ago

Hi @darked89

It should give the same results. I doubt the performance would improve, though, since the SNP lookup uses tabix rather than reading the whole dbSNP VCF.

Do you observe a performance improvement?

Thanks Matt

darked89 commented 3 years ago

Hi Matt,

I used just a small subset of the original FinnGen data (chromosome 22) and ran it against both dbSNP files:

# VCF sizes
 1.4G   dbsnp155_finngen_only_all_positions.vcf.gz
25G     dbSNP_155.GRCh38.names_fixed.vcf.gz

# time
'dbsnp': '/genome/dbsnp155_finngen_only_all_positions.vcf.gz': 109.69 secs
'dbsnp': '/genome/dbSNP_155.GRCh38.names_fixed.vcf.gz':  517.38 secs

Maybe there are some delays in accessing the drive in our setup, but it looks like the size of the VCF does affect how fast one can query the indexed file, at least in some environments.

Hope it helps,

DK

mcgml commented 3 years ago

Thanks @darked89! That's a huge difference in performance. I will investigate.

If you would like to try, I created a new branch, cyvcf2, which uses cyvcf2 for dbSNP lookups. The package is written in Cython, so it should offer improved query speeds.
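
For anyone who wants to sanity-check the lookup speed independently, a minimal cyvcf2 region query looks roughly like this (the file name and region are placeholders):

python3 - <<'EOF'
from cyvcf2 import VCF

# cyvcf2 wraps htslib, so region queries on a bgzipped, tabix-indexed
# VCF use the .tbi/.csi index rather than scanning the whole file
dbsnp = VCF("dbSNP_155.GRCh38.names_fixed.vcf.gz")

# fetch all records overlapping a 1-based chrom:start-end region
for rec in dbsnp("22:16050075-16051000"):
    print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT)
EOF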

Thanks Matt

darked89 commented 3 years ago

Hello,

I was unsure whether some silly snafu was somehow giving me a corrupted or incomplete output VCF really, really fast, so I computed md5sums on the non-header portions of the two outputs (whole dbSNP vs. FinnGen subset of dbSNP):

pigz --stdout --decompress 20210705_chr22_dbsnp-subset_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-subset_sarco.vcf.no_header
pigz --stdout --decompress 20210705_chr22_dbsnp-whole_sarco.vcf.gz | rg -v '^#' > 20210705_chr22_dbsnp-whole_sarco.vcf.no_header

md5sum 20210705_chr22_dbsnp-*no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-subset_sarco.vcf.no_header
94dc99a7ec9265ec8df907ad30f9c39b  20210705_chr22_dbsnp-whole_sarco.vcf.no_header

and the headers show the different commands:

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbsnp155_finngen_only_all_positions.vcf.gz --out /data/20210705_chr22_dbsnp-subset_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:46:55.360262

##Gwas2VCF_command=--data /data/22_chrom_finngen_R5_D3_SARCOIDOSIS.tsv --json /data/finngen.json --id sarco01 --ref /genome/hs38p13.ens_pa.fa --dbsnp /genome/dbSNP_155.GRCh38.names_fixed.vcf.gz --out /data/20210705_chr22_dbsnp-whole_sarco.vcf --cohort_controls 215712 --cohort_cases 2046; 1.3.1
##file_date=2021-07-05T11:58:52.434064

So it does not look like some late-night error produced results too good to be true. Or so I hope... ;)

DK