Zhangliubin / GBC

GBC (short for GenoType Blocking Compressor) is a blocking compressor for genotype data. It aims to provide a unified and flexible structure, the GenoType Block (GTB), for genotype data stored in variant call format (VCF) files.
http://pmglab.top/gbc/
BSD 3-Clause "New" or "Revised" License

LD calculation produced empty files #1

Open jerome-f opened 1 year ago

jerome-f commented 1 year ago

Hi

This is a great tool and the compression levels are impressive. I am trying to calculate an exhaustive LD matrix with the following settings:

java -Xms16g -Xmx16g -Xss4m -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=500 -jar gbc.jar ld XXXX.gtb --o-bgz -t 4 --gene-r2 -o XXXX.bgz --maf 0.001 --min-r2 0.0001 -bp 5000000 -y 

But the program quits without producing a detailed error message and leaves an empty bgz file:


10:01:48.843 [main] INFO GBC - 
LDTask {
    inputFile: XXXX.gtb
    outputFile: XXXX.bgz
    threads: 4
    subjects: <all subjects>
    LD Model: Genotype LD (Pearson genotypic correlation of variants)
    filter: MAF >= 0.00100000, R^2 >= 0.000100000
    window size: 5000000 bp
}
10:09:50.939 [main] INFO GBC - Total Processing time: 482.089 s; LD file size: 77 B
Chromosome 4 correlation done  Duration: 8 minutes

I am guessing the program runs out of memory (OOM) or hits a division-by-zero error somewhere. Can you help me with this?

Zhangliubin commented 1 year ago

Thank you for your feedback. Based on the limited information available, I can see that your LD window is set quite large (5,000,000 bp) and that the variants are filtered only minimally (MAF >= 0.00100000, R^2 >= 0.000100000). If the file is large, with many samples and dense variants, this can produce a very large number of variant pairs to calculate, which puts significant pressure on memory.
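To make the two filters concrete, here is a minimal sketch of what "Genotype LD (Pearson genotypic correlation of variants)" with MAF and R^2 thresholds usually means. It is not GBC's implementation. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with no missing values, and the function names `maf` and `genotype_r2` are hypothetical. Note that a monomorphic variant has zero variance, so the correlation is undefined; the sketch returns None in that case rather than dividing by zero.

```python
def maf(genotypes):
    """Minor allele frequency for diploid genotypes coded as alt-allele counts (0/1/2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1.0 - alt_freq)


def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors of equal length.

    Returns None when either variant has zero variance (correlation undefined).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return (cov * cov) / (var_x * var_y)


def passes_filters(x, y, min_maf=0.001, min_r2=0.0001):
    """Apply the MAF filter to both variants, then the R^2 filter to the pair."""
    if maf(x) < min_maf or maf(y) < min_maf:
        return False
    r2 = genotype_r2(x, y)
    return r2 is not None and r2 >= min_r2
```

With thresholds as loose as 0.001 and 0.0001, almost every polymorphic pair inside the window passes, so the filters remove very little work.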

Perhaps the GBC-2.0 version could address your issue, as it employs a more memory-efficient model, with our testing indicating a more than ten-fold increase in LD calculation speed compared to version 1.2. Currently, our team is conducting the final functional tests, and we expect to release it next week (31st March 2023). I will inform you of the results after testing with 1000GP3.

jerome-f commented 1 year ago

Hi Zhang,

Thanks for the reply. I am doing an exhaustive calculation to look for independent blocks, so it is admittedly overkill. The sample size is the same as 1000G (~500). Keep me posted on GBC-2.0; I would be keen to test it.

Zhangliubin commented 1 year ago

Thank you for your patience. This issue in GBC has been resolved with memory control optimization and output format switching. You can download the latest version: gbc-stable-1.0.jar. The updated documentation will be available at http://pmglab.top/gbc within the next two days. For now, you can use java -jar gbc-stable-1.0.jar -h to view the usage documentation.

I tested with the VCF file of chromosome 4 (1000GP3-EAS) downloaded from https://pmglab.top/genotypes/, which has 504 samples and is similar to your test dataset, using the following commands:

# Convert VCF file to GTB file
java -jar gbc-stable-1.0.jar vcf2gtb ./1kg.phase3.v5.shapeit2.eas.hg19.chr4.vcf.gz 

Then, run the command with the following parameters:

java -jar gbc-stable-1.0.jar ld ./1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb --min-r2 0.0001 --maf 0.001 -bp 100000 -t 4

The output looks like this:

2023-03-31 04:56:50 INFO  [main] GBC - Command Line Interface Check that the input file is ordered according to the coordinates...
2023-03-31 04:56:50 INFO  [main] GBC - Command Line Interface 
Calculate Genotype LD (Pearson genotypic correlation of variants)
    GTB File Name: /DATA/1kg.phase3.v5.shapeit2.eas.hg19.chr4.gtb
    GTB File Size: 31.340 MB
    Dimension of Genotype: 5732585 variants and 504 subjects
    Output File Name: /DATA/1kg.phase3.v5.shapeit2.eas.hg19.chr4.geno.gz
    Number of Parallel Threads: 1
    LD Method: Genotype LD (Pearson genotypic correlation of variants)
    Window Size: 100000 bp
    Filter: MAF >= 0.00100000, R^2 >= 0.000100000
> Calculated: 1986532 variants / 1986532 variants (100 %); Speed: 15460.2 variants/s
2023-03-31 05:00:45 INFO  [main] GBC - Command Line Interface 467524215 pairs of 1986532 variants have been calculated
2023-03-31 05:00:45 INFO  [main] GBC - Command Line Interface Total Processing time: 235.179 s; Output size: 3.217 GB

Please note that:

  1. gbc-stable-1.0.jar is not compatible with previous versions, so you need to rebuild the GTB archives for your files.
  2. In the latest version of GBC, the granularity of parallel computation for LD has been increased from GTB node blocks to chromosomes. Therefore, parallel computation does not work for a single chromosome input file.
  3. Since my input is a whole-genome file, I reduced the window size (-bp). As you can see, it still performed a considerable number of LD calculations (467524215 pairs of 1986532 variants).
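To see how strongly the window size drives the workload, the number of candidate pairs can be counted directly from sorted variant positions. The following is a rough sketch, not GBC code: the function name `count_pairs_in_window` is hypothetical, the evenly spaced positions are made-up illustration data, and it assumes a pair is in range when the two positions differ by at most the window size (GBC's exact boundary rule is not stated in this thread).

```python
def count_pairs_in_window(positions, window_bp):
    """Count pairs (i, j), i < j, with positions[j] - positions[i] <= window_bp.

    positions must be sorted in ascending order (one chromosome).
    Uses a two-pointer sweep, so it runs in linear time.
    """
    pairs = 0
    left = 0
    for right, pos in enumerate(positions):
        # Advance the left edge until it is within the window of the current variant.
        while pos - positions[left] > window_bp:
            left += 1
        # Every variant from left to right - 1 pairs with the current one.
        pairs += right - left
    return pairs


if __name__ == "__main__":
    # Hypothetical data: one variant every 100 bp across 1 Mb.
    positions = list(range(0, 1_000_000, 100))
    for window in (10_000, 100_000, 1_000_000):
        print(window, count_pairs_in_window(positions, window))
```

For evenly dense variants the pair count grows roughly in proportion to the window size until the window covers the whole chromosome, which is why a 5,000,000 bp window is so much heavier than a 100,000 bp one.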

Finally, the GBC paper has been accepted, and we have reset the software version to 1.0 to avoid confusion for our users. The previous version is now referred to as "v1.2, version for publication."

jerome-f commented 1 year ago

Thanks, this is great. I appreciate this very much.