dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0

cannot join g.vcf files using glnexus #264

Closed kostasgalexiou closed 3 years ago

kostasgalexiou commented 3 years ago

Hi,

When I try to merge gvcf files generated with deepvariant, using glnexus, I get the following error:

[W::bgzf_read_block] EOF marker is absent. The input is probably truncated
Error: BCF read error

Below is the command that I am using:

./glnexus_cli \
    --config DeepVariantWGS \
    SQU1.sort.pcr_rem.RG.kept.g.vcf.gz SQU3.sort.pcr_rem.RG.kept.g.vcf.gz | \
    bcftools view - | \
    bgzip -c > SQU13.vcf.gz

My gvcf files look normal and the log file for gvcf generation did not give any errors.

Any help will be appreciated.

Thanks

mlin commented 3 years ago

The mentioned "EOF marker" consists of the following 28 hex bytes. Can you try tail -c 28 SQU1.sort.pcr_rem.RG.kept.g.vcf.gz | xxd -p (and your other file too) to confirm if it's present or not:

1f8b08040000000000ff0600424302001b0003000000000000000000

This will just adjudicate whether the message displayed is true or not.
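
As a sketch of that check for several files at once, something like the following could be used. It assumes a POSIX shell; `has_bgzf_eof` is a hypothetical helper name (not part of GLnexus or htslib), and `od` stands in for `xxd -p` so that only standard tools are needed.

```shell
#!/bin/sh
# The 28-byte BGZF EOF marker quoted above, as lowercase hex.
BGZF_EOF=1f8b08040000000000ff0600424302001b0003000000000000000000

# has_bgzf_eof FILE (hypothetical helper): succeeds if FILE ends with the marker.
has_bgzf_eof() {
    # Dump the last 28 bytes as hex and strip the spaces and newlines od adds.
    last=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$last" = "$BGZF_EOF" ]
}

# Check every file given on the command line.
for f in "$@"; do
    if has_bgzf_eof "$f"; then
        echo "$f: EOF marker present"
    else
        echo "$f: EOF marker MISSING"
    fi
done
```

Saved under an assumed name such as check_eof.sh, it would be run as: sh check_eof.sh SQU1.sort.pcr_rem.RG.kept.g.vcf.gz SQU3.sort.pcr_rem.RG.kept.g.vcf.gz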

mlin commented 3 years ago

BTW, please also check the log for other error messages preceding that one. It's possible the one you showed actually comes from your bcftools view - command complaining because GLnexus' output is truncated due to some other error.

kostasgalexiou commented 3 years ago

Dear @mlin,

Thanks for your fast reply. I have checked all my gvcf files and all of them contain the EOF marker.

mlin commented 3 years ago

Please show more of the log leading up to the error message

kostasgalexiou commented 3 years ago

./glnexus_cli --config DeepVariantWGS SQU1.sort.pcr_rem.RG.kept.g.vcf.gz SQU3.sort.pcr_rem.RG.kept.g.vcf.gz | bcftools view - | bgzip -c > SQU13.vcf.gz
[29743] [2021-07-03 17:58:32.006] [GLnexus] [info] glnexus_cli release v1.2.6-0-g4d057dc Wed May 20 22:14:49 2020
[29743] [2021-07-03 17:58:32.006] [GLnexus] [warning] jemalloc absent, which will impede performance with high thread counts. See https://github.com/dnanexus-rnd/GLnexus/wiki/Performance
[29743] [2021-07-03 17:58:32.006] [GLnexus] [info] Loading config preset DeepVariantWGS
[29743] [2021-07-03 17:58:32.011] [GLnexus] [info] config:
unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 10
  min_AQ2: 10
  min_GQ: 0
  max_alleles_per_site: 32
  monoallelic_sites_for_lost_alleles: true
  preference: common
genotyper_config:
  revise_genotypes: true
  min_assumed_allele_frequency: 9.99999975e-05
  required_dp: 0
  allow_partial_data: true
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  more_PL: true
  squeeze: false
  trim_uncalled_alleles: true
  output_format: BCF
  liftover_fields:

kostasgalexiou commented 3 years ago

Hi @mlin,

Sorry for closing the issue. It still remains open.

Regards

mlin commented 3 years ago

Please try writing the glnexus_cli output to disk (without the subsequent shell pipeline) just to see what happens then or if there's a different error message. Something appears to be stopping it prematurely in the genotyping stage (which comes after it will have read in the complete GVCF inputs -- I'm fairly certain the "EOF marker is absent" is coming from bcftools in the pipeline). The most common cause of that is the OS out-of-memory killer, but there's typically a message like "Killed" in that case so I'm not sure if that's the case here.
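
A minimal sketch of that isolation step, assuming a POSIX shell. `run_to_file` is a hypothetical helper, not part of GLnexus; it writes a command's standard output to a file and reports how the command ended. In common shells such as bash and dash, an exit status of 137 (128 + signal 9) means the process received SIGKILL, which is the signal the Linux out-of-memory killer sends.

```shell
#!/bin/sh
# run_to_file OUTFILE CMD [ARGS...] (hypothetical helper):
# run CMD with stdout redirected to OUTFILE and report its exit status.
run_to_file() {
    out=$1
    shift
    status=0
    "$@" > "$out" || status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok: wrote $out" >&2
    elif [ "$status" -eq 137 ]; then
        # 137 = 128 + 9: terminated by SIGKILL (for example by the OOM killer).
        echo "killed by SIGKILL (possibly out of memory)" >&2
    else
        echo "failed with exit status $status" >&2
    fi
    return "$status"
}

# Usage with the command from this thread (output file name is an assumption):
# run_to_file SQU13.bcf ./glnexus_cli --config DeepVariantWGS \
#     SQU1.sort.pcr_rem.RG.kept.g.vcf.gz SQU3.sort.pcr_rem.RG.kept.g.vcf.gz
```

Running the stages separately like this means any message printed by bcftools or bgzip can no longer be mistaken for one from glnexus_cli.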

kostasgalexiou commented 3 years ago

Hi @mlin,

Looks like it is a memory failure...

[16357] [2021-07-05 09:34:42.832] [GLnexus] [info] glnexus_cli release v1.2.6-0-g4d057dc Wed May 20 22:14:49 2020
[16357] [2021-07-05 09:34:42.832] [GLnexus] [warning] jemalloc absent, which will impede performance with high thread counts. See https://github.com/dnanexus-rnd/GLnexus/wiki/Performance
[16357] [2021-07-05 09:34:42.832] [GLnexus] [info] Loading config preset DeepVariantWGS
[16357] [2021-07-05 09:34:43.050] [GLnexus] [info] config:
unifier_config:
  drop_filtered: false
  min_allele_copy_number: 1
  min_AQ1: 10
  min_AQ2: 10
  min_GQ: 0
  max_alleles_per_site: 32
  monoallelic_sites_for_lost_alleles: true
  preference: common
genotyper_config:
  revise_genotypes: true
  min_assumed_allele_frequency: 9.99999975e-05
  required_dp: 0
  allow_partial_data: true
  allele_dp_format: AD
  ref_dp_format: MIN_DP
  output_residuals: false
  more_PL: true
  squeeze: false
  trim_uncalled_alleles: true
  output_format: BCF
  liftover_fields:

kostasgalexiou commented 3 years ago

I suppose I can run the analysis per chromosome.

mlin commented 3 years ago

Try setting the command-line option --mem-gbytes 1 or something else conservative. It's a trade-off where (at scale) it will be faster with more memory at its disposal, short (obviously) of getting killed.
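
One way to pick "something conservative" could be sketched as below, assuming a Linux host and a POSIX shell. `mem_budget_gb` is a hypothetical helper, and taking half of the available memory is an arbitrary illustrative choice rather than a GLnexus recommendation.

```shell
#!/bin/sh
# mem_budget_gb AVAILABLE_KB (hypothetical helper): half of the available
# memory in whole gigabytes, never less than 1. The factor of one half is an
# arbitrary illustrative choice, not a GLnexus recommendation.
mem_budget_gb() {
    gb=$(( $1 / 1024 / 1024 / 2 ))
    if [ "$gb" -lt 1 ]; then
        gb=1
    fi
    echo "$gb"
}

# On Linux, MemAvailable in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    echo "suggested: --mem-gbytes $(mem_budget_gb "${avail_kb:-0}")"
fi
```

The printed value would then be passed to glnexus_cli as --mem-gbytes N, ahead of the gVCF file names.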

kostasgalexiou commented 3 years ago

It has worked with the --mem-gbytes 1 argument! Thanks a lot for the help and your prompt replies!