EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Intermittent 'number of samples' error #190

Closed. EvanTheB closed this issue 4 years ago.

EvanTheB commented 5 years ago

4561:Error: The number of samples must match those listed in the header line. This occurs 1 time(s), first time in line 1070259.

I am getting this error intermittently, on the same file. Sometimes no error. Sometimes an error (on a different line).

VCF is gzipped.

The version is vcf_validator_linux from this GitHub releases page:

$ vcf_validator --version
vcf_validator version 0.9

I have visually inspected the file and cannot identify a reason for the error.

This command prints only one value, the expected number of columns (the fixed fields plus one per sample):

zcat ../reheader/MGRB.phase2.SNPtier12.match.vqsr.minrep.WGStier12.unrelated.nocancer.over70.21.vcf.gz | awk '/^[^#]/{print NF}' | uniq
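
A slightly more thorough variant of that check would compare every data line against the field count of the #CHROM header line, i.e. the 9 fixed columns (including FORMAT) plus one per sample. This is only a sketch, with my.vcf.gz standing in for the file above:

zcat my.vcf.gz | awk -F'\t' '/^#CHROM/ {expected=NF} /^[^#]/ && NF!=expected {print NR": "NF" fields, expected "expected}'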

Any ideas or further tests I can perform?

EDIT: As a further test, I ran vcf_validator under valgrind; it reported thousands of errors.
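
For anyone wanting to repeat the valgrind run, something along these lines works (a sketch, not necessarily my exact command; the --track-origins flag makes valgrind report where each uninitialised value originates):

valgrind --track-origins=yes vcf_validator -i my.vcf.gz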

jmmut commented 5 years ago

Thanks for the report. We have seen this a couple times before with big files, but couldn't fix it because we couldn't reproduce it properly. We only have a vague idea where the problem could be. Maybe the gzip decompression, maybe the parser, maybe both.

How big is your file? Does it really have 1070259 lines or more? How did you run the validator: vcf_validator -i your_file.vcf.gz, or zcat your_file.vcf.gz | vcf_validator?

Regarding the valgrind errors, it's possible that they are related to this issue, although most of them are errors in pthread and Boost, which are outside of our control. (Some time ago we went through the valgrind errors and fixed most of them; you can see that "All heap blocks were freed -- no leaks are possible".)

Is your VCF open-access data? I'm wondering whether you could send your file to us. I can provide an FTP location where you can upload it if it's big.

If nothing else works, you could always split your file and copy the header to each part (see the sketch below). Sorry for the inconvenience of this bug; there is always something more urgent to fix.
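
Something along these lines should do it (just a sketch with placeholder file names; the chunk size is arbitrary):

zcat my.vcf.gz | grep '^#' > header.txt
zcat my.vcf.gz | grep -v '^#' | split -l 1000000 - body_
for part in body_*; do cat header.txt "$part" > "$part.vcf"; done

Each resulting part is then a self-contained VCF that can be validated on its own.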

EvanTheB commented 5 years ago

The file is 100 GB, and it is not open access, sorry.

I am seeing this pretty consistently on files as small as 100 MB, though.

The valgrind errors I am seeing are conditional branches that depend on uninitialised memory. I would suggest that if your libraries are invoking undefined behaviour, you stop using them.

jmmut commented 5 years ago

Then, what about sharing one of those 100 MB VCFs? Any chance that you could anonymise it? Or at least describe the features of those VCFs in general terms: the number of samples, the fields used in INFO and FORMAT, whether it's mostly SNPs or mostly SVs, etc. I don't really know what we should look for.

I am the first one who would like to fix this bug, but without a VCF that reproduces it, trying to fix it is a rabbit hole with no guarantees. We saw this problem with some of our VCFs, but they almost never fail. We tried to create a big mock VCF, but it doesn't fail either. If we cannot reproduce it, we cannot fix it.


Regarding the branches on uninitialised data, stopping using those libraries (glibc, the STL and Boost) is not a realistic option. The errors valgrind points to happen either before or after main (glibc and the STL), or in logging or CLI argument parsing (Boost), which is very unlikely to be related to this bug. Some of them are even about pthread in glibc, and we are not using threads!

Without being sure that a specific problem is there, there's no point in going through the pain of replacing or reimplementing some of the most important and time-hardened C++ libraries.

jmmut commented 5 years ago

Just by accident we discovered that bgzip decompression might not be done properly, missing small parts of the file. This is likely to be the cause of this family of strange errors, like "number of samples doesn't match" or "id is not alphanumeric".

The workaround while we work on fixing bgzip decompression is to run the validator as zcat my.vcf.gz | vcf_validator. Decompressing beforehand also works, although it is less convenient for big files; see both forms below.
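
In concrete terms (my.vcf.gz being a placeholder for your file):

zcat my.vcf.gz | vcf_validator

or, decompressing first:

gunzip -c my.vcf.gz > my.vcf
vcf_validator -i my.vcf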

tcezard commented 4 years ago

This seems to be fixed; closing.