EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0
130 stars 39 forks source link

display warning if the input VCF is compressed #111

Closed srbcheema1 closed 6 years ago

srbcheema1 commented 6 years ago

This PR adds the feature to display warning on providing compressed VCF file as STDIN or using -i parameter. In reference to the issue #104

srbcheema1 commented 6 years ago

@cyenyxe @jmmut
Added a function to check if a file is compressed or not This function prints a warning message if the file is compressed. Please review

srbcheema1 commented 6 years ago

I have made the requested the changes to validator.cpp file and also extracted strings to string_constants.hpp file. Tests are yet to be made. Have a look at the changes. Once the changes get approved I will squash the commits into one.

srbcheema1 commented 6 years ago

I have made the requested changes. Please have a look.

srbcheema1 commented 6 years ago

@jmmut I want to discuss something regarding #107 issue . there are tools in boost iostream library to read bz2 and gzip compression formats . what about returning compression type from is_compressed_file and this would help us in reading the compressed files .

jmmut commented 6 years ago

I wouldn't worry about #107. For now, just make simple code that does one thing, and when we start that issue, we'll deal with any needed refactoring.

Really. One of the biggest lessons I learnt in recent years is how important is to make simple code. One problem at a time. Usually trying to be too clever makes things worse.

Let's finish this issue first, I'd say only the tests are left. Let me know if you have any doubt on how to design the tests.

srbcheema1 commented 6 years ago

I can figure out that i need to add testcase to test/vcf/parser_v*_.test.cpp file. Am I going in right direction ?

jmmut commented 6 years ago

There are 2 types of tests we can do here.

the integration ones are almost done. Just rename the compressed files you added from test/input_files/v4.3/compressed_files/failed_body_alt_000.vcf.* to test/input_files/v4.3/failed/compressed.vcf.*. that way, the test/vcf/parser_v43_test.cpp will try to do a whole validation on those files, and expect them to fail. You don't need to modify test/vcf/parser_v43_test.cpp.

the unit tests should use the functions and check that they return true and false appropiately. For that, you'll have to move the declarations to validator.hpp (I didn't realise they were in the cpp until now). You can add a simple test file, taking as example ploidy_test.cpp, for instance. Testing the extension function is easy. To test the magic number, you might need to write into a stringstream those bytes, but should be easy as well.

srbcheema1 commented 6 years ago

I have made a file tests/vcf/compressed_files_test.cpp . In that i have added tests to test is_compressed_magic_num() and is_compressed_extension I have renamed those compressed files . I haven't yet moved those files into failed folder. By keeping them in seperate folder it was easy for me to write tests for files that return true on is_compressed_*()

srbcheema1 commented 6 years ago

I have added the tests for non compressed files also. I have also added a special test to check working of seekg().

srbcheema1 commented 6 years ago

@jmmut tell me if I should squash the commits. I will squash similar commits into one once you approve the changes.

srbcheema1 commented 6 years ago

@jmmut there was a minor issue with seekg . It was behaving abnormally for files with number of characters less than 9 characters. I fixed it as explained in https://stackoverflow.com/questions/48979934/seekg-not-behaving-as-expected using input.clear() please have a look at it. I am looking forward to get this PR merged.

jmmut commented 6 years ago

hm, that is exactly the issue I was afraid of. what's more, it's not fixed. if you try this:

./builds/debug/bin/vcf_validator -i test/input_files/v4.3/passed/complexfile_passed_000.vcf

it will say it's valid, but if you pass it through a pipe instead, it will say it's not valid:

cat test/input_files/v4.3/passed/complexfile_passed_000.vcf | ./builds/debug/bin/vcf_validator

I suggest leaving the approach of seekg and using a copy of the first line as done in https://github.com/EBIvariation/vcf-validator/blob/develop/src/vcf/validator.cpp#L153. there you can see that the first line is read, and it's used for detecting the version in line 145, and to start the validation in line 153.

srbcheema1 commented 6 years ago

tests were passing. I will check that out. I too was thinking to use first line https://github.com/EBIvariation/vcf-validator/pull/111#discussion_r170569585 earlier when you mentioned about that in https://github.com/EBIvariation/vcf-validator/pull/111#discussion_r166876466 I searched such function in validator.hpp(maybe you made typo while writingvalidator.cpp) but was unable to find one. by that time i got approach of seekg as i have used it in filestreams but i was doubtful for using it in std::cin .

jmmut commented 6 years ago

yes, that's what I meant, sorry for not being clear enough. I'll try to be more specific in the future, and please ask any doubt you have.

srbcheema1 commented 6 years ago

I have tested it using these commands

./build/bin/vcf_validator < test/input_files/v4.3/passed/complexfile_passed_000.vcf

and this one

./build/bin/vcf_validator -i test/input_files/v4.3/passed/complexfile_passed_000.vcf

I use command < filename more than cat filename| command

maybe thats why these things went unnoticed . moreover our testcases are not capable to check this bug out .

jmmut commented 6 years ago

Didn't know cat file | and < file could behave so differently. I used cat file | because if the file is compressed the way to use it is simpler and more similar, as zcat file.gz |.

Agree that the tests (as they are right now) can not check this. This is another reason to go away from stream handling.

srbcheema1 commented 6 years ago

hmm strange it is. Till now i thought both the ways similar to giving standard input

srbcheema1 commented 6 years ago

@jmmut I have made the requested changes. Please have a look at it.