EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0
129 stars 39 forks source link

Added single-line duplicate values check for ID #63

Closed Anishka0107 closed 7 years ago

Anishka0107 commented 7 years ago

This PR aims to partly fix the issue https://github.com/EBIvariation/vcf-validator/issues/35 According to VCF 4.3 specifications, ID cannot have duplicate values, so added a new check for duplicate values. If any duplicated IDs are found in a single line (semicolon separated list), then an error is thrown and file is declared as invalid.

Anishka0107 commented 7 years ago

@jmmut Thanks for the review sir. I have made a few changes as you suggested, but now the Travis build is failing and I am unable to understand why that is happening. The test was successful locally.

jorizci commented 7 years ago

It's failing because...

/home/travis/build/EBIvariation/vcf-validator/src/vcf/record.cpp:125:21: error: indirection requires pointer operand ('const std::basic_string' invalid) counter[id]++; ^~~ /home/travis/build/EBIvariation/vcf-validator/src/vcf/record.cpp:126:25: error: indirection requires pointer operand ('const std::basic_string' invalid) if (counter[id] >= 2) {

jmmut C++ is fresher than mine, but I think auto & should return you a std:string directly in id and maybe you don't need the *.

jmmut commented 7 years ago

oh, yes, id is not a pointer anymore, so removing the * fixes it. If your tests passed, maybe you forgot to include in the commit all the changes?

Anishka0107 commented 7 years ago

Thanks a lot for the help @jorizci @jmmut Jose sir, actually I forgot to include all changes in my commit! Will be careful from the next time about silly mistakes :) Added a new commit. What should I do next?

jorizci commented 7 years ago

@Anishka0107 Just a little tip for any git-related project. If you have a situation where part of the changes haven't been commited (It just happens) you can use the commit ammend functionality (In command line you do a git add files where missing changes are and then use the git commit --amend and do a push force to your branch) that way the historic stays cleaner and you can avoid having a lot of commits with "missing ..." Anyway just wait for @jmmut to check everything , approve the pr and merge the code the repository.

jmmut commented 7 years ago

Don't worry, I guessed that because I've done that dozens of times. I actually forgot to ask you this morning to add a test for the new feature, sorry.

Under the test/vcf folder we have the tests. Some of them are unitary (for instance record_test.cpp) and other are integration tests (such as parser_v43_test.cpp).

For this PR just add another SECTION clause right after this one https://github.com/EBIvariation/vcf-validator/blob/develop/test/vcf/record_test.cpp#L166, which is very similar, and change the IDs so that it checks the new feature. We are using Catch (tutorial, assertions)

Anishka0107 commented 7 years ago

@jorizci That's really cool. Thanks for the help:) @jmmut Working on it sir.

Anishka0107 commented 7 years ago

@jmmut I have added the test.

cyenyxe commented 7 years ago

There is only one more test to implement, in order to check that a whole application run flags this kind of error. If you go to test/input_files/ you will find one subfolder per version of the specification, and inside that one, a failed subfolder containing test VCF files.

These are named according to the section and field they validate (meta_alt, body_chrom, body_id). In this case you need to do the following:

  1. Create a new file failed_body_id_003.vcf inside each test/input_files/v4.x/failed folder. You can use another file in the same "series" as reference
  2. Make the ID field have a duplicate value
  3. Indicate what is wrong in the CauseOfFailure message
  4. The test suite will take care of the rest
Anishka0107 commented 7 years ago

@cyenyxe I have added 3 test files with duplicate IDs.

cyenyxe commented 7 years ago

Looks great, thanks! Other than the comment below I think this is ready to merge 😄

Taking previous experience into account, most of the times the ID field contains only one value. Could you please add a condition to avoid creating the std::map and running the validation in that case?

Anishka0107 commented 7 years ago

@cyenyxe Added the condition ma'am. For it to look simple, I am checking for the size of the vector, is that okay? Or should I use count algorithm of STL or something else.

cyenyxe commented 7 years ago

I think just checking the size is the best / simplest solution.