so that they are general enough to be used by both the VCF and MAF file formats (and any relevant future file formats). I separated retrieving the rows with invalid allele values from building the error message. This makes testing easier, and it also makes it easier to implement row-based validation in the future if we want it (or to keep our current higher-level validation).
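A rough sketch of that split, assuming pandas. The function names, the REF column, and the regex are illustrative assumptions, not the actual code in genie/validate.py or the real list of accepted alleles:

```python
import pandas as pd


def get_invalid_allele_rows(df: pd.DataFrame, col: str, allowed_pattern: str) -> pd.Index:
    """Return the index labels of rows whose `col` value fails `allowed_pattern`.

    Hypothetical helper. Assumes `col` holds strings; missing values count
    as invalid here because `na=False` turns them into non-matches.
    """
    is_valid = df[col].str.match(allowed_pattern, na=False).to_numpy(dtype=bool)
    return df.index[~is_valid]


def get_allele_validation_message(invalid_rows: pd.Index, col: str, file_type: str) -> str:
    """Build the error text separately so each piece can be tested on its own."""
    if len(invalid_rows) == 0:
        return ""
    return f"{file_type}: Your {col} column has invalid allele values.\n"


# Illustrative data and pattern only.
df = pd.DataFrame({"REF": ["A", "ACGT", "X", "-"]})
invalid = get_invalid_allele_rows(df, "REF", r"^[ACGT-]+$")
message = get_allele_validation_message(invalid, "REF", "vcf")
```

Because the first function returns row labels rather than a message, a row-based validator could reuse it directly, while the file-level validation only needs the second function's string.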
Special Considerations:
NA handling: Based on our discussion, we will wait to handle NAs properly (the default for now is to flag NAs as invalid). Because I'm using regex to match the allele values, any NAs or missing values in the columns would cause problems. I think it's currently enforced that the allele columns can't be empty in these file formats, but I added the allow_na parameter so that NAs are flagged as invalid; otherwise the code will break when using pandas.str.match. We also need special handling in this function to validate a column that contains only NA values...
I talked more about the caveats of handling NAs here
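To make the NA caveat concrete, here is a minimal sketch, assuming pandas. The helper name match_alleles and the pattern are hypothetical; only the allow_na idea comes from this PR. It guards the all-NA case first, since a column holding only NaN is usually float64 and has no .str accessor, and it passes na= to str.match so missing values become an explicit True or False:

```python
import numpy as np
import pandas as pd


def match_alleles(values: pd.Series, allowed_pattern: str, allow_na: bool = False) -> pd.Series:
    """Return a boolean Series marking which values are acceptable.

    Hypothetical helper, not the function added in this PR.
    """
    if values.isna().all():
        # An all-NA column never reaches .str, so decide its validity directly.
        return pd.Series(allow_na, index=values.index, dtype=bool)
    # `na=` fills the result for missing values instead of leaving them as NA.
    matched = values.str.match(allowed_pattern, na=allow_na)
    return matched.astype(bool)


pattern = r"^[ACGT-]+$"  # illustrative only
mixed = pd.Series(["A", np.nan, "Z"])
all_na = pd.Series([np.nan, np.nan])

strict = match_alleles(mixed, pattern)
lenient = match_alleles(mixed, pattern, allow_na=True)
empty_col = match_alleles(all_na, pattern)
```

With allow_na left at False, the missing value and the all-NA column are both reported as invalid, which matches the default described above.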
Testing:
Unit tests were written to test the general allele validation functions.
Maf and Vcf classes' tests were adjusted to account for the new allele validation.