Sage-Bionetworks / Genie

Validation and processing of GENIE files
https://genie.synapse.org/
MIT License

[GEN-809] Validate allele columns #539

Closed rxu17 closed 11 months ago

rxu17 commented 11 months ago

Purpose: There is currently no allele validation, so this PR adds it for the VCF and MAF file formats. Each format has its own list of accepted alleles.

Changes: I added two general functions to genie/validate.py, one for allele validation and one for the allele validation message, so that they can be used by both the VCF and MAF file formats (and any relevant future file formats). I separated retrieving the rows with invalid allele values from retrieving the error message to make testing easier, and to make row-based validation easier to implement in the future if we want it (or we can keep our more high-level validation).
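As a rough illustration of that split, here is a minimal sketch that assumes pandas. The function names, the regex, the column name, and the message wording are all hypothetical placeholders, not the functions actually added in this PR:

```python
import pandas as pd

# Hypothetical pattern for illustration only; the PR keeps a separate
# accepted-allele list for each file format.
EXAMPLE_ALLELE_PATTERN = r"^[ACGT-]+$"


def get_invalid_allele_rows(df, col, pattern):
    """Return the index of rows whose value in `col` does not match `pattern`."""
    # Casting to str sidesteps missing values in this sketch; NA handling
    # is a separate concern.
    is_valid = df[col].astype(str).str.match(pattern)
    return df.loc[~is_valid].index


def get_allele_validation_message(invalid_rows, filename, col):
    """Build the error message from the invalid rows; empty string if none."""
    if len(invalid_rows) == 0:
        return ""
    return (
        f"{filename}: Your {col} column has invalid allele values "
        f"in {len(invalid_rows)} row(s).\n"
    )


# Illustrative data; the column name is a placeholder.
maf = pd.DataFrame({"Reference_Allele": ["A", "ACGT", "X", "-"]})
invalid = get_invalid_allele_rows(maf, "Reference_Allele", EXAMPLE_ALLELE_PATTERN)
message = get_allele_validation_message(invalid, "maf", "Reference_Allele")
print(list(invalid))
print(message)
```

Keeping the row lookup separate from the message means a test can assert on the returned index directly, without parsing strings.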

Special Considerations: NA handling: Based on our discussion, we will wait to handle NAs (the default solution right now is to flag NAs as invalid). Because I'm using regex to match on the allele values, I run into an issue with any NAs or missing values in the columns. I think it's currently enforced that the allele columns can't be empty in these file formats, but I added the allow_na parameter to flag any NAs as invalid; otherwise the code breaks when using pandas.Series.str.match. We also need special handling in this function to validate a column with all NA values...
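To make the NA issue concrete, here is a small sketch of how `Series.str.match` treats missing values. The helper `is_valid_allele` and the regex are hypothetical; only the `allow_na` name comes from this PR, and the way it is wired to the `na=` argument is an assumption, not necessarily how the PR implements it:

```python
import pandas as pd

pattern = r"^[ACGT-]+$"  # hypothetical pattern for illustration
alleles = pd.Series(["A", None, "TT", "N"], dtype="object")

# Without `na=`, a missing value yields a missing result rather than
# True/False, so the result cannot be used directly as a boolean mask.
raw = alleles.str.match(pattern)
raw_is_missing = raw.isna().tolist()


def is_valid_allele(series, pattern, allow_na=False):
    """Boolean mask of valid values; NAs count as valid only if allow_na."""
    # Assumption: a column holding only missing values may be loaded with a
    # non-string dtype, where `.str` is unavailable, so cast to object first.
    as_object = series.astype("object")
    return as_object.str.match(pattern, na=allow_na).astype(bool)


strict = is_valid_allele(alleles, pattern).tolist()
lenient = is_valid_allele(alleles, pattern, allow_na=True).tolist()

# A column with only missing values, as a float column would hold them.
all_na = pd.Series([float("nan"), float("nan")])
all_na_result = is_valid_allele(all_na, pattern).tolist()
print(raw_is_missing, strict, lenient, all_na_result)
```

Passing a boolean to `na=` keeps the result a plain boolean mask, which is what row filtering needs.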

I talked more about the caveats of handling NAs here

Testing: