Purpose: This is a draft PR. It allows the "NaN", "nan" and "NA" strings in the allele columns of the vcf and maf datasets, because these are valid allele combinations.
Changes:
_convert_values_to_na in genie/transform.py: a new helper function that converts all occurrences of the given values in a dataframe to NA. It is used in the _get_dataframe methods of vcf.py and maf.py.
I decided to allow the "NaN", "nan" and "NA" strings in the _get_dataframe method because this needs to happen for both validation and processing, and both use this method. I also had to add some special handling for maf files because they allow case-insensitive column names. The in-depth reasoning for allowing "NaN", "nan" and "NA" strings for ALL of the dataset, then converting them back to NA in the non-allele columns, can be found in the comments of the JIRA ticket. In summary, it is difficult to make read_csv convert only specific columns, especially since it already has a number of arguments for NA-specific handling, and the order of operations comes into play here. See the docs for more details.
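The two-step approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names, their signatures, the allele column list, and the reduced NA string list are all assumptions made for the example. It reads every column with "NaN", "nan" and "NA" kept as literal strings, then converts those strings back to NA only in the non-allele columns, matching column names case-insensitively.

```python
import io

import pandas as pd

# Assumption: illustrative allele column names, not taken from this PR.
ALLELE_COLUMNS = ["REF", "ALT", "Reference_Allele", "Tumor_Seq_Allele1"]
# Strings that must survive as literal values in allele columns.
ALLOWED_ALLELE_STRINGS = ["NaN", "nan", "NA"]
# Assumption: a reduced NA list for the example; a real list would be longer.
NA_STRINGS = ["", "N/A", "NULL", "null"]


def convert_values_to_na_sketch(df, values_to_replace, columns_to_convert):
    """Hypothetical helper: set the given values to NA in selected columns."""
    out = df.copy()
    for col in columns_to_convert:
        # mask() replaces entries where the condition is True.
        out[col] = out[col].mask(out[col].isin(values_to_replace), pd.NA)
    return out


def read_allowing_allele_strings(buffer):
    """Hypothetical reader: keep the strings everywhere, then undo selectively."""
    # Step 1: disable the default NA strings so "NaN", "nan" and "NA"
    # are read as text in every column.
    df = pd.read_csv(
        buffer,
        sep="\t",
        dtype=str,
        keep_default_na=False,
        na_values=NA_STRINGS,
    )
    # Step 2: convert them back to NA in non-allele columns only,
    # comparing names case-insensitively (relevant for maf files).
    allele_lower = {name.lower() for name in ALLELE_COLUMNS}
    non_allele = [c for c in df.columns if c.lower() not in allele_lower]
    return convert_values_to_na_sketch(df, ALLOWED_ALLELE_STRINGS, non_allele)


sample = "CHROM\tref\tALT\tNOTE\n1\tNA\tnan\tNA\n2\tA\tNaN\tok\n"
result = read_allowing_allele_strings(io.StringIO(sample))
```

In this sample the lowercase ref column and the ALT column keep their "NA", "nan" and "NaN" strings, while the "NA" in the NOTE column becomes a missing value.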
Testing:
[X] Updated and ran pytests locally
[X] Updated the maf and vcf test files in the test Synapse project to include NaN, nan and NA as valid allele values, and ran the genie pipeline all the way through locally (integration test)