NCI-CGR / GwasQcPipeline

The CGR GWAS QC processing workflow.
https://nci-cgr.github.io/GwasQcPipeline/
MIT License

Fixes #240: convert alt case/control labels to Unknown #267

Closed jaamarks closed 8 months ago

jaamarks commented 8 months ago

Convert all samples' case/control status to "Unknown" when their label is not in [Case, Control, QC, Unknown].
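
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `ACCEPTED_LABELS` and `normalize_case_control` are hypothetical, and the sketch assumes the match is exact and case-sensitive, so that `unknown` and an empty value both count as non-accepted labels.

```python
# Hypothetical sketch; names and matching rules are assumptions, not the pipeline's API.
ACCEPTED_LABELS = {"Case", "Control", "QC", "Unknown"}


def normalize_case_control(label):
    """Return the label unchanged if it is accepted, otherwise "Unknown"."""
    if label in ACCEPTED_LABELS:
        return label
    return "Unknown"


# A few labels of the kind a sample sheet might contain (None stands for a missing value).
labels = ["Case", "unknown", "", "UNKNOWN", "Control", None]
normalized = [normalize_case_control(label) for label in labels]
print(normalized)
```

Accepted labels pass through untouched, so only the alternative spellings and missing values are rewritten.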

jaamarks commented 8 months ago

Should create a unit test for this.

jaamarks commented 8 months ago

> Should create a unit test for this.

For now, let's table the creation of the unit test for this.

In the data module cgr_gwas_qc/testing/data.py, the docstring explains that the TestData class holds a collection of very small test datasets sourced from the internet. These datasets are designed for testing specific functionalities such as file type conversion and upstream workflow components. It's important to note that these datasets are synthetic, so they will not work with many of the filtering steps and aren't compatible with various workflow parts.

Thus the qc_exclusion.py module isn't applied to the synthetic data. A meaningful test suite would require identifying an entirely different, appropriate test dataset and manually modifying some subjects to introduce non-accepted Case/Control statuses. A comprehensive test suite would then need to be written, and the data submodule updated to incorporate the new test data.

Though not intractable, this seems wholly impractical for creating a unit test for this particular edge-case functionality at the moment.



Testing process

To validate the new functionality, we conducted the following testing:

  1. Test Run Details: Executed the workflow on a test dataset that runs to completion, which allowed us to test the new functionality.

  2. Manual Modifications: After the workflow completed, we manually altered two of the files it created: sample_level/sample_qc.csv and subject_level/subject_qc.csv.

  3. Focus Area: Specifically, we adjusted entries in the case_control column to check that they are represented accurately in the final report tables (in delivery/QC_Report.docx).

  4. Illustration of Data Changes: Here's a snapshot of the original and modified data:

original sample_qc.csv
```
Sample_ID,Group_By_Subject_ID,num_samples_per_subject,analytic_exclusion,num_analytic_exclusion,analytic_exclusion_reason,is_subject_representative,subject_dropped_from_study,case_control,is_internal_control,is_sample_exclusion,is_user_exclusion,is_missing_idats,is_missing_gtc,Call_Rate_Initial,Call_Rate_1,Call_Rate_2,is_cr1_filtered,is_cr2_filtered,is_call_rate_filtered,IdatIntensity,Contamination_Rate,is_contaminated,replicate_ids,is_discordant_replicate,expected_sex,predicted_sex,X_inbreeding_coefficient,is_sex_discordant,AFR,EUR,ASN,Ancestry,identifiler_needed,identifiler_reason
G12-example,3,1,False,0,,True,False,Case,False,False,False,False,False,0.98011,0.91181,0.990971,False,False,False,,,,,,F,F,0.007621,False,0.0,1.0,0.0,European,False,
D03-example,64,1,False,0,,True,False,Case,False,False,False,False,False,0.9111,0.9111,0.9992315,False,False,False,,,,,,F,F,0.1226,False,0.0,0.981,0.019,European,False,
E03-example,14,1,False,0,,True,False,Unknown,False,False,False,False,False,0.991,0.9991,0.991,False,False,False,,,,,,F,F,-0.0111,False,0.0,0.911,0.051,European,False,
F12-example,16,1,False,0,,True,False,Control,False,False,False,False,False,0.981,0.991,0.998531,False,False,False,,,,,,F,F,0.02431,False,0.0,1.0,0.0,European,False,
F10-example,45,1,False,0,,True,False,Case,False,False,False,False,False,0.981,0.91,0.990101,False,False,False,,,,,False,F,F,0.01934,False,0.0,0.961,0.038,European,False,
C02-example,47,1,False,0,,True,False,Case,False,False,False,False,False,0.990185,0.998555,0.991,False,False,False,,,,,False,F,F,0.09198,False,0.016,0.0,0.991,East_Asian,False,
H07-example,756,1,False,0,,True,False,Case,False,False,False,False,False,0.990535,0.998914,0.9993183,False,False,False,,,,,False,F,F,0.04274,False,0.033,0.9163,0.004,European,False,
D08-example,76,1,False,0,,True,False,Case,False,False,False,False,False,0.97437,0.98271,0.98542,False,False,False,,,,,False,F,F,0.03824,False,0.0,0.91999,0.084,European,False,
E08-example,84,1,False,0,,True,False,Case,False,False,False,False,False,0.98969,0.998091,0.998727,False,False,False,,,,,,F,F,0.02293,False,0.0,0.981,0.041,European,False,
```
modified case_control in sample_qc.csv
```
Sample_ID,Group_By_Subject_ID,num_samples_per_subject,analytic_exclusion,num_analytic_exclusion,analytic_exclusion_reason,is_subject_representative,subject_dropped_from_study,case_control,is_internal_control,is_sample_exclusion,is_user_exclusion,is_missing_idats,is_missing_gtc,Call_Rate_Initial,Call_Rate_1,Call_Rate_2,is_cr1_filtered,is_cr2_filtered,is_call_rate_filtered,IdatIntensity,Contamination_Rate,is_contaminated,replicate_ids,is_discordant_replicate,expected_sex,predicted_sex,X_inbreeding_coefficient,is_sex_discordant,AFR,EUR,ASN,Ancestry,identifiler_needed,identifiler_reason
G12-example,3,1,False,0,,True,False,unknown,False,False,False,False,False,0.98011,0.91181,0.990971,False,False,False,,,,,,F,F,0.007621,False,0.0,1.0,0.0,European,False,
D03-example,64,1,False,0,,True,False,,False,False,False,False,False,0.9111,0.9111,0.9992315,False,False,False,,,,,,F,F,0.1226,False,0.0,0.981,0.019,European,False,
E03-example,14,1,False,0,,True,False,UNKNOWN,False,False,False,False,False,0.991,0.9991,0.991,False,False,False,,,,,,F,F,-0.0111,False,0.0,0.911,0.051,European,False,
F12-example,16,1,False,0,,True,False,unknown,False,False,False,False,False,0.981,0.991,0.998531,False,False,False,,,,,,F,F,0.02431,False,0.0,1.0,0.0,European,False,
F10-example,45,1,False,0,,True,False,"",False,False,False,False,False,0.981,0.91,0.990101,False,False,False,,,,,False,F,F,0.01934,False,0.0,0.961,0.038,European,False,
C02-example,47,1,False,0,,True,False,Case,False,False,False,False,False,0.990185,0.998555,0.991,False,False,False,,,,,False,F,F,0.09198,False,0.016,0.0,0.991,East_Asian,False,
H07-example,756,1,False,0,,True,False,Case,False,False,False,False,False,0.990535,0.998914,0.9993183,False,False,False,,,,,False,F,F,0.04274,False,0.033,0.9163,0.004,European,False,
D08-example,76,1,False,0,,True,False,Case,False,False,False,False,False,0.97437,0.98271,0.98542,False,False,False,,,,,False,F,F,0.03824,False,0.0,0.91999,0.084,European,False,
E08-example,84,1,False,0,,True,False,Case,False,False,False,False,False,0.98969,0.998091,0.998727,False,False,False,,,,,,F,F,0.02293,False,0.0,0.981,0.041,European,False,
```

  5. Data Integrity Consideration:
    • The data in the snapshots above have been altered so that the samples are de-identified. In particular, we changed the IDs and numerical values across all example data.


This approach was taken to test the functionality under several conditions in which samples are labeled something other than [Case, Control, QC, Unknown].
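
A check of this kind could also be scripted instead of inspecting the report by hand. The sketch below is only an illustration under stated assumptions: it uses a made-up, two-column stand-in for sample_qc.csv, a hypothetical helper named `tally_case_control`, and an exact, case-sensitive match against the accepted labels. It does not call any pipeline code or read the real files.

```python
import csv
import io
from collections import Counter

# Assumption: the accepted labels are matched exactly and case-sensitively.
ACCEPTED = {"Case", "Control", "QC", "Unknown"}

# Made-up, reduced stand-in for sample_qc.csv (the real file has many more columns).
SAMPLE_QC = """Sample_ID,case_control
S1,unknown
S2,
S3,UNKNOWN
S4,Case
S5,Control
"""


def tally_case_control(csv_text):
    """Count samples per case/control group after mapping non-accepted labels to "Unknown"."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["case_control"]
        counts[label if label in ACCEPTED else "Unknown"] += 1
    return counts


counts = tally_case_control(SAMPLE_QC)
print(dict(counts))
```

The resulting tallies can then be compared against the case/control counts shown in the report tables.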