We need a preprocessing script that does some basic quality checks on our incoming datasets. @Reeya123 we'll have to have a call to go over the requirements. Right now my list of features that would be helpful (rough Python sketches of a few of these checks follow the list):
[x] make sure the correct case formatting is used for various fields
[x] `biomarker` field first word lowercase
[x] `best_biomarker_role` lowercase (rows with multiple roles will be formatted like `role1;role2`, so you'll have to split before checking)
[x] `specimen` (if present) lowercase
[x] `evidence_source` resource (before the colon) should be title case (e.g. in Python that would be `string.title()`)
[x] some basic temp ID checking (have a flag to tell the script whether to expect panel biomarkers or not in the dataset)
[x] if no panel biomarkers, all rows with the same `id` field should have the exact same values for the following fields: `biomarker`, `assessed_biomarker_entity`, `assessed_biomarker_entity_id`, `assessed_entity_type`, `condition`, and `condition_id`
[x] check expected data formats
[x] `assessed_biomarker_entity_id` should be in the format of `resource:id`
[x] `condition_id` should be in the same format as above
[x] if both `exposure_agent` and `exposure_agent_id` are not present, then `condition` and `condition_id` are required, and vice versa
[x] check the `best_biomarker_role`, `biomarker`, and `assessed_entity_type` fields against lists of standardized terminology; for the role field the allowed terms are risk, diagnostic, monitoring, prognostic, predictive, response, or safety (remember the role field can hold multiple values split by a semicolon); add a JSON config file for the standardized terminology and flagging
[x] Flag duplicate rows
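Here's a minimal sketch of what the per-row case and format checks could look like. The role vocabulary and the `resource:id` regex are assumptions for illustration; in the real script both would come from the JSON terminology config mentioned above:

```python
import re

# Assumed role vocabulary and resource:id pattern -- in practice these
# would be loaded from the JSON terminology config, not hardcoded.
ALLOWED_ROLES = {"risk", "diagnostic", "monitoring", "prognostic",
                 "predictive", "response", "safety"}
RESOURCE_ID_PATTERN = re.compile(r"^[^:\s]+:\S+$")

def check_row(row: dict) -> list[str]:
    """Return human-readable issues found in a single CSV row."""
    issues = []

    # biomarker: first word should be lowercase
    first_word = (row.get("biomarker") or "").split(" ", 1)[0]
    if first_word != first_word.lower():
        issues.append(f"biomarker first word not lowercase: {first_word!r}")

    # best_biomarker_role: lowercase; may hold several roles split by ';'
    for role in (row.get("best_biomarker_role") or "").split(";"):
        role = role.strip()
        if role and role != role.lower():
            issues.append(f"role not lowercase: {role!r}")
        if role and role.lower() not in ALLOWED_ROLES:
            issues.append(f"role not in standardized terminology: {role!r}")

    # specimen: lowercase if present
    specimen = row.get("specimen") or ""
    if specimen != specimen.lower():
        issues.append(f"specimen not lowercase: {specimen!r}")

    # evidence_source: resource prefix (before the colon) should be title case
    evidence = row.get("evidence_source") or ""
    if ":" in evidence:
        resource = evidence.split(":", 1)[0]
        if resource != resource.title():
            issues.append(f"evidence_source resource not title case: {resource!r}")

    # assessed_biomarker_entity_id / condition_id: resource:id format
    for field in ("assessed_biomarker_entity_id", "condition_id"):
        value = row.get(field) or ""
        if value and not RESOURCE_ID_PATTERN.match(value):
            issues.append(f"{field} not in resource:id format: {value!r}")

    # condition(_id) and exposure_agent(_id): at least one pair is required
    has_exposure = row.get("exposure_agent") and row.get("exposure_agent_id")
    has_condition = row.get("condition") and row.get("condition_id")
    if not (has_exposure or has_condition):
        issues.append("neither exposure_agent(_id) nor condition(_id) present")

    return issues
```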
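And one possible approach for the temp ID consistency and duplicate checks that stays memory-friendly on large files by hashing row contents instead of storing them (`check_ids_and_duplicates` and `fingerprint` are made-up names, just for illustration):

```python
import csv
import hashlib

# Fields that must match exactly across all rows sharing the same id
# (only enforced when the dataset has no panel biomarkers)
CONSISTENT_FIELDS = ("biomarker", "assessed_biomarker_entity",
                     "assessed_biomarker_entity_id", "assessed_entity_type",
                     "condition", "condition_id")

def fingerprint(values) -> str:
    # Hash instead of keeping full values so memory stays small on big files;
    # the unit separator avoids accidental collisions when joining
    joined = "\x1f".join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def check_ids_and_duplicates(path: str, expect_panels: bool = False):
    """Stream the file once; flag duplicate rows and inconsistent ids."""
    id_fingerprints = {}    # id -> fingerprint of CONSISTENT_FIELDS
    seen_rows = set()       # fingerprints of entire rows
    flagged = []            # (line number, issue description)

    with open(path, newline="", encoding="utf-8") as handle:
        for line_num, row in enumerate(csv.DictReader(handle), start=2):
            # duplicate row check
            row_fp = fingerprint((row.get(f) or "") for f in sorted(row))
            if row_fp in seen_rows:
                flagged.append((line_num, "duplicate row"))
            seen_rows.add(row_fp)

            # id consistency check (skipped when panels are expected,
            # since panel rows legitimately vary within one id)
            if not expect_panels:
                fp = fingerprint((row.get(f) or "") for f in CONSISTENT_FIELDS)
                row_id = row.get("id") or ""
                if id_fingerprints.setdefault(row_id, fp) != fp:
                    flagged.append(
                        (line_num, "fields inconsistent with earlier rows for this id"))
    return flagged
```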
Right now this script should not make any changes to the data, it should just generate a summarization report (maybe a text, log, or markdown file) and flag rows that had a potential issue. Also, do not use pandas: a lot of the data files we get are very large, and you will run into memory issues using pandas. It will not scale well. Use the Python csv library to read the file contents line by line (you could look into polars potentially, but csv will be a lot more straightforward and simpler, I'd imagine).
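For the report side, the streaming read/report loop could look something like this (reusing the hypothetical `check_row` from the sketch above; the report layout is just a guess at the markdown option):

```python
import csv

def run_qc(data_path: str, report_path: str) -> None:
    """Stream the CSV row by row and write a markdown summary report.
    The data file itself is never modified."""
    total = 0
    flagged = []  # (line number, list of issues)

    with open(data_path, newline="", encoding="utf-8") as handle:
        # csv.DictReader yields one row at a time, so memory use stays
        # flat no matter how large the input file is (unlike pandas)
        for line_num, row in enumerate(csv.DictReader(handle), start=2):
            total += 1
            issues = check_row(row)  # per-row checks from the sketch above
            if issues:
                flagged.append((line_num, issues))

    with open(report_path, "w", encoding="utf-8") as report:
        report.write("# QC Summary\n\n")
        report.write(f"- Rows checked: {total}\n")
        report.write(f"- Rows flagged: {len(flagged)}\n\n")
        for line_num, issues in flagged:
            report.write(f"## Line {line_num}\n")
            for issue in issues:
                report.write(f"- {issue}\n")
            report.write("\n")
```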
@DaniallMasood add anything else you want or that I missed. You can join the call as well if you want; you might have other requests or a different perspective on the data since you do most of the QC.
Instead of command line options, you could define a JSON format for input, for example:
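Something along these lines, maybe (field names here are placeholders to illustrate the idea, not a final schema):

```json
{
    "input_file": "data/incoming_dataset.csv",
    "report_file": "reports/qc_report.md",
    "expect_panels": false,
    "terminology": {
        "best_biomarker_role": [
            "risk", "diagnostic", "monitoring", "prognostic",
            "predictive", "response", "safety"
        ]
    }
}
```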