clinical-biomarkers / biomarker-issue-repo

Issues repo for the biomarker team.

Data Preprocessing QC Script #32

Open seankim658 opened 1 month ago

seankim658 commented 1 month ago

We need a preprocessing script that runs some basic quality checks on our incoming datasets. @Reeya123, we'll need to have a call to go over the requirements. Here is my current list of features that would be helpful:

For now, this script should not make any changes to the data. It should only generate a summary report (maybe a text, log, or Markdown file) and flag rows that have a potential issue. Also, do not use pandas: a lot of the data files we get are very large, and you will run into memory issues because pandas does not scale well to files that size. Use Python's csv module to read the file contents line by line (you could potentially look into polars, but csv will be a lot more straightforward and simple, I'd imagine).
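
A minimal sketch of that read-only, line-by-line approach, using only the standard library. The two checks shown (wrong column count, empty cells) and the names `scan_rows` and `write_report` are placeholder assumptions for illustration, not the agreed requirements:

```python
import csv
from collections import Counter


def scan_rows(lines):
    """Stream rows from an iterable of CSV lines and collect flagged rows.

    The checks below are illustrative assumptions, not the final QC rules.
    The input is only read; nothing is written back to the data.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:  # empty input
        return {"rows": 0, "issues": Counter(), "flagged": []}

    issues = Counter()
    flagged = []  # (line number, reason) pairs
    rows = 0
    for row in reader:
        rows += 1
        line_no = reader.line_num  # physical line number in the source
        if len(row) != len(header):
            issues["column_count_mismatch"] += 1
            flagged.append((line_no, "column_count_mismatch"))
            continue
        empty = [name for name, value in zip(header, row) if not value.strip()]
        if empty:
            issues["empty_cell"] += 1
            flagged.append((line_no, "empty_cell: " + ", ".join(empty)))
    return {"rows": rows, "issues": issues, "flagged": flagged}


def write_report(summary, out):
    """Write a plain-text summary to an open text file-like object."""
    out.write(f"rows checked: {summary['rows']}\n")
    for reason, count in sorted(summary["issues"].items()):
        out.write(f"{reason}: {count}\n")
    for line_no, reason in summary["flagged"]:
        out.write(f"line {line_no}: {reason}\n")
```

Passing a file opened with `open(path, newline="")` as `lines` keeps only one row in memory at a time. The `flagged` list still grows with the number of problem rows, so for very dirty files it may be better to write each flagged row to the report as it is found.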

@DaniallMasood, add anything else you want or that I missed. You can join the call as well if you want; you might have other requirements or a different perspective on the data, since you do most of the QC.

Instead of command-line options, you could define a JSON format for the input, for example:

{
  "terminology": {
    "best_biomarker_role": ["risk", "diagnostic", "monitoring", "prognostic", "predictive", "response", "safety"],
    "assessed_entity_type": [
      "DNA",
      "protein complex",
      "lipoprotein",
      "protein",
      "electrolyte",
      "cell",
      "lipid",
      "oligosaccharide",
      "miRNA",
      "gene",
      "metabolite",
      "carbohydrate",
      "glycan",
      "chemical element",
      "amino acid",
      "peptide",
      "RNA"
    ]
  },
  "flags": {
    "best_biomarker_role": ["predictive"],
    "assessed_entity_type": ["peptide"]
  }
}
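
One way a script could consume a config like this is sketched below. It assumes that "terminology" lists the allowed values for a column and that "flags" lists values that should be reported whenever they appear; that reading of the two keys, and the helper names `load_config`, `check_row`, and `check_lines`, are assumptions for illustration only:

```python
import csv
import json


def load_config(text):
    """Parse the JSON config text; both top-level keys are treated as optional."""
    config = json.loads(text)
    terminology = {col: set(vals) for col, vals in config.get("terminology", {}).items()}
    flags = {col: set(vals) for col, vals in config.get("flags", {}).items()}
    return terminology, flags


def check_row(row, terminology, flags):
    """Return a list of issue strings for one row (a dict of column -> value)."""
    issues = []
    # Assumed meaning: a value outside the allowed terms is an issue.
    for column, allowed in terminology.items():
        value = row.get(column)
        if value is not None and value not in allowed:
            issues.append(f"{column}: unexpected term {value!r}")
    # Assumed meaning: a value listed under "flags" is always reported.
    for column, watched in flags.items():
        if row.get(column) in watched:
            issues.append(f"{column}: flagged value {row[column]!r}")
    return issues


def check_lines(lines, terminology, flags):
    """Yield (line number, issues) for each row with at least one issue."""
    reader = csv.DictReader(lines)
    for row in reader:
        issues = check_row(row, terminology, flags)
        if issues:
            yield reader.line_num, issues
```

The comparison here is exact and case-sensitive, so "Risk" would be reported as an unexpected term. Columns named in the config but absent from the file are skipped silently in this sketch; whether that should itself be reported is a requirements question.
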
Reeya123 commented 1 month ago

This is done. The script generates similar output: [image]

seankim658 commented 3 weeks ago

@DaniallMasood try this out while you're doing your QC and see how it goes