We need a preprocessing script that does some basic quality checks on our incoming datasets. @Reeya123 we'll have to have a call to go over the requirements. Right now my list of features that would be helpful (rough Python sketches of a few of these checks follow the list):
[x] make sure the correct case formatting is used for various fields
[x] `biomarker` field first word lowercase
[x] `best_biomarker_role` lowercase (rows with multiple roles will be formatted like `role1;role2`, so you'll have to split before checking)
[x] `specimen` (if present) lowercase
[x] `evidence_source` resource (before the colon) should be title case (e.g. in Python that would be `string.title()`)
[x] some basic temp ID checking (have a flag to tell the script whether to expect panel biomarkers or not in the dataset)
[x] if no panel biomarkers, all rows with the same `id` field should have the exact same values for the following fields: `biomarker`, `assessed_biomarker_entity`, `assessed_biomarker_entity_id`, `assessed_entity_type`, `condition`, and `condition_id`
[x] check expected data formats
[x] `assessed_biomarker_entity_id` should be in the format of `resource:id`
[x] `condition_id` should be in the same format as above
[x] if both `exposure_agent` and `exposure_agent_id` are not present, then `condition` and `condition_id` are required, and vice versa
[x] check the `best_biomarker_role`, `biomarker`, and `assessed_entity_type` fields against lists of standardized terminology; for the role field the allowed terms are risk, diagnostic, monitoring, prognostic, predictive, response, or safety (remember the role field can hold multiple values split by a semicolon); add a JSON config file for the standardized terminology and flagging
[x] Flag duplicate rows
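Here's a minimal sketch of what the per-row case and format checks could look like. The role vocabulary and the `resource:id` regex are assumptions for illustration; in the real script both would come from the JSON terminology config mentioned above:

```python
import re

# Assumed role vocabulary and resource:id pattern -- in practice these
# would be loaded from the JSON terminology config, not hardcoded.
ALLOWED_ROLES = {"risk", "diagnostic", "monitoring", "prognostic",
                 "predictive", "response", "safety"}
RESOURCE_ID_PATTERN = re.compile(r"^[^:\s]+:\S+$")

def check_row(row: dict) -> list[str]:
    """Return human-readable issues found in a single CSV row."""
    issues = []

    # biomarker: first word should be lowercase
    first_word = (row.get("biomarker") or "").split(" ", 1)[0]
    if first_word != first_word.lower():
        issues.append(f"biomarker first word not lowercase: {first_word!r}")

    # best_biomarker_role: lowercase; may hold several roles split by ';'
    for role in (row.get("best_biomarker_role") or "").split(";"):
        role = role.strip()
        if role and role != role.lower():
            issues.append(f"role not lowercase: {role!r}")
        if role and role.lower() not in ALLOWED_ROLES:
            issues.append(f"role not in standardized terminology: {role!r}")

    # specimen: lowercase if present
    specimen = row.get("specimen") or ""
    if specimen != specimen.lower():
        issues.append(f"specimen not lowercase: {specimen!r}")

    # evidence_source: resource prefix (before the colon) should be title case
    evidence = row.get("evidence_source") or ""
    if ":" in evidence:
        resource = evidence.split(":", 1)[0]
        if resource != resource.title():
            issues.append(f"evidence_source resource not title case: {resource!r}")

    # assessed_biomarker_entity_id / condition_id: resource:id format
    for field in ("assessed_biomarker_entity_id", "condition_id"):
        value = row.get(field) or ""
        if value and not RESOURCE_ID_PATTERN.match(value):
            issues.append(f"{field} not in resource:id format: {value!r}")

    # condition(_id) and exposure_agent(_id): at least one pair is required
    has_exposure = row.get("exposure_agent") and row.get("exposure_agent_id")
    has_condition = row.get("condition") and row.get("condition_id")
    if not (has_exposure or has_condition):
        issues.append("neither exposure_agent(_id) nor condition(_id) present")

    return issues
```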
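And one possible approach for the temp ID consistency and duplicate checks that stays memory-friendly on large files by hashing row contents instead of storing them (`check_ids_and_duplicates` and `fingerprint` are made-up names, just for illustration):

```python
import csv
import hashlib

# Fields that must match exactly across all rows sharing the same id
# (only enforced when the dataset has no panel biomarkers)
CONSISTENT_FIELDS = ("biomarker", "assessed_biomarker_entity",
                     "assessed_biomarker_entity_id", "assessed_entity_type",
                     "condition", "condition_id")

def fingerprint(values) -> str:
    # Hash instead of keeping full values so memory stays small on big files;
    # the unit separator avoids accidental collisions when joining
    joined = "\x1f".join(values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def check_ids_and_duplicates(path: str, expect_panels: bool = False):
    """Stream the file once; flag duplicate rows and inconsistent ids."""
    id_fingerprints = {}    # id -> fingerprint of CONSISTENT_FIELDS
    seen_rows = set()       # fingerprints of entire rows
    flagged = []            # (line number, issue description)

    with open(path, newline="", encoding="utf-8") as handle:
        for line_num, row in enumerate(csv.DictReader(handle), start=2):
            # duplicate row check
            row_fp = fingerprint((row.get(f) or "") for f in sorted(row))
            if row_fp in seen_rows:
                flagged.append((line_num, "duplicate row"))
            seen_rows.add(row_fp)

            # id consistency check (skipped when panels are expected,
            # since panel rows legitimately vary within one id)
            if not expect_panels:
                fp = fingerprint((row.get(f) or "") for f in CONSISTENT_FIELDS)
                row_id = row.get("id") or ""
                if id_fingerprints.setdefault(row_id, fp) != fp:
                    flagged.append(
                        (line_num, "fields inconsistent with earlier rows for this id"))
    return flagged
```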
Right now this script should not make any changes to the data, it should just generate a summarization report (maybe a text, log, or markdown file) and flag rows that had a potential issue. Also, do not use pandas: a lot of the data files we get are very large, and you will run into memory issues using pandas. It will not scale well. Use the Python csv library to read the file contents line by line (you could look into polars potentially, but csv will be a lot more straightforward and simpler, I'd imagine).
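For the report side, the streaming read/report loop could look something like this (reusing the hypothetical `check_row` from the sketch above; the report layout is just a guess at the markdown option):

```python
import csv

def run_qc(data_path: str, report_path: str) -> None:
    """Stream the CSV row by row and write a markdown summary report.
    The data file itself is never modified."""
    total = 0
    flagged = []  # (line number, list of issues)

    with open(data_path, newline="", encoding="utf-8") as handle:
        # csv.DictReader yields one row at a time, so memory use stays
        # flat no matter how large the input file is (unlike pandas)
        for line_num, row in enumerate(csv.DictReader(handle), start=2):
            total += 1
            issues = check_row(row)  # per-row checks from the sketch above
            if issues:
                flagged.append((line_num, issues))

    with open(report_path, "w", encoding="utf-8") as report:
        report.write("# QC Summary\n\n")
        report.write(f"- Rows checked: {total}\n")
        report.write(f"- Rows flagged: {len(flagged)}\n\n")
        for line_num, issues in flagged:
            report.write(f"## Line {line_num}\n")
            for issue in issues:
                report.write(f"- {issue}\n")
            report.write("\n")
```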
@DaniallMasood add anything else you want or that I missed. You can join the call as well if you want; you might have other requests or a different perspective on the data since you do most of the QC.
Instead of command line options, you could define a JSON format for input, for example:
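Something along these lines, maybe (field names here are placeholders to illustrate the idea, not a final schema):

```json
{
    "input_file": "data/incoming_dataset.csv",
    "report_file": "reports/qc_report.md",
    "expect_panels": false,
    "terminology": {
        "best_biomarker_role": [
            "risk", "diagnostic", "monitoring", "prognostic",
            "predictive", "response", "safety"
        ]
    }
}
```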