clinical-biomarkers / biomarker-partnership

CFDE Biomarker Partnership
https://hivelab.biochemistry.gwu.edu/biomarker-partnership

Disease mapping #22

Closed seankim658 closed 8 months ago

seankim658 commented 10 months ago

Work on creating a disease map for the data processing pipeline to simplify ID matching.

For example, in the ClinVar data, disease names are very specific, such as "esophageal squamous cell carcinoma" and "pancreatic cancer, susceptibility to 1". These should be mapped to "esophageal cancer" and "pancreatic cancer", respectively.
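A minimal sketch of how such a map could be applied, assuming a two-column CSV layout. The column names `specific_name` and `mapped_name`, the helper names, and the inline sample data are hypothetical; the real `disease_map.csv` may be laid out differently.

```python
import csv
import io

# Hypothetical sample; the two rows come from the examples in this issue.
SAMPLE_MAP_CSV = """specific_name,mapped_name
esophageal squamous cell carcinoma,esophageal cancer
"pancreatic cancer, susceptibility to 1",pancreatic cancer
"""


def load_disease_map(handle):
    """Read a specific-name -> general-name map from an open CSV handle."""
    reader = csv.DictReader(handle)
    # Keys are lowercased and stripped so that lookups ignore case and padding.
    return {
        row["specific_name"].strip().lower(): row["mapped_name"].strip()
        for row in reader
    }


def map_disease(name, disease_map):
    """Return the general disease name, or None when the name is unmapped."""
    return disease_map.get(name.strip().lower())


disease_map = load_disease_map(io.StringIO(SAMPLE_MAP_CSV))
mapped = map_disease("Esophageal squamous cell carcinoma", disease_map)
unmapped = map_disease("some unlisted condition", disease_map)
```

Returning `None` for unmapped names, rather than echoing the input, keeps the rows that still need manual review easy to find.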

seankim658 commented 10 months ago

Wrote a quick script to capture all the disease name mappings. It captures most of the mappings, but the output falls out of alignment toward the later indices. While the original script was being created, manual data changes were made in Excel. At some point during those manual alterations, 144 rows of data were added (or incorrectly split from the raw data).

In the original Jupyter notebook, ClinVar.ipynb, there are 220,938 rows in the dataset (Cell 26, ClinVar.ipynb) before the manual mapping done outside of the code; this should be the final number. When the data is read back in after the manual Excel changes, there are 220,926 rows (Cell 80, ClinVar.csv). Some splitting is then done, and the resulting number of rows is 221,082 (Cell 86, ClinVar.ipynb), which is 144 more than the expected 220,938.
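The row counts above can be reconciled with a few lines of arithmetic. This is only a sketch: the variable names are made up, and reading the differences as "rows lost in Excel" and "rows added by splitting" is an assumption about what happened, not something the notebook confirms.

```python
# Row counts as reported in this comment.
rows_before_manual_edit = 220_938  # Cell 26, before the manual mapping
rows_after_excel_reload = 220_926  # Cell 80, after the manual Excel changes
rows_after_splitting = 221_082     # Cell 86, after splitting

# Assumed interpretation: rows that went missing during the Excel round trip.
lost_in_excel = rows_before_manual_edit - rows_after_excel_reload

# Assumed interpretation: rows introduced by the later splitting step.
added_by_splitting = rows_after_splitting - rows_after_excel_reload

# Net surplus relative to the expected final count.
net_extra = rows_after_splitting - rows_before_manual_edit

print(lost_in_excel, added_by_splitting, net_extra)
```

Under that reading, the net surplus of 144 rows is the combination of 12 rows lost and 156 rows gained, so the misalignment may come from more than one step.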

Daniall will have to help me correct some of the later mappings in the CSV file output by my script. The map can be found in the rework branch of the repository (connected to this issue), at mapping_data/disease_map.csv. The corrections can just be done manually.

seankim658 commented 10 months ago

Finished the first manual pass through the disease map CSV; the updated file was pushed to the rework branch. @DaniallMasood will have to go in and confirm or change some of the disease mappings.