cBioPortal / GSoC

Documentation repository of Google Summer of Code (GSoC) project ideas for cBioPortal and related projects

Create an automated metadata harmonization/curation tool #111

Open shbrief opened 4 months ago

shbrief commented 4 months ago

Background: Though many omics data repositories host large volumes of datasets from diverse studies, cross-study analysis within these repositories is still limited by the heterogeneity of their metadata structures. This lack of metadata harmonization especially impedes the development and application of machine learning tools for high-throughput biological data, tools that are in high demand because multi-omics datasets are complex and high-dimensional. To facilitate comparable analysis across data sources through machine learning, we initiated the OmicsMLRepo project to harmonize metadata from diverse omics data repositories. Under this project, we manually reviewed metadata schemas, consolidated similar or identical information spread across schemas, and incorporated ontologies where possible. One of our target data repositories is cBioPortal: we have harmonized cBioPortal’s key clinical metadata across the whole data repository, not just at the study level, and incorporated ontology terms to improve the AI/ML-readiness of the cBioPortal data.

We performed a manual inspection of clinical metadata from 375 studies in cBioPortal (available on 5/13/2023) and harmonized major attributes, such as treatment, demographic information (e.g., age, sex), and disease. For example, 24 different values (e.g., RADIO_THERAPY, Rad, XRT) categorized as ‘treatment_type’ were harmonized into a single ontology term, “Radiation Therapy” (NCIT:C15313). While the comparability of the 375 datasets has improved substantially, cBioPortal is continuously growing, and we want to harmonize/digest new data so that it follows the data dictionary established under the OmicsMLRepo project. To reduce this maintenance effort, we would like to create an automated data harmonization tool.
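As a rough illustration of what the manual harmonization above produces, the sketch below stores the curated result as a lookup table. The dictionary layout and the function name `harmonize_treatment_type` are assumptions made for illustration, not the OmicsMLRepo data dictionary format; only the three raw values quoted above are listed.

```python
# Curated ontology term taken from the text above; the tuple layout is an assumption.
RADIATION_THERAPY = ("Radiation Therapy", "NCIT:C15313")

# Only the three raw values quoted in the text are included here;
# the remaining values are not listed in the issue.
TREATMENT_TYPE_SYNONYMS = {
    "radio_therapy": RADIATION_THERAPY,
    "rad": RADIATION_THERAPY,
    "xrt": RADIATION_THERAPY,
}


def harmonize_treatment_type(raw_value):
    """Return (label, ontology_id) for a known raw value, or None if it still needs curation."""
    return TREATMENT_TYPE_SYNONYMS.get(raw_value.strip().lower())
```

A table like this only covers values that a curator has already seen, which is why new incoming data needs either more manual curation or an automated matching step.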

The main approach we are currently considering is semantic similarity. Understanding the meaning of a set of terms is often not straightforward because different words can have the same meaning (e.g., leukemia and blood cancer). “Semantic similarity search” means searching by meaning rather than by exact wording, through “encoding”. Encoding transforms words or sentences into vectors of numbers, i.e., points in an N-dimensional space (N is usually 700~2,000) in which points near each other have similar meanings. We want to encode both curated terms (from our data dictionary) and uncurated terms (from new incoming data), compare them, and map uncurated terms to curated terms.
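The encode, compare, and map steps could be sketched as follows. This is a minimal sketch, not the project's implementation: `encode` here is a toy stand-in that counts character trigrams, so it captures spelling overlap rather than meaning, and a real tool would swap in a learned sentence-embedding model. The function names, the example term lists, and the `threshold` value are all assumptions for illustration.

```python
import math
from collections import Counter


def encode(term, n=3):
    """Toy stand-in encoder: character n-gram counts (NOT a semantic embedding)."""
    text = f" {term.lower().replace('_', ' ')} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_terms(uncurated, curated, threshold=0.3):
    """Map each uncurated term to its nearest curated term, or None below the threshold."""
    curated_vecs = {c: encode(c) for c in curated}
    result = {}
    for term in uncurated:
        vec = encode(term)
        best, score = max(
            ((c, cosine(vec, cv)) for c, cv in curated_vecs.items()),
            key=lambda pair: pair[1],
        )
        result[term] = (best if score >= threshold else None, round(score, 3))
    return result


# Hypothetical curated and uncurated term lists for illustration.
curated = ["Radiation Therapy", "Chemotherapy", "Hormone Therapy"]
uncurated = ["RADIO_THERAPY", "chemo therapy", "XRT"]
mapping = map_terms(uncurated, curated)
```

With this toy encoder, "RADIO_THERAPY" and "chemo therapy" land on their curated counterparts, while "XRT" shares no trigrams with any curated term and is left unmapped. Abbreviations and true synonyms such as leukemia and blood cancer are exactly the cases where a learned embedding, or a human reviewer, is needed; leaving low-scoring terms as None gives the dashboard a natural queue of items for manual review.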

Goal:

  1. Create an automated data harmonization/digestion tool based on semantic similarity search.
  2. Create an interactive dashboard showing (and potentially allowing edits on) the automated harmonization results.

Approach:

Skills needed: Python, R

Possible mentors: Sehyun Oh (@shbrief), Sean Davis (@seandavi)

If you are interested: Anyone interested in this project, please try the EDA below and email your EDA work to Sehyun.Oh@sph.cuny.edu. Looking forward to hearing your idea. Thanks!

  1. How can you harmonize new_meta to follow the curated_meta schema with minimum manual curation? You can sketch the overall process or select a specific attribute to demonstrate your idea. (metadata_samples.zip)
  2. automated_metadata_curation.ipynb (curated_bodysite.csv)

HarshaTejaswi commented 4 months ago

Hello maintainers, I am interested in doing this, but I will need a team. For now, I can create a functional prototype of the automated data harmonization tool and an interactive dashboard.

Please acknowledge if I can proceed.

RINO-GAELICO commented 4 months ago

Hi (@shbrief), this seems like a very interesting project to work on. I am familiar with cosine similarity and vectorization of words. I have used Word2vec in the past, but I am new to txt2onto, which seems specialized in tissue and cell-type annotations. I also like the challenge of harmonizing metadata and mapping uncurated terms to curated terms. What kind of input would be fed into the tool? How is the study structured?

iOmkarNikam commented 4 months ago

Hello cBioPortal mentors team!

Sehyun Oh (@shbrief), Sean Davis (@seandavi)

I'm Omkar Nikam, a second-year student studying Artificial Intelligence and Machine Learning (CSE) at KITCOEK (India). I am a very keen learner. My expertise includes Python, C, C++, Java, HTML, CSS, JavaScript, and PHP. I'm a beginner in open source. Because of my past education, my interests include biology, medicine, healthcare, pharmacy, public health, and biomedical research. Presently I'm passionate about technology, and I am interested in working with cBioPortal projects for GSoC 2024, particularly: Create an automated metadata harmonization/curation tool #111

I have good exposure to machine learning concepts, and I have started learning the R programming language; I'll be comfortable with it soon.

Please guide me on the next steps!

Sincerely, Omkar

shbrief commented 4 months ago

Thank you for expressing your interest! @HarshaTejaswi @RINO-GAELICO @iOmkarNikam Anyone interested in this project, please try one of these two EDA and share your work. Please email your EDA work to Sehyun.Oh@sph.cuny.edu. Looking forward to hearing your idea!

  1. How can you harmonize new_meta to follow the curated_meta schema with minimum manual curation? You can sketch the overall process or select a specific attribute to demonstrate your idea. (metadata_samples.zip)
  2. automated_metadata_curation.ipynb (curated_bodysite.csv)

RINO-GAELICO commented 3 months ago

Hi @shbrief, I am working on number 2 and I have a question. How are we supposed to "create a set of uncurated values"? Are those the data points that have NA in their columns in [curated_bodysite.csv], or do we need to collect them from somewhere else? Thanks!

shbrief commented 3 months ago

@RINO-GAELICO The original_bodysite column contains the values before harmonization/curation. So a part of them can be used as a test dataset representing uncurated/incoming data.

praveshkumar1 commented 3 months ago

Hi @shbrief, I would also like to contribute to this issue. Can I start with the EDA of the resources you provided above?

VaishnaviMudaliar commented 3 months ago

Hello @shbrief , Vaishnavi Mudaliar this side :) Currently, I am pursuing a Master's of Technology in Artificial Intelligence and am keenly interested in this project for Google Summer of Code 2024. I am well versed in machine learning and deep learning techniques, and I am extremely passionate about transformers. I would truly be grateful if I got a chance to work with this one! For your reference, I am attaching the link to my colab where I code out the solution for the second task among the tasks mentioned above.

RINO-GAELICO commented 3 months ago

Hi @shbrief,

Could you please take a look at my colab notebook and provide feedback?

[Colab notebook]() (removed)

Thank you

shbrief commented 3 months ago

Thanks @VaishnaviMudaliar @RINO-GAELICO!

Could you remove the link to your colab notebook from your reply and directly send it to me? My email is Sehyun.Oh@sph.cuny.edu. Thanks.

VaishnaviMudaliar commented 3 months ago

Mailed, thanks a lot :)

VaishnaviMudaliar commented 3 months ago

Hello @shbrief , could you please let me know where to check if we have successfully gotten through the evaluation task or not? Many thanks.

HarshaTejaswi commented 3 months ago

Hello @shbrief, I have made a prototype with an explanation of the architecture, and I have mailed the colab link to you. Please check. Thank you.

adhal007 commented 3 months ago

Hi @shbrief,

I've worked as a data scientist for 3+ years in the bioinformatics domain. Having a background in biophysics and statistical genetics (M.S. from UC Davis, B.Tech from IIT-BHU), I'm passionate about streamlining ML- and AI-based pipelines in bioinformatics. One of my personal projects in the domain of automation and streamlining of bioinformatics tools and ML is https://github.com/adhal007/OmixHub.

I've worked out a basic solution for the automated metadata curation (Problem 2) and emailed you the code. I look forward to getting your feedback on my approach and discussing any prospective avenues to contribute to this exciting project at GSoC 2024!

Best Regards, Abhilash Dhal

manheraa commented 3 months ago

Hi @shbrief, I am Sachetan Heralagi, currently pursuing my Bachelor of Engineering degree in Computer Science and Engineering. In my academic journey, I have had the opportunity to explore various neural network architectures, ranging from simple ANNs to more complex models like VGG19. One of the projects I take pride in is leading a team to develop an automated irrigation system using ANN technology for weather prediction and integration with Internet of Things (IoT) devices. Our project even made it to the finals of IEEE YESIST12, which was a significant achievement for us. I am reaching out to express my interest in becoming a part of this community and to inquire about any opportunities to collaborate or contribute to ongoing projects. I have worked on the basics of automated metadata curation and mailed my work to the email address mentioned above. In a separate email, I have also sent a proposal describing different entity resolution framework approaches.