databio / bdshack19

Coordinating the 2019 Biomedical Data Science Hackathon at UVA

Machine learning functions #3

Open gregmedlock opened 5 years ago

gregmedlock commented 5 years ago

We'd like to develop general functions for predicting one layer of data from the other data type for each dataset (e.g. predict protein abundances from RNA, predict RNA from ATACseq).

Eventually, these functions should interface with the data object we are also developing (e.g. they could each be functions that take a data object as input).

Brainstorm analysis ideas and implementation strategies below.

DerekBivona commented 5 years ago

I have a simple script that uses machine learning regression models to predict a continuous outcome. It uses linear regression (LR), random forest (RF) regression, support vector machine (SVM) regression, and multi-layer perceptron (MLP) regression. The script trains every model on the same training data and tests every model on the same testing data (through cross-validation). It also optimizes several hyperparameters for RF, SVM, and MLP. I output an average R-squared value per fold of cross-validation. There is no feature selection associated with this script. Let me know if this could be used with our hackathon objective!
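For anyone who wants to see the shape of that evaluation loop, here is a minimal standard-library sketch of k-fold cross-validation that averages R-squared across folds. It is not the script itself: it fits only a one-feature least-squares line, skips hyperparameter tuning, and the names `fit_line`, `r_squared`, and `cross_val_r2` are hypothetical helpers made up for this illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    """Coefficient of determination on held-out data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(xs, ys, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, average R-squared."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        preds = [intercept + slope * xs[i] for i in test_idx]
        scores.append(r_squared([ys[i] for i in test_idx], preds))
    return sum(scores) / k


# Toy data (made up): a noisy linear relationship between X and Y.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
mean_r2 = cross_val_r2(xs, ys)
```

Because every model would be scored on the same folds, fixing the shuffle seed as above is one way to keep the comparison between models fair.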

Yaseswini commented 5 years ago

Probably yes. We were thinking of fitting a regression model (?) or computing a simple correlation between a gene's expression and the ATAC peak intensity in its promoter region across all the cells, to identify genes whose expression can be explained by the presence of ATAC peaks. Any thoughts?
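A rough sketch of the correlation idea, assuming we already have, for each gene, one expression value and one promoter ATAC signal per cell in the same cell order. The toy numbers, the dictionary layout, and the 0.8 cutoff are all made up for illustration and are not taken from the hackathon data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined here; treat as no association
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one value per cell, same cell order in both dicts.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [5.0, 1.0, 4.0, 2.0],
}
promoter_atac = {
    "geneA": [0.5, 1.0, 1.5, 2.0],
    "geneB": [1.0, 1.0, 1.0, 1.0],
}

correlations = {gene: pearson(expression[gene], promoter_atac[gene])
                for gene in expression}

# Arbitrary cutoff for "expression tracks promoter accessibility".
explained = [gene for gene, r in correlations.items() if abs(r) >= 0.8]
```

A per-gene regression would follow the same loop, with the fit replacing the `pearson` call.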

nsheff commented 5 years ago

Sure, like @Yaseswini says -- can you figure out how to add it as a function on the MultiAnnData object to do some predictions across modalities (from one data type, like RNA, to the other, like ATAC)?

DerekBivona commented 5 years ago

I'm adding the script to the mixsc folder (named 'RegressionML.py') & will try to incorporate it into the MultiAnnData object!

Yaseswini commented 5 years ago

Ok! Let me know if you need help with it

DerekBivona commented 5 years ago

I’m having trouble with the data! Can someone help? All we need to do is input the features as the variable X and the target as variable Y.

nsheff commented 5 years ago

What do you mean exactly? Can you point to more details about what you're referring to?

nsheff commented 5 years ago

@DerekBivona see if this helps: I've uploaded a notebook that shows how to use the MAD objects:

https://github.com/databio/bdshack19/blob/master/examples/use_MAD.ipynb

You would want to use one modality (mad.RNA) as X and another modality (mad.ATAC) as Y.
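Roughly, that could look like the sketch below. The notebook is the reference for how the real MAD object works; here a `SimpleNamespace` stands in for it, and I'm assuming each modality is an attribute holding a cells-by-features matrix with rows in the same cell order. The helper `modality_pair` is hypothetical and not part of the repository.

```python
from types import SimpleNamespace


def modality_pair(mad, source="RNA", target="ATAC"):
    """Return (X, Y): features from one modality, targets from another.

    Assumes `mad` exposes each modality as an attribute whose rows are
    cells, in the same order for both modalities.
    """
    X = getattr(mad, source)
    Y = getattr(mad, target)
    if len(X) != len(Y):
        raise ValueError("modalities must describe the same number of cells")
    return X, Y


# Stand-in for a real MAD object (toy values, 3 cells).
mad = SimpleNamespace(
    RNA=[[1.0, 0.0], [2.0, 1.0], [3.0, 0.5]],  # 3 cells x 2 genes
    ATAC=[[0.2], [0.4], [0.6]],                # 3 cells x 1 peak
)

X, Y = modality_pair(mad)                  # predict ATAC from RNA
X_rev, Y_rev = modality_pair(mad, "ATAC", "RNA")  # or the other way round
```

The resulting `X` and `Y` are what the regression script would take as its features and target.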