gboris / blkbox

Data exploration with multiple machine learning algorithms
14 stars 4 forks source link

error running blkbox function #55

Closed natabloch closed 7 years ago

natabloch commented 7 years ago

Hi,

I am very new to machine learning so I apologize if this is a basic question. I am attempting to run blkbox on RNAseq data. I have sequenced the transcriptome from two different treatments in multiple lines. I would like to use the data from one of the lines (LB) as training to identify genes with different expression patterns between treatments, and the data from another line (Wildtype) as holdout to run the model.

Basically, my data is normalized expression for a subset of genes in 3 samples for treatment ATT and 3 samples for treatment OPT.

When I try to run blkbox with

model_1 <- blkbox(LB_data, LB_labels, WT_data, WT_labels)

I get the following error:

Error in model@fit(data, ...) : fraction of 0.000000 is too small

Here LB_data and LB_labels are the training data and labels, and WT_data and WT_labels are the holdout data and labels.

Do you have any idea what this error means? Is my sample size too small for this type of analysis? Or is it due to some genes having 0 expression values? (See the quick check after the code below.)

I copied my full code below and attached the data. I apologize if the code is not very efficient; I am fairly new at this.

AvsU_genes.df <- read.table("TEL_AvsU_DEgenes_xblkbox.txt", header = TRUE, sep = "\t", row.names = 1, stringsAsFactors = F); head(AvsU_genes.df)
AvsU_genes.df <- AvsU_genes.df[, !(grepl("FEM", names(AvsU_genes.df)))]; head(AvsU_genes.df) # remove control data not needed here

#change gene names to remove special characters
new.names<-vector()
for (i in 1:nrow(AvsU_genes.df)) {
  gene <- row.names(AvsU_genes.df[i, ])
  if (grepl("TRINITY", gene) == FALSE) {
    new.names[i] <- (strsplit(gene, split = "=")[[1]])[2]
  } else {
    new.names[i] <- (strsplit(gene, split = "_c")[[1]])[1]
  }
}
str(new.names); new.names
row.names(AvsU_genes.df)<-new.names; head(AvsU_genes.df)

library(blkbox)  
LB.names<-names(AvsU_genes.df)[grepl("LB" , names(AvsU_genes.df))]
WT.names<-names(AvsU_genes.df)[grepl("WT" , names(AvsU_genes.df))]

#Training data: LB lines data
LB_data<-AvsU_genes.df[,LB.names]; head(LB_data); names(LB_data); dim(LB_data)
LB_data<-as.data.frame(t(LB_data)) # transpose so data is in the right format for blkbox
LB_Treat<- factor(substring(row.names(LB_data),1,3)); LB_Treat
LB_labels <- as.character(LB_Treat)
unique(LB_labels)

#Holdout data: WT lines data
WT_data<-AvsU_genes.df[,WT.names]; head(WT_data)
WT_data<-as.data.frame(t(WT_data))
WT_Treat<- factor(substring(row.names(WT_data),1,3)); WT_Treat
WT_labels <- as.character(WT_Treat)
unique(WT_labels)

#check names are the same
all(colnames(LB_data) == colnames(WT_data))

#Creating a Training & Testing Model
model_1 <- blkbox(LB_data, LB_labels, WT_data, WT_labels)
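Regarding my question above about genes with 0 expression: this is just a guess on my part, but a quick check like the one below (using the objects from my code above, with rows as genes and columns as the LB training samples) should at least tell me whether any gene is all zeros or constant across the training samples.

# Hypothetical sanity check: flag genes with all-zero expression or zero variance
# across the LB (training) samples, since constant features can trip up some model fits.
lb_expr <- AvsU_genes.df[, LB.names]
flat_genes <- rownames(lb_expr)[apply(lb_expr, 1, function(x) all(x == 0) || var(x) == 0)]
length(flat_genes); head(flat_genes)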

TEL_AvsU_DEgenes_xblkbox.txt

zacdav commented 7 years ago

Thanks for checking out blkbox,

I've had a look at your data and code; I think you have some minor misunderstandings which I'll address. For these algorithms to learn, you typically want as many samples as you can get. blkbox deals with tasks that aim to classify samples between binary outcomes, so in your case you would be interested in predicting either LB or WT.

From what I understood of your code, it seems like you were trying to train exclusively on LB samples and then predict on WT samples. This won't work: a model that has only ever seen oranges simply won't know what to do when it sees something that is not an orange, an apple for example.

So the idea would be to let blkbox randomly select some WT and LB samples, learn what makes WT and LB different, and then ask it to predict the outcome for samples where it isn't told whether they are WT or LB.

The code might look as follows; I've simply divided your data into WT and Other for this example.

# libraries
library(readr)
library(tidyverse)
library(blkbox)

# import data
gene_data <- read_delim("C:/Users/zac/Downloads/TEL_AvsU_DEgenes_xblkbox.txt", 
                        "\t", escape_double = FALSE, trim_ws = TRUE)

# remove control data not needed here
gene_data_subset <- gene_data[,!(grepl("FEM", names(gene_data)))]

# fixing first column name
colnames(gene_data_subset)[1] <- "genes"

# cleaning up gene names
new_names <- ifelse(grepl("TRINITY", gene_data_subset$genes),
                    gsub("(.*)_c.*", "\\1", gene_data_subset$genes),
                    gsub(".*=(.*)", "\\1", gene_data_subset$genes))

# reassign
gene_data_subset$genes <- new_names

# transpose and adjust
gene_data_subset2 <- as.data.frame(t(gene_data_subset), stringsAsFactors = F)
colnames(gene_data_subset2) <- as.character(gene_data_subset2[1, ])
gene_data_subset2 <- cbind(sample = colnames(gene_data_subset)[-1], 
                           gene_data_subset2[-1,])

# Response Column
response = ifelse(grepl("WT", gene_data_subset2$sample), "WT", "Other")

my_partition = Partition(data = gene_data_subset2[,-1],
                         labels = response)

# Creating a Training & Testing Model
model_1 <- blkbox(data = my_partition, exclude =  c("kknn", "bartmachine", "party", "PamR", "GLM", "nnet", "SVM", "xgboost"))
# Calculate Performance
perf = Performance(model_1)
# Standard ROC curve
blkboxROC(perf)

I would be extremely hesitant to trust any results from data this small, however. I did have some trouble getting this data to work with some algorithms; I believe this is due to the sample size.

natabloch commented 7 years ago

Hi Zachary,

Thank you so much for taking the time to help me with this!

You misunderstood the design of the experiment slightly (probably because I did not explain it well enough, sorry), but that does not change the sample size issue, so I might not be able to use this approach anyway.

WT and LB are two laboratory lines that have the same behavioral phenotype, that is, they exhibit the same response to treatment "ATT" in behavioral trials. We ran the same experiment in parallel for the WT and LB lines and sequenced the brain transcriptome after the behavioral trials to investigate the genetic basis of the observed behavior. So for the WT and LB lines we ran 2 treatments, ATT and the control labeled UNA, with 3 replicates in each case. Thus, for each line we have 3 samples for the ATT treatment and 3 samples for the UNA treatment.

But ultimately the goal is to identify genes that differ between the ATT treatment and the UNA control. I wanted to train the model using the ATT vs control data in LB and then use it to classify samples in the WT line according to treatment. Does that sound more reasonable?
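To be concrete, the setup I have in mind is essentially my original call, just with the labels being the treatment rather than the line. This is only a sketch of my intention, assuming (as in my code above) that the first three characters of the sample names encode the treatment:

# Sketch of the design I mean: train on the LB samples labelled by treatment,
# then use the WT samples (also labelled by treatment) as the holdout set.
LB_labels <- substring(rownames(LB_data), 1, 3)   # treatment codes for the LB samples
WT_labels <- substring(rownames(WT_data), 1, 3)   # treatment codes for the WT samples
model_1 <- blkbox(LB_data, LB_labels, WT_data, WT_labels)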

Even if the setup technically makes sense, I only have 3 samples per treatment (6 total) to train the model, and I don't know whether that would be enough.

I am really sorry to ask all these questions about such an unusual set-up. Machine learning is not really used in evolutionary genomics, and I know the sample sizes are very different from the experiments this type of algorithm was designed for. Each of these samples required hours of behavioral trials! But I think this type of analysis could be an amazing contribution to evolutionary biology if we can apply it to these smaller datasets!

Thank you again, I hope I am not stealing too much of your time!

Natasha

zacdav commented 7 years ago

Yeah, six is still not enough. Personally, I would not try this kind of analysis unless you had ~40 samples; ideally 80+.

I mean, in theory you can use it with a sample size that low, but in my opinion it would not produce robust or reliable results.

Unfortunately, I also think there is an inherent issue in the way you wish to run the analysis. Let me explain my thinking; I'll be referring to WT and LB as W/L, and ATT/UNA as T/C.

We will use 6 of each: TW TW TW TL TL TL | CW CW CW CL CL CL

What you are suggesting is:

Train: TL TL TL CL CL CL

Test: TW TW TW CW CW CW

The main issue I see is that restricting the training and testing sets at a secondary level, by only allowing W or L samples to appear together with others of the same line, means that you can only separate your data in one way.

This therefore eliminates any way of doing cross-fold validation. If you had more samples you could sample from a larger population to ensure variance in both the training and testing sets. This allows you to build confidence intervals, assess performance variance and address overfitting.
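To make that concrete, here is a rough base-R illustration (simulated data, not blkbox itself) of what resampling buys you once the split isn't fixed: you get a distribution of performance across folds rather than a single number.

# Illustration only: 5-fold cross-validation on simulated data shows how
# performance varies from fold to fold, which a single fixed split cannot show.
set.seed(1)
n <- 40                                           # hypothetical sample count
x <- data.frame(matrix(rnorm(n * 5), nrow = n))   # 5 fake features
y <- factor(rep(c("T", "C"), each = n / 2))       # binary outcome (treatment vs control)

folds <- sample(rep(1:5, length.out = n))         # random fold assignment
acc <- sapply(1:5, function(k) {
  train <- folds != k
  fit  <- glm(y[train] ~ ., data = x[train, ], family = binomial)
  prob <- predict(fit, newdata = x[!train, ], type = "response")
  mean((prob > 0.5) == (y[!train] == "T"))        # accuracy on the held-out fold
})
mean(acc); sd(acc)                                # average performance and its spread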

I might be wrong and it might not matter, but I just wouldn't place much confidence in the results.

I agree with you that it would be great to use ML for this problem, but unfortunately it requires more samples.

This is just one quick example showing that the number of samples matters: https://arxiv.org/pdf/1211.1323.pdf

natabloch commented 7 years ago

Thank you so much, Zachary. I really appreciate all this feedback. For the current experiment we would only rely on ML to validate a group of DE genes that we obtained with other methods, before we use them in downstream analysis. So I am just going to play around with this to see what comes out of it. However, we are currently designing some pedigrees that we want to analyze with ML, and everything you said on sample sizes and experimental design will be very useful.

Thank you again for taking the time to help me understand this!

Natasha

zacdav commented 7 years ago

Good luck. Feel free to reach out to Boris or myself if you have questions.