Closed natabloch closed 7 years ago
Thanks for checking out blkbox,
I've had a look at your data and code, and I think there are some minor misunderstandings, which I'll address. For these algorithms to learn, you typically want as many samples as you can get. blkbox deals with tasks that classify samples between binary outcomes, so in your case you would be interested in predicting either LB or WT.
From what I understood of your code, it seems you were trying to train exclusively on LB samples and then predict on WT samples. This won't work: a model that has only ever seen oranges will simply not know what to do when it sees something that's not an orange, such as an apple.
So the idea would be to let blkbox randomly select some WT and LB samples, learn what makes WT and LB different, and then ask it to predict the outcome for samples where it isn't told whether they are WT or LB.
The code might look as follows. I've simply divided your data into WT and Other for this example.
# libraries
library(readr)
library(tidyverse)
library(blkbox)
# import data
gene_data <- read_delim("C:/Users/zac/Downloads/TEL_AvsU_DEgenes_xblkbox.txt",
                        "\t", escape_double = FALSE, trim_ws = TRUE)
# remove control data not needed here
gene_data_subset <- gene_data[, !grepl("FEM", names(gene_data))]
# fix first column name
colnames(gene_data_subset)[1] <- "genes"
# clean up gene names
new_names <- ifelse(grepl("TRINITY", gene_data_subset$genes),
                    gsub("(.*)_c.*", "\\1", gene_data_subset$genes),
                    gsub(".*=(.*)", "\\1", gene_data_subset$genes))
# reassign
gene_data_subset$genes <- new_names
# transpose only the expression columns (assumed to be numeric) so the values
# are not coerced to character by the gene-name column;
# samples become rows and genes become columns
gene_data_subset2 <- as.data.frame(t(gene_data_subset[, -1]))
colnames(gene_data_subset2) <- gene_data_subset$genes
gene_data_subset2 <- cbind(sample = rownames(gene_data_subset2),
                           gene_data_subset2)
# response column
response <- ifelse(grepl("WT", gene_data_subset2$sample), "WT", "Other")
my_partition <- Partition(data = gene_data_subset2[, -1],
                          labels = response)
# create a training & testing model
model_1 <- blkbox(data = my_partition,
                  exclude = c("kknn", "bartmachine", "party", "PamR",
                              "GLM", "nnet", "SVM", "xgboost"))
# calculate performance
perf <- Performance(model_1)
# standard ROC curve
blkboxROC(perf)
I would be extremely hesitant to trust any results from data this small, however. I did have some trouble getting this data to work with some algorithms, which I believe is due to the sample size.
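To make the "randomly select some samples from both classes" idea concrete, here is a minimal base R sketch of a stratified train/test split. The labels and the 4-of-6 split size are made up for illustration, and this is not necessarily how blkbox's Partition() works internally.

```r
# Hypothetical labels: 6 "WT" and 6 "Other" samples (illustration only)
set.seed(1)
labels <- rep(c("WT", "Other"), each = 6)

# Stratified split: draw 4 training indices at random within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(idx) sample(idx, size = 4)))
# Everything not used for training is kept for testing
test_idx <- setdiff(seq_along(labels), train_idx)

table(labels[train_idx])  # 4 samples per class
table(labels[test_idx])   # 2 samples per class
```

Because both classes appear in the training and the testing set, the model gets to see examples of each outcome before it is asked to predict.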
Hi Zachary,
Thank you so much for taking the time to help me with this!
You misunderstood the design of the experiment slightly (probably because I did not explain it well enough, sorry), but that does not change the sample size issue, so I might not be able to use this approach anyway.
WT and LB are two laboratory lines that have the same behavioral phenotype; that is, they exhibit the same response to treatment “ATT” in behavioral trials. We ran the same experiment in parallel for the WT and LB lines and sequenced the brain transcriptome after the behavioral trials to investigate the genetic basis of the observed behavior. So for each of the WT and LB lines we ran 2 treatments, ATT and the control labeled UNA, with 3 replicates in each case. Thus, for each line we have 3 samples for the ATT treatment and 3 samples for the UNA treatment.
But ultimately the goal is to identify genes that differ between the ATT treatment and the UNA control. I wanted to train the model using the ATT vs. control data in LB and then use it to classify samples in the WT line according to treatment. Does that sound more reasonable?
Even if the setup technically makes sense, I only have 3 samples per treatment (6 total) to train the model, and I don't know whether that would be enough.
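As a rough sketch of the design described here, the response would be the treatment (ATT vs. UNA) and the line (LB vs. WT) would decide which samples go into training and which into the holdout. The sample names and the toy expression values below are hypothetical; the real column names in the data file may differ.

```r
# Hypothetical sample names (assumption: line and treatment are encoded in the name)
samples <- c("LB_ATT_1", "LB_ATT_2", "LB_ATT_3", "LB_UNA_1", "LB_UNA_2", "LB_UNA_3",
             "WT_ATT_1", "WT_ATT_2", "WT_ATT_3", "WT_UNA_1", "WT_UNA_2", "WT_UNA_3")

# Toy expression table: samples in rows, 5 made-up genes in columns
set.seed(42)
expr <- as.data.frame(matrix(rnorm(length(samples) * 5),
                             nrow = length(samples),
                             dimnames = list(samples, paste0("gene", 1:5))))

# The response is the treatment, not the line
treatment <- ifelse(grepl("ATT", samples), "ATT", "UNA")
is_lb <- grepl("^LB", samples)

# Train on LB, hold out WT
LB_data   <- expr[is_lb, ]
LB_labels <- treatment[is_lb]
WT_data   <- expr[!is_lb, ]
WT_labels <- treatment[!is_lb]

# These four objects correspond to the arguments of the call in the original question:
# model_1 <- blkbox(LB_data, LB_labels, WT_data, WT_labels)
```

This only shows how the labels and the two data sets would be laid out; it says nothing about whether 6 training samples are enough.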
I am really sorry to ask all these questions about such an unusual setup. ML is not really used in evolutionary genomics, and I know the sample sizes are very different from the experiments for which this type of algorithm was designed. Each of these samples required hours of behavioral trials! But I think this type of analysis could be an amazing contribution to evolutionary biology if we can apply it to these smaller datasets!
Thank you again, I hope I am not stealing too much of your time!
Natasha
[Quoted reply to Zachary Davies's comment of Jun 5, 2017, 7:52 PM; the quoted text repeats the first comment above.]
Yeah, six is still not enough. Personally, I would not try this kind of analysis unless I had ~40 samples, ideally 80+.
I mean, in theory you can use it with a sample size that low, but it would not produce robust or reliable results, in my opinion.
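One way to see why so few samples are a problem is a small simulation on pure noise. The sketch below is base R only and does not use blkbox; the 5000 genes are an arbitrary choice. With 3 vs. 3 samples, any single noise gene separates the two groups perfectly with probability 2 / choose(6, 3) = 0.1, so a large gene list will always contain many "perfect" markers by chance.

```r
set.seed(123)
n_genes <- 5000
group <- rep(c("ATT", "UNA"), each = 3)

# Pure noise: no gene is truly associated with the treatment
noise <- matrix(rnorm(n_genes * 6), nrow = n_genes)

# A gene "perfectly separates" the groups if every value in one group
# lies above every value in the other group
separates <- apply(noise, 1, function(x) {
  a <- x[group == "ATT"]
  b <- x[group == "UNA"]
  min(a) > max(b) || min(b) > max(a)
})

# Fraction of noise genes that separate the groups; should be close to 0.1
mean(separates)
```

A classifier trained on such data can look perfect on the samples it has seen while having learned nothing that generalizes.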
Unfortunately, I also think there is an inherent issue in the way you wish to run the analysis. Let me explain my thinking; I'll refer to WT and LB as W/L, and ATT/UNA as T/C.
We will use 6 of each: TW TW TW TL TL TL | CW CW CW CL CL CL
What you are suggesting is:
Train: TL TL TL CL CL CL
Test: TW TW TW CW CW CW
The main issue I see is that restricting training and testing at a secondary level, by only allowing W or L samples to appear alongside others of the same line, means you can only split your data in one way.
This eliminates any way of doing cross-validation. If you had more samples, you could sample from a larger population to ensure variance in both the training and testing sets. That allows you to build confidence intervals, assess performance variance and address overfitting.
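The "only one way to split" point can be counted directly. This base R sketch uses the 12 sample codes from the layout above and compares the restricted design with one possible alternative: balanced training sets of 3 T and 3 C drawn without regard to line. The alternative is an assumption chosen for illustration, not something blkbox prescribes.

```r
# The 12 samples from the layout above: treatment (T/C) followed by line (W/L)
samples <- c("TW", "TW", "TW", "TL", "TL", "TL",
             "CW", "CW", "CW", "CL", "CL", "CL")
treatment <- substr(samples, 1, 1)
line <- substr(samples, 2, 2)

# Restricted design: the training set is fixed to line L, so only one split exists
restricted_train <- which(line == "L")

# Unrestricted but balanced by treatment: any 3 of the 6 T and any 3 of the 6 C
t_choices <- combn(which(treatment == "T"), 3)
c_choices <- combn(which(treatment == "C"), 3)
n_splits <- ncol(t_choices) * ncol(c_choices)
n_splits  # 20 * 20 = 400
```

Even 400 possible splits of 12 samples overlap heavily, so they would not give independent performance estimates; the count only shows how much resampling the restricted design rules out.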
I might be wrong, and it may not matter, but I just wouldn't place my confidence in the results.
I agree that it would be great to use ML for this problem, but unfortunately it requires more samples.
This is just one quick example showing that the number of samples matters: https://arxiv.org/pdf/1211.1323.pdf
Thank you so much, Zachary. I really appreciate all this feedback. For the current experiment we would only rely on ML to validate a group of DE genes that we obtained with the methods before we use them in downstream analysis. So I am just going to play around with this to see what comes out of it. However, we are currently designing some pedigrees that we want to analyze with ML, and everything you said about sample sizes and experimental design will be very useful.
Thank you again for taking the time to help me understand this!
Natasha
Natasha Bloch, PhD
NSF Post-doctoral Fellow in Biology
Marie Curie research fellow
Department of Genetics, Evolution and the Environment
University College London
The Darwin Building, Gower Street, London WC1E 6BT
Tel: 020 7679 2170 (ext. 32170)
E-mail: n.bloch@ucl.ac.uk
http://home.uchicago.edu/~nbloch
[Quoted reply to Zachary Davies's comment of Jun 6, 2017, 7:36 PM; the quoted text repeats his second comment above.]
Good luck. Feel free to reach out to Boris or myself if you have questions.
Hi,
I am very new to machine learning so I apologize if this is a basic question. I am attempting to run blkbox on RNAseq data. I have sequenced the transcriptome from two different treatments in multiple lines. I would like to use the data from one of the lines (LB) as training to identify genes with different expression patterns between treatments, and the data from another line (Wildtype) as holdout to run the model.
Basically, my data is normalized expression for a subset of genes in 3 samples for treatment ATT and 3 samples for treatment OPT.
When I try to run blkbox I get the following error
model_1 <- blkbox(LB_data, LB_labels, WT_data, WT_labels)
Do you have any idea what this error means? Is my sample size too small for this type of analysis? Or is it due to some genes having 0 expression values?
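On the question of zero expression values, a quick check for genes that are all zero or constant across samples can be done in base R before modelling. The toy table below stands in for the real data (samples in rows, genes in columns) and its values are made up. Whether such genes caused the error here is not known, since the error message is not shown.

```r
# Toy table standing in for the real data: 6 samples (rows) x 4 genes (columns)
expr <- data.frame(
  geneA = c(5.1, 4.8, 5.3, 2.0, 2.2, 1.9),
  geneB = c(0, 0, 0, 0, 0, 0),               # all zeros
  geneC = c(1.5, 1.5, 1.5, 1.5, 1.5, 1.5),   # constant, non-zero
  geneD = c(0, 3.2, 0, 1.1, 0, 2.7)          # some zeros, but still variable
)

# Flag genes that are zero in every sample, and genes with no variance at all
all_zero <- vapply(expr, function(x) all(x == 0), logical(1))
constant <- vapply(expr, function(x) var(x) == 0, logical(1))

names(expr)[all_zero]   # "geneB"
names(expr)[constant]   # "geneB" "geneC"

# Keep only genes that vary across samples
expr_filtered <- expr[, !constant, drop = FALSE]
```

A gene with a few zeros (like geneD) still carries information; only genes that never vary are uninformative for a classifier.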
I copied my full code below and attached the data. I apologize if the code is not very efficient, I am fairly new at this.
TEL_AvsU_DEgenes_xblkbox.txt