Closed vd4mmind closed 3 years ago
I modified the code a bit to make my small_fasta
have the same transcripts and same length as in the fold_change
and in the reads_per_transcript
but still I run with the same error. Do I need to convert the folchange file in matrix format?
load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE)
fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester')
count_transcripts(fasta_file)
## number of transcript is in this file is 50221
length(reads_per_transcript)
length(foldchanges)
dim(foldchanges)
fasta = readDNAStringSet(fasta_file)
sum(row.names(foldchanges) %in% names(fasta))
# 49273
foldchanges2 <- foldchanges[which(row.names(foldchanges) %in% names(fasta)),]
dim(foldchanges2)
#[1] 49273 8
small_fasta <- fasta[row.names(foldchanges2)]
rpt2 <- reads_per_transcript[row.names(foldchanges2)]
length(rpt2)
## 49273
writeXStringSet(small_fasta, 'trancriptome_idx.fa')
readspertx = rpt2
simulate_experiment('trancriptome_idx.fa', reads_per_transcript=readspertx, num_reps=c(4,4), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges2, outdir='simulated_reads')
Can you tell me why am still getting the same error now?
Error
Error in simulate_experiment("trancriptome_idx.fa", reads_per_transcript = readspertx, : fold_changes must be a vector with one entry per transcript in fasta or gtf; use count_transcripts to find out how many transcripts are in fasta or gtf.
Hi @alyssafrazee I made the changes in my code and removed the transcripts which had 0 reads and fashioned my foldchange vector for the same transcripts that were there in the reads_per_transcript file. Then I reduced my fasta file and am keeping the same transcripts with the sequences that are there in the foldchanges file and the reads_per_transcripts file. Still I run into the error. Below is the code I made for making the above changes and it is followed by the erorr and the traceback. Can you let me know where am getting wrong.
library(polyester)
library(Biostrings)
load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE) fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester')
count_transcripts(fasta_file)
## number of transcript is in this file is 50221
length(reads_per_transcript)
length(foldchanges)
dim(foldchanges)
fasta = readDNAStringSet(fasta_file)
rpt1<-reads_per_transcript[reads_per_transcript!=0]
doldchanges1<-foldchanges[names(rpt1),]
g <- names(rpt1)
g <- g[which(g %in% names(fasta))]
length(g)
# [1] 18631
dim(doldchanges1)
# [1] 18675 8
rpt1 <- rpt1[g]
foldchanges1 <- foldchanges[g,]
fasta2 <- fasta[g]
dim(foldchanges1)
# [1] 18631 8
length(rpt1)
# [1] 18631
writeXStringSet(fasta2, 'trancriptome_clean_idx.fa')
fcsample5<-foldchanges1$sample5
readspertx = rpt1
simulate_experiment('trancriptome_clean_idx.fa', reads_per_transcript=readspertx, num_reps=c(1,1), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=fcsample5, outdir='simulated_reads')
Below is the error Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : subscript out of bounds In addition: Warning message: In simulate_experiment("trancriptome_clean_idx.fa", reads_per_transcript = readspertx, : provided fold changes mean that one group will have more overall expression than the other, so some of the DE signal might be lost due to library size adjustment.
traceback()
6: .Call(.NAME, ..., PACKAGE = PACKAGE)
5: .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings")
4: Biostrings::unlist(tFrags)
3: Biostrings::unlist(tFrags)
2: add_error(rctFrags, error_rate)
1: simulate_experiment("trancriptome_clean_idx.fa", reads_per_transcript = readspertx,
num_reps = c(1, 1), readlen = 100, paired = TRUE, distr = "empirical",
error_model = "illumina5", bias = "rnaf", fold_changes = fcsample5,
outdir = "simulated_reads")
Can you tell me why is this error happening now? This problem is from the package Biostrings? But my transcript fasta, reads_per_transcript and foldchange file are all in the same order of fashion where the transcript ids are in same order.
Thanks for these reports. I'm traveling this week but will look into debugging when I get a chance.
Sure , thanks a lot for the reply.
Are the files small enough that you could post a small example of your reads_per_transcript
and your fold_changes
along with your fasta file? That way I can try to debug. I wonder if it is because the reads_per_transcript
is the wrong class?
Best
Jeff
Echoing Jeff's comment above: I don't see anything obviously wrong with the code you provided, so we might need to actually see your fasta file and whatever is in reads_per_transcript
to debug further. If you are able to share, that would be helpful! Could you also please post the output of sessionInfo()
?
Sorry I was away this weekend so could not reply. What I understood debugging on my own is one can only put a vector of FC changes for a group to induce the differential expression rather than for both the groups. Then the variability can be induced for groups but not within replicates. But that is fine I generated one vs one fasta files. The reads per transcript cannot be 0 in the count file of the reads_per_transcript
. For every sequences having a length should have counts , there cannot be a transcript with 0 counts of read in the reads_per_transcript
file. So I have to do away with the reads_per_transcript
file keeping it empty and use the meanmodel=TRUE
parameter. Below is the piece of code I have used.
length(fasta2[[1]]) # [1] 1255
lengths <- lapply(fasta2,FUN=length)
min(lengths)
sapply(head(fasta2),FUN=length)
lengths <- sapply(fasta2,FUN=length)
min(lengths)``` # [1] 168
w <- which(lengths > 500)
fasta3 <- fasta2[w]
foldchanges3 <- foldchanges1[w]
foldchanges3 <- foldchanges1[w,]
rpt3 <- rpt1[w]
writeXStringSet(fasta3, 'trancriptome_clean_idx.fa')
simulate_experiment('trancriptome_clean_idx.fa', reads_per_transcript=, num_reps=c(1,1), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges3$sample5, outdir='simulated_reads',meanmodel=TRUE)
The above code finally works, still if I could use the reads_per_transcript file with my fold_changes I would be happy to do that, below is the dropbox links for both the transcriptome file and the .Rdata file having both reads_per_transcript
and the fold_changes
Please let me know if you can download the data else you can also try from below links
(https://drive.google.com/file/d/0B0Yo3SO6jsGGb2tlU1BuTlh6SmM/view?usp=sharing)
(https://drive.google.com/file/d/0B0Yo3SO6jsGGbXZKMWNucE5kVzg/view?usp=sharing)
Once you have downloaded please notify me I will remove them. Thanks
Hi @jtleek , Sorry to barge in , if you could let me know if you can use the data I hosted , then I would take it down, thanks? Once you have downloaded it and worked upon it , please let me know your opinions for the above stated problem, thanks a lot.
Feel free to take the links down, we got what we need to debug. Thanks!
Do we have any information?
Closing it as I just noticed this has been opened over 5 years now.
Hi,
I am trying to generate simulate reads from the human trascriptome but i am getting error with running the
simulate_experiment
. I am not sure where the error is. I calculated the length of the input transcript and it is more than the inputreads_per_transcript
and the inputfold_change
file am supplying. However am subsetting my transcript file to number of rows in thefold_change
file and the reads_per_transcript file. I am not sure where the error lies since. Below is my code and the error followed, can you let me know how to get rid of this.library(polyester)
library(Biostrings)
load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE)
fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester')
count_transcripts(fasta_file)
## number of transcript is in this file is 50221length(reads_per_transcript)
length(foldchanges)
## 49628dim(foldchanges)
fasta = readDNAStringSet(fasta_file)
small_fasta = fasta[1:49628]
writeXStringSet(small_fasta, 'trancriptome_idx.fa')
readspertx = reads_per_transcript
simulate_experiment('trancriptome_idx.fa', reads_per_transcript=readspertx, num_reps=c(4,4), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges, outdir='simulated_reads')
Error
Error in simulate_experiment("trancriptome_idx.fa", reads_per_transcript = 300, : fold_changes must be a vector with one entry per transcript in fasta or gtf; use count_transcripts to find out how many transcripts are in fasta or gtf.
I did a traceback