alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
89 stars 51 forks source link

Error with running simulate experiment for real human transcriptome #21

Closed vd4mmind closed 3 years ago

vd4mmind commented 9 years ago

Hi,

I am trying to generate simulate reads from the human trascriptome but i am getting error with running the simulate_experiment. I am not sure where the error is. I calculated the length of the input transcript and it is more than the input reads_per_transcript and the input fold_change file am supplying. However am subsetting my transcript file to number of rows in the fold_change file and the reads_per_transcript file. I am not sure where the error lies since. Below is my code and the error followed, can you let me know how to get rid of this.

library(polyester) library(Biostrings) load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE) fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester') count_transcripts(fasta_file) ## number of transcript is in this file is 50221 length(reads_per_transcript) length(foldchanges) ## 49628 dim(foldchanges) fasta = readDNAStringSet(fasta_file) small_fasta = fasta[1:49628] writeXStringSet(small_fasta, 'trancriptome_idx.fa') readspertx = reads_per_transcript simulate_experiment('trancriptome_idx.fa', reads_per_transcript=readspertx, num_reps=c(4,4), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges, outdir='simulated_reads')

Error Error in simulate_experiment("trancriptome_idx.fa", reads_per_transcript = 300, : fold_changes must be a vector with one entry per transcript in fasta or gtf; use count_transcripts to find out how many transcripts are in fasta or gtf.

I did a traceback

2: stop(.makepretty("fold_changes must be a vector with one entry per\n            transcript in fasta or gtf; use count_transcripts to find out how\n            many transcripts are in fasta or gtf."))
1: simulate_experiment("trancriptome_idx.fa", reads_per_transcript = 300, 
       num_reps = c(4, 4), readlen = 100, paired = TRUE, distr = "empirical", 
       error_model = "illumina5", bias = "rnaf", fold_changes = foldchanges, 
       outdir = "simulated_reads")```

I believe we can provide the reads per transcript and also the fold changes as I did. So where is it getting wrong. Can you tell me?
vd4mmind commented 9 years ago

I modified the code a bit to make my small_fasta have the same transcripts and same length as in the fold_change and in the reads_per_transcript but still I run with the same error. Do I need to convert the folchange file in matrix format?

load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE) fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester') count_transcripts(fasta_file) ## number of transcript is in this file is 50221 length(reads_per_transcript) length(foldchanges) dim(foldchanges) fasta = readDNAStringSet(fasta_file)

to get rid of the error the transcript.idx.fa should have th same transcripts in fa, foldchanges and in reads_per_transcript files

sum(row.names(foldchanges) %in% names(fasta)) # 49273 foldchanges2 <- foldchanges[which(row.names(foldchanges) %in% names(fasta)),] dim(foldchanges2) #[1] 49273 8 small_fasta <- fasta[row.names(foldchanges2)] rpt2 <- reads_per_transcript[row.names(foldchanges2)] length(rpt2) ## 49273 writeXStringSet(small_fasta, 'trancriptome_idx.fa') readspertx = rpt2 simulate_experiment('trancriptome_idx.fa', reads_per_transcript=readspertx, num_reps=c(4,4), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges2, outdir='simulated_reads')

Can you tell me why am still getting the same error now? Error Error in simulate_experiment("trancriptome_idx.fa", reads_per_transcript = readspertx, : fold_changes must be a vector with one entry per transcript in fasta or gtf; use count_transcripts to find out how many transcripts are in fasta or gtf.

vd4mmind commented 9 years ago

Hi @alyssafrazee I made the changes in my code and removed the transcripts which had 0 reads and fashioned my foldchange vector for the same transcripts that were there in the reads_per_transcript file. Then I reduced my fasta file and am keeping the same transcripts with the sequences that are there in the foldchanges file and the reads_per_transcripts file. Still I run into the error. Below is the code I made for making the above changes and it is followed by the erorr and the traceback. Can you let me know where am getting wrong.

library(polyester) library(Biostrings) load("/scratch/GT/vdas/benchmark/dataForSimulation.Rdata",verbose=TRUE) fasta_file = system.file('extdata', 'transcriptome.idx.fa', package='polyester') count_transcripts(fasta_file)## number of transcript is in this file is 50221 length(reads_per_transcript) length(foldchanges) dim(foldchanges) fasta = readDNAStringSet(fasta_file)

to get rid of the error the transcript.idx.fa should have th same transcripts in fa, foldchanges and in reads_per_transcript files

49273

rpt1<-reads_per_transcript[reads_per_transcript!=0] doldchanges1<-foldchanges[names(rpt1),] g <- names(rpt1) g <- g[which(g %in% names(fasta))] length(g) # [1] 18631 dim(doldchanges1)# [1] 18675 8 rpt1 <- rpt1[g] foldchanges1 <- foldchanges[g,] fasta2 <- fasta[g] dim(foldchanges1) # [1] 18631 8 length(rpt1) # [1] 18631 writeXStringSet(fasta2, 'trancriptome_clean_idx.fa') fcsample5<-foldchanges1$sample5 readspertx = rpt1 simulate_experiment('trancriptome_clean_idx.fa', reads_per_transcript=readspertx, num_reps=c(1,1), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=fcsample5, outdir='simulated_reads')

Below is the error Error in .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") : subscript out of bounds In addition: Warning message: In simulate_experiment("trancriptome_clean_idx.fa", reads_per_transcript = readspertx, : provided fold changes mean that one group will have more overall expression than the other, so some of the DE signal might be lost due to library size adjustment.

traceback() 6: .Call(.NAME, ..., PACKAGE = PACKAGE) 5: .Call2("XStringSet_unlist", x, PACKAGE = "Biostrings") 4: Biostrings::unlist(tFrags) 3: Biostrings::unlist(tFrags) 2: add_error(rctFrags, error_rate) 1: simulate_experiment("trancriptome_clean_idx.fa", reads_per_transcript = readspertx, num_reps = c(1, 1), readlen = 100, paired = TRUE, distr = "empirical", error_model = "illumina5", bias = "rnaf", fold_changes = fcsample5, outdir = "simulated_reads")

Can you tell me why is this error happening now? This problem is from the package Biostrings? But my transcript fasta, reads_per_transcript and foldchange file are all in the same order of fashion where the transcript ids are in same order.

alyssafrazee commented 9 years ago

Thanks for these reports. I'm traveling this week but will look into debugging when I get a chance.

vd4mmind commented 9 years ago

Sure , thanks a lot for the reply.

jtleek commented 9 years ago

Are the files small enough that you could post a small example of your reads_per_transcript and your fold_changes along with your fasta file? That way I can try to debug. I wonder if it is because the reads_per_transcript is the wrong class?

Best

Jeff

alyssafrazee commented 9 years ago

Echoing Jeff's comment above: I don't see anything obviously wrong with the code you provided, so we might need to actually see your fasta file and whatever is in reads_per_transcript to debug further. If you are able to share, that would be helpful! Could you also please post the output of sessionInfo()?

vd4mmind commented 9 years ago

Sorry I was away this weekend so could not reply. What I understood debugging on my own is one can only put a vector of FC changes for a group to induce the differential expression rather than for both the groups. Then the variability can be induced for groups but not within replicates. But that is fine I generated one vs one fasta files. The reads per transcript cannot be 0 in the count file of the reads_per_transcript . For every sequences having a length should have counts , there cannot be a transcript with 0 counts of read in the reads_per_transcript file. So I have to do away with the reads_per_transcript file keeping it empty and use the meanmodel=TRUE parameter. Below is the piece of code I have used.

length(fasta2[[1]]) #  [1] 1255 
lengths <- lapply(fasta2,FUN=length)
min(lengths)
sapply(head(fasta2),FUN=length)
lengths <- sapply(fasta2,FUN=length)
min(lengths)``` # [1] 168
w <- which(lengths > 500)
fasta3 <- fasta2[w]
foldchanges3 <- foldchanges1[w]
foldchanges3 <- foldchanges1[w,]
rpt3 <- rpt1[w]
writeXStringSet(fasta3, 'trancriptome_clean_idx.fa')
simulate_experiment('trancriptome_clean_idx.fa', reads_per_transcript=, num_reps=c(1,1), readlen=100, paired=TRUE, distr="empirical", error_model="illumina5", bias="rnaf", fold_changes=foldchanges3$sample5, outdir='simulated_reads',meanmodel=TRUE)

The above code finally works, still if I could use the reads_per_transcript file with my fold_changes I would be happy to do that, below is the dropbox links for both the transcriptome file and the .Rdata file having both reads_per_transcript and the fold_changes

(https://drive.google.com/open?id=0B0Yo3SO6jsGGfmFPSE4wWDdtNk5NTzY0d0pMOXBqcnNMUGV3eU0wQktlek10VzJhT2ItaVE)

(https://drive.google.com/folderview?id=0B0Yo3SO6jsGGfmFPSE4wWDdtNk5NTzY0d0pMOXBqcnNMUGV3eU0wQktlek10VzJhT2ItaVE&usp=sharing)

Please let me know if you can download the data else you can also try from below links

(https://drive.google.com/file/d/0B0Yo3SO6jsGGb2tlU1BuTlh6SmM/view?usp=sharing)

(https://drive.google.com/file/d/0B0Yo3SO6jsGGbXZKMWNucE5kVzg/view?usp=sharing)

Once you have downloaded please notify me I will remove them. Thanks

vd4mmind commented 9 years ago

Hi @jtleek , Sorry to barge in , if you could let me know if you can use the data I hosted , then I would take it down, thanks? Once you have downloaded it and worked upon it , please let me know your opinions for the above stated problem, thanks a lot.

alyssafrazee commented 9 years ago

Feel free to take the links down, we got what we need to debug. Thanks!

vd4mmind commented 8 years ago

Do we have any information?

vd4mmind commented 3 years ago

Closing it as I just noticed this has been opened over 5 years now.