alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015

simulate_experiment discrepancy in basemean values for simulated data #80

Open vavouri-lab opened 2 years ago

vavouri-lab commented 2 years ago

Hi there,

I have a question, and potentially a bug report, about the simulate_experiment function.

I am using polyester to simulate RNA-seq data for an experiment with 2 conditions and 2 replicates per condition. My inputs are a set of features (GTF file), a genome (fasta file), a vector of expression values (in the same order as the features in the GTF file) and a matrix of fold changes (in the same order as the GTF file and the expression vector):

simulate_experiment( seqpath = "../data/mygenomedir/", reads_per_transcript = mymeanexpressionvalues, fold_changes = myfoldchangematrix, feature = "exon", gtf = "../data/mytest.gtf", transcriptid = myfeatureIDs, num_reps = c(2,2) )

Polyester seems to run fine, but the features whose expression changes between conditions are not the ones I set to change. This looks like a bug to me, but I am also unsure whether I am using polyester correctly. Specifically, other than providing the expression values and fold_changes in the same order as the features in the GTF file, I don't see how to specify which features should be differentially expressed in my simulated experiment.

Looking through the R code, I see that simulate_experiment() calls seq_gtf(), which extracts the sequences of the features in the GTF file. On line 114 of seq_gtf, R's split function reorders the list of features into alphanumeric order. I think this reordering causes the expression and fold change values to be assigned to different features from the ones I intended. Is this a bug, or have I misunderstood how to assign expression values to specific features?
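To illustrate the reordering I suspect, here is a minimal base-R sketch. It does not call polyester; the feature IDs, exon indices, expression values and fold changes are all made up, and whether seq_gtf behaves exactly like this is my assumption. It shows that split() returns groups in sorted order of the IDs rather than in the order they first appear, and how values supplied in GTF order could be matched back by name.

```r
# Hypothetical feature IDs in the order they appear in a GTF file
# (deliberately not alphabetical); one entry per exon line.
gtf_ids  <- c("geneC", "geneA", "geneB", "geneC", "geneA")
exon_idx <- seq_along(gtf_ids)

# split() coerces a character grouping vector to a factor, and factor
# levels are sorted, so the groups come back in alphanumeric order.
by_feature <- split(exon_idx, gtf_ids)
names(by_feature)
# "geneA" "geneB" "geneC"

# Fixing the factor levels to first-appearance order keeps the GTF order.
by_feature_gtf_order <- split(exon_idx, factor(gtf_ids, levels = unique(gtf_ids)))
names(by_feature_gtf_order)
# "geneC" "geneA" "geneB"

# Made-up per-feature inputs, supplied in GTF order and named by feature ID.
expr <- c(geneC = 100, geneA = 10, geneB = 50)
fc   <- matrix(c(1, 4, 1,
                 1, 1, 1),
               ncol = 2,
               dimnames = list(c("geneC", "geneA", "geneB"),
                               c("group1", "group2")))

# Reordering by name lines the inputs up with the sorted feature list.
expr_sorted <- expr[names(by_feature)]
fc_sorted   <- fc[names(by_feature), , drop = FALSE]
```

If the sorted order is what polyester uses internally, passing expr_sorted and fc_sorted instead of the GTF-ordered inputs might work around the mismatch, but I have not confirmed that.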

Thanks, Tanya