Closed orangebromeliad closed 5 years ago
My guess here is that some values of readspertx
are zero or NA (but that
there aren't any zeros/NAs in the first 20 transcripts / small simulation).
Sadly this is a known limitation:
https://github.com/alyssafrazee/polyester/issues/28. However, you can work
around this by either removing the transcripts you don't want to simulate
from, or replacing all the 0 counts with some small number (like 1) to
approximate the simulation. Hope this helps!
On Tue, Mar 26, 2019 at 4:39 AM orangebromeliad notifications@github.com wrote:
Hi team,
I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:
FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file)
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
here all transcripts will have ~equal FPKM
readspertx = round(2 * width(fasta) / 100)
fold_changes = matrix(1,ncol=2, nrow=5254)
simulation call:
simulate_experiment(fasta = fasta_file, reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_twosample')
Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument In addition: Warning messages: 1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced 2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced
However when I use a subset of twenty transcripts the package appears to work without producing any errors:
FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file)
subset the FASTA file to first 20 transcripts
small_fasta = fasta[1:20] writeXStringSet(small_fasta, 'berghei_small.fa')
~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
here all transcripts will have ~equal FPKM
readspertx = round(2 * width(small_fasta) / 100)
transcript.number = count_transcripts('berghei_small.fa') fold_changes = matrix(1, ncol = 2, nrow=20)
simulation call:
simulate_experiment(fasta = 'berghei_small.fa', reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_small')
I don't believe that this is a duplicate of the earlier closed issue: #8 https://github.com/alyssafrazee/polyester/issues/8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.
Hope that this helps. Sessioninfo is below too:
R version 3.5.2 (2018-12-20) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6
Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] SGSeq_1.16.2 Rsamtools_1.34.1 Biostrings_2.50.2 XVector_0.22.0 [5] polyester_1.18.0 tibble_2.1.1 gridExtra_2.3 tidyselect_0.2.5 [9] readxl_1.3.1 data.table_1.12.0 scales_1.0.0 RColorBrewer_1.1-2 [13] pheatmap_1.0.12 DESeq2_1.22.2 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 [17] BiocParallel_1.16.6 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 [21] GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 [25] ggplot2_3.1.0 tidyr_0.8.3 dplyr_0.8.0.1
loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_0.9-7 httr_1.4.0 progress_1.2.0 [5] tools_3.5.2 backports_1.1.3 utf8_1.1.4 R6_2.4.0 [9] rpart_4.1-13 Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2 [13] colorspace_1.4-1 nnet_7.3-12 withr_2.1.2 prettyunits_1.0.2 [17] bit_1.1-14 compiler_3.5.2 cli_1.1.0 htmlTable_1.13.1 [21] rtracklayer_1.42.2 labeling_0.3 checkmate_1.9.1 genefilter_1.64.0 [25] stringr_1.4.0 digest_0.6.18 foreign_0.8-71 base64enc_0.1-3 [29] pkgconfig_2.0.2 htmltools_0.3.6 limma_3.38.3 htmlwidgets_1.3 [33] rlang_0.3.2 rstudioapi_0.10 RSQLite_2.1.1 acepack_1.4.1 [37] RCurl_1.95-4.12 magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3 [41] Matrix_1.2-16 Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 [45] stringi_1.4.3 logspline_2.1.12 yaml_2.2.0 zlibbioc_1.28.0 [49] plyr_1.8.4 grid_3.5.2 blob_1.1.1 crayon_1.3.4 [53] lattice_0.20-38 splines_3.5.2 GenomicFeatures_1.34.6 annotate_1.60.1 [57] hms_0.4.2 locfit_1.5-9.1 knitr_1.22 pillar_1.3.1 [61] igraph_1.2.4 RUnit_0.4.32 biomaRt_2.38.0 geneplotter_1.60.0 [65] reshape2_1.4.3 XML_3.98-1.19 glue_1.3.1 latticeExtra_0.6-28 [69] BiocManager_1.30.4 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.2 [73] assertthat_0.2.1 xfun_0.5 xtable_1.8-3 survival_2.43-3 [77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0 memoise_1.1.0 cluster_2.0.7-1
Best, Tim L.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/alyssafrazee/polyester/issues/65, or mute the thread https://github.com/notifications/unsubscribe-auth/AB6e2Ct3PNF5h3esJ4Wp730AruYljTrmks5vagbdgaJpZM4cLObX .
Hi Alyssa,
Yes you are right; it was because of the use of the round
function when calculating the reads-per-transcript, if I remove it then the problem goes away. I'm using a genome which has some very short transcripts which were reduced to 0.5 or smaller reads per transcript by the calculation and round was turning this to zero.
Would it be worth inserting an error into the code to check for this an inform of the problem? Something like:
if(any(reads_per_transcript <= 0)){error('Reads_per_transcript contains zero or negative values')}
?
Best wishes and thank you for your help, Tim,
Hi team,
I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:
However when I use a subset of twenty transcripts the package appears to work without producing any errors:
I don't believe that this is a duplicate of the earlier closed issue: https://github.com/alyssafrazee/polyester/issues/8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.
Hope that this helps. Sessioninfo is below too:
Best, Tim L.