alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
90 stars 51 forks source link

Invalid 'times' argument after simulate_experiment - works with subsetted dataset #65

Closed orangebromeliad closed 5 years ago

orangebromeliad commented 5 years ago

Hi team,

I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(fasta) / 100)

fold_changes = matrix(1,ncol=2, nrow=5254)

# simulation call:
simulate_experiment(fasta  = fasta_file,
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_twosample') 
Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument
In addition: Warning messages:
1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced
2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) :
  NAs produced

However when I use a subset of twenty transcripts the package appears to work without producing any errors:

# FASTA annotation
fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta'
fasta = readDNAStringSet(fasta_file)

# subset the FASTA file to first 20 transcripts
small_fasta = fasta[1:20]
writeXStringSet(small_fasta, 'berghei_small.fa')

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
readspertx = round(2 * width(small_fasta) / 100)

transcript.number = count_transcripts('berghei_small.fa')
fold_changes = matrix(1, ncol = 2, nrow=20)

# simulation call:
simulate_experiment(fasta  = 'berghei_small.fa',
                    reads_per_transcript=readspertx,
                    fold_changes=fold_changes,
                    num_reps=c(1,1),
                    paired = T,
                    reportCoverage = T,
                    readlen = 100,
                    outdir='polyester_pberghei_simulated_reads_small') 

I don't believe that this is a duplicate of the earlier closed issue: https://github.com/alyssafrazee/polyester/issues/8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.

Hope that this helps. Sessioninfo is below too:

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SGSeq_1.16.2                Rsamtools_1.34.1            Biostrings_2.50.2           XVector_0.22.0             
 [5] polyester_1.18.0            tibble_2.1.1                gridExtra_2.3               tidyselect_0.2.5           
 [9] readxl_1.3.1                data.table_1.12.0           scales_1.0.0                RColorBrewer_1.1-2         
[13] pheatmap_1.0.12             DESeq2_1.22.2               SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
[17] BiocParallel_1.16.6         matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0       
[21] GenomeInfoDb_1.18.2         IRanges_2.16.0              S4Vectors_0.20.1            BiocGenerics_0.28.0        
[25] ggplot2_3.1.0               tidyr_0.8.3                 dplyr_0.8.0.1              

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7              httr_1.4.0               progress_1.2.0          
 [5] tools_3.5.2              backports_1.1.3          utf8_1.1.4               R6_2.4.0                
 [9] rpart_4.1-13             Hmisc_4.2-0              DBI_1.0.0                lazyeval_0.2.2          
[13] colorspace_1.4-1         nnet_7.3-12              withr_2.1.2              prettyunits_1.0.2       
[17] bit_1.1-14               compiler_3.5.2           cli_1.1.0                htmlTable_1.13.1        
[21] rtracklayer_1.42.2       labeling_0.3             checkmate_1.9.1          genefilter_1.64.0       
[25] stringr_1.4.0            digest_0.6.18            foreign_0.8-71           base64enc_0.1-3         
[29] pkgconfig_2.0.2          htmltools_0.3.6          limma_3.38.3             htmlwidgets_1.3         
[33] rlang_0.3.2              rstudioapi_0.10          RSQLite_2.1.1            acepack_1.4.1           
[37] RCurl_1.95-4.12          magrittr_1.5             GenomeInfoDbData_1.2.0   Formula_1.2-3           
[41] Matrix_1.2-16            Rcpp_1.0.1               munsell_0.5.0            fansi_0.4.0             
[45] stringi_1.4.3            logspline_2.1.12         yaml_2.2.0               zlibbioc_1.28.0         
[49] plyr_1.8.4               grid_3.5.2               blob_1.1.1               crayon_1.3.4            
[53] lattice_0.20-38          splines_3.5.2            GenomicFeatures_1.34.6   annotate_1.60.1         
[57] hms_0.4.2                locfit_1.5-9.1           knitr_1.22               pillar_1.3.1            
[61] igraph_1.2.4             RUnit_0.4.32             biomaRt_2.38.0           geneplotter_1.60.0      
[65] reshape2_1.4.3           XML_3.98-1.19            glue_1.3.1               latticeExtra_0.6-28     
[69] BiocManager_1.30.4       cellranger_1.1.0         gtable_0.3.0             purrr_0.3.2             
[73] assertthat_0.2.1         xfun_0.5                 xtable_1.8-3             survival_2.43-3         
[77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0     memoise_1.1.0            cluster_2.0.7-1 

Best, Tim L.

alyssafrazee commented 5 years ago

My guess here is that some values of readspertx are zero or NA (but that there aren't any zeros/NAs in the first 20 transcripts / small simulation). Sadly this is a known limitation: https://github.com/alyssafrazee/polyester/issues/28. However, you can work around this by either removing the transcripts you don't want to simulate from, or replacing all the 0 counts with some small number (like 1) to approximate the simulation. Hope this helps!

On Tue, Mar 26, 2019 at 4:39 AM orangebromeliad notifications@github.com wrote:

Hi team,

I'm trying to simulate RNAseq data for the malaria parasite P. berghei. The transcript file has 5254 transcripts, and I am trying to simulate two groups each with one replicate and no log fold changes between the two, for now. The following script does not work, producing the error reproduced below:

FASTA annotation

fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file)

~20x coverage ----> reads per transcript = transcriptlength/readlength * 20

here all transcripts will have ~equal FPKM

readspertx = round(2 * width(fasta) / 100)

fold_changes = matrix(1,ncol=2, nrow=5254)

simulation call:

simulate_experiment(fasta = fasta_file, reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_twosample')

Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument In addition: Warning messages: 1: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced 2: In rnbinom(n = length(basemeans), mu = basemeans, size = size) : NAs produced

However when I use a subset of twenty transcripts the package appears to work without producing any errors:

FASTA annotation

fasta_file = '/Users/littlet/Downloads/PlasmoDB-42_PbergheiANKA_AnnotatedTranscripts.fasta' fasta = readDNAStringSet(fasta_file)

subset the FASTA file to first 20 transcripts

small_fasta = fasta[1:20] writeXStringSet(small_fasta, 'berghei_small.fa')

~20x coverage ----> reads per transcript = transcriptlength/readlength * 20

here all transcripts will have ~equal FPKM

readspertx = round(2 * width(small_fasta) / 100)

transcript.number = count_transcripts('berghei_small.fa') fold_changes = matrix(1, ncol = 2, nrow=20)

simulation call:

simulate_experiment(fasta = 'berghei_small.fa', reads_per_transcript=readspertx, fold_changes=fold_changes, num_reps=c(1,1), paired = T, reportCoverage = T, readlen = 100, outdir='polyester_pberghei_simulated_reads_small')

I don't believe that this is a duplicate of the earlier closed issue: #8 https://github.com/alyssafrazee/polyester/issues/8 since I'm quite confident that the readspertx (reads per transcript) and fold_changes objects are the right dimensions. Hence I have been unable to find what the issue is.

Hope that this helps. Sessioninfo is below too:

R version 3.5.2 (2018-12-20) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Sierra 10.12.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] SGSeq_1.16.2 Rsamtools_1.34.1 Biostrings_2.50.2 XVector_0.22.0 [5] polyester_1.18.0 tibble_2.1.1 gridExtra_2.3 tidyselect_0.2.5 [9] readxl_1.3.1 data.table_1.12.0 scales_1.0.0 RColorBrewer_1.1-2 [13] pheatmap_1.0.12 DESeq2_1.22.2 SummarizedExperiment_1.12.0 DelayedArray_0.8.0 [17] BiocParallel_1.16.6 matrixStats_0.54.0 Biobase_2.42.0 GenomicRanges_1.34.0 [21] GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 [25] ggplot2_3.1.0 tidyr_0.8.3 dplyr_0.8.0.1

loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_0.9-7 httr_1.4.0 progress_1.2.0 [5] tools_3.5.2 backports_1.1.3 utf8_1.1.4 R6_2.4.0 [9] rpart_4.1-13 Hmisc_4.2-0 DBI_1.0.0 lazyeval_0.2.2 [13] colorspace_1.4-1 nnet_7.3-12 withr_2.1.2 prettyunits_1.0.2 [17] bit_1.1-14 compiler_3.5.2 cli_1.1.0 htmlTable_1.13.1 [21] rtracklayer_1.42.2 labeling_0.3 checkmate_1.9.1 genefilter_1.64.0 [25] stringr_1.4.0 digest_0.6.18 foreign_0.8-71 base64enc_0.1-3 [29] pkgconfig_2.0.2 htmltools_0.3.6 limma_3.38.3 htmlwidgets_1.3 [33] rlang_0.3.2 rstudioapi_0.10 RSQLite_2.1.1 acepack_1.4.1 [37] RCurl_1.95-4.12 magrittr_1.5 GenomeInfoDbData_1.2.0 Formula_1.2-3 [41] Matrix_1.2-16 Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 [45] stringi_1.4.3 logspline_2.1.12 yaml_2.2.0 zlibbioc_1.28.0 [49] plyr_1.8.4 grid_3.5.2 blob_1.1.1 crayon_1.3.4 [53] lattice_0.20-38 splines_3.5.2 GenomicFeatures_1.34.6 annotate_1.60.1 [57] hms_0.4.2 locfit_1.5-9.1 knitr_1.22 pillar_1.3.1 [61] igraph_1.2.4 RUnit_0.4.32 biomaRt_2.38.0 geneplotter_1.60.0 [65] reshape2_1.4.3 XML_3.98-1.19 glue_1.3.1 latticeExtra_0.6-28 [69] BiocManager_1.30.4 cellranger_1.1.0 gtable_0.3.0 purrr_0.3.2 [73] assertthat_0.2.1 xfun_0.5 xtable_1.8-3 survival_2.43-3 [77] GenomicAlignments_1.18.1 AnnotationDbi_1.44.0 memoise_1.1.0 cluster_2.0.7-1

Best, Tim L.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/alyssafrazee/polyester/issues/65, or mute the thread https://github.com/notifications/unsubscribe-auth/AB6e2Ct3PNF5h3esJ4Wp730AruYljTrmks5vagbdgaJpZM4cLObX .

orangebromeliad commented 5 years ago

Hi Alyssa,

Yes you are right; it was because of the use of the round function when calculating the reads-per-transcript, if I remove it then the problem goes away. I'm using a genome which has some very short transcripts which were reduced to 0.5 or smaller reads per transcript by the calculation and round was turning this to zero.

Would it be worth inserting an error into the code to check for this an inform of the problem? Something like:

if(any(reads_per_transcript <= 0)){error('Reads_per_transcript contains zero or negative values')} ?

Best wishes and thank you for your help, Tim,