alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
89 stars 51 forks source link

"negative length vectors are not allowed" ? #41

Closed ireneg closed 7 years ago

ireneg commented 7 years ago

Hi guys,

I'm trying to simulate sequence under a bunch of different conditions. The vast majority of them work, but I've been hitting an uninformative error message with others:

extraVars <- commandArgs(trailingOnly=T)
#extraVars <- c("ortho_FALSE", 0.93, 50) # for testing only

library(polyester)
library(Biostrings)

setwd("~/orthoexon/simNGS/hg38_macFas5")

# FASTA annotation
hsFastaPath <- paste0(getwd(), "/fasta/hg38_ensembl84_orthoexon_", extraVars[2], "pc_", extraVars[1], ".fa")
hsFasta <- readDNAStringSet(hsFastaPath)

# ~20x coverage ----> reads per transcript = transcriptlength/readlength * 20
# here all transcripts will have ~equal FPKM
hsDepth <- round(20 * width(hsFasta) / 100)

# Simulate humans:
dir.create(file.path(paste0(getwd(), "/polyester/", extraVars[1], "_", extraVars[2], "_", extraVars[3], "bp/hg38")), showWarnings = F, recursive=T)
    simulate_experiment(hsFastaPath, reads_per_transcript=hsDepth, num_reps=10, fold_changes=matrix(1, nrow=length(hsFasta)), outdir=paste0(getwd(), "/polyester/", extraVars[1], "_", extraVars[2], "_", extraVars[3], "bp/hg38"), readlen=as.numeric(extraVars[3]), paired=F, seed=round(11051984*as.numeric(extraVars[2])))

Often times, simulate_experiment fails fairly quickly with this error message:

Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : 
  negative length vectors are not allowed

I launch many jobs with the same script, for multiple values of extraVars[2], and only some are repeatedly and predictably failing; extraVars[3] is always either 50 or 100. I realise the problem is probably coming from my fasta input, but the files I use differ by thousands of lines and checking them manually is not really going to work, so any intuition you might have as to what is causing the error would really help me clean them up - thank you!!

(And, not to be captain obvious here, but here are the lengths for one of the failed runs:

summary(width(hsFasta))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     30     293     793    1486    2090   98920 

summary(hsDepth)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    6.0    59.0   159.0   297.1   418.0 19780.0 

there are no obvious negative lengths that I can find.)

Oh, also - for some reason this command yields 20 replicates instead of 10 per run of simulate_experiment. Any idea why? Not that I don't appreciate having twice as many samples as I need, I just couldn't find the answer in the manual.

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Biostrings_2.40.2   XVector_0.12.1      IRanges_2.6.1      
[4] S4Vectors_0.10.3    BiocGenerics_0.18.0 polyester_1.8.3    

loaded via a namespace (and not attached):
[1] zlibbioc_1.18.0
alyssafrazee commented 7 years ago

The negative length vectors not allowed error message means your R session has run out of memory. You can either try to run the simulation on a machine with more memory, or you could break up the input files into smaller input files and do partial simulations; then concatenate the output.

As for the replicates question, it looks like you're using the current release version (1.8.3), which per the docs on num_reps is the number of replicates in each group -- so if it's a single integer, you're simulating 2 groups of 10 replicates each (20 replicates).

That said, the behavior should be that the length of the vector determines the number of groups, as of a refactor I made in Dec 2014 :/ I will check on why the release version is so outdated. In the meantime, you can install the github version using the install_github function from devtools: install_github('alyssafrazee/polyester').

ireneg commented 7 years ago

Thank you so much on both counts! I hadn't even thought about memory but our cluster is very weird in how it assigns it, so I might not be getting nearly as much as I thought, which would explain things, especially since the files that work are subsets of the ones that fail.

And yes, I had looked at your example code here on github and it seemed that you specified two groups explicitly, so that also didn't occur to me. I was just wondering where my extra factor of 2 was coming from!

Thanks again! Back to running all my other simulations now :)