alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
90 stars 51 forks source link

Only outputting one file? #64

Open theheking opened 5 years ago

theheking commented 5 years ago

Hi,

I am running polyester using the following code:

simulate_experiment(EnsemblfastaFile, numreps=c(1,1),fold_changes=fold_changes, reads_per_transcript=readspertx, paired=FALSE, outdir="/mnt/lustre/users/k1632479/polyester/simulatedread", distr="empirical", error_model="illumina5", bias="rnaf")

It runs for hours and outputs a large 8 GB fasta file. I don't understand what I'm running wrong.

JMF47 commented 5 years ago

It’s likely that this is an input issue and would be more appropriate on the bioconductor user forum.

What does your readspertx look like?

theheking commented 5 years ago

If it is an input issue, I don't know why I would get a output fasta file for the first sample_01.fasta which is 14G.

I have already removed the zeroes from the readspertx. So I don't think it is that. `fasta_File <- readDNAStringSet(EnsemblfastaFile)

replace all 0 counts with 1

fastaFile_nozero <- replace(width(fasta_File), width(fasta_File) == 0, 1)

readspertx = round(20 * fastaFile_nozero / 100)`

JMF47 commented 5 years ago

What does the following show:

head(readspertx) sum(readspertx)

theheking commented 5 years ago

head(readspertx) [1] 214 22 831 598 727 96 sum(readspertx) [1] 47182274

JMF47 commented 5 years ago

If I’m understanding your query and the output you’ve shared with me, the output file size is within the realm of expectation. The input specified generates one file per sample, since paired=F. If I recall correctly, 47 million 100bp reads stored in fasta format should be >10Gb. If you could let me know a bit more about what your goal of the simulation is (such as simulating 20x coverage of every transcript in the transcriptome) I could try and help you tweak your commands.