alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
90 stars 51 forks source link

strand specific argument #40

Open dn070017 opened 8 years ago

dn070017 commented 8 years ago

Hello, developers,

Recently, I'm trying to perform a simulation with strand specific reads.

Because I can't find any information about what kind of library will be constructed, so I decided to do a quick test with yeast cdna.

The code I used are basically shown below (but for memory issue, I have to split the original reference cdna into many different files):

fasta_count = count_transcripts('test.fasta') fasta = readDNAStringSet('test.fasta') writeXStringSet('test.fasta', 'test.file') readspertx = round(50 * (width(fasta) / 200)) fold_change = data.frame(value=matrix(sample(seq(from=0.01, to=10.00, by=0.01), fasta_count, replace=T), fasta_count)) simulate_experiment('test.file', reads_per_transcript=readspertx, num_reps=c(1,1), fold_changes=fold_change, error_model='illumina5', bias='cdnaf', outdir=folder, strand_specific=T)

The simulation was successful without any error message.

Then I mapped the reads back to the reference cdna and tried to determine the expression profile of each sequences. (using bowtie2 and express)

Because I need to check what kind of libraries were used in the simulation model, I ran express on three different ways (which only calculate the reads with specific library protocols)

express -o ./sample -m 250 -s 25 reference.fasta sample.bam (unstranded) express -o ./sample_fr -m 250 -s 25 --fr-stranded reference.fasta sample.bam (fr-stranded) express -o ./sample_rf -m 250 -s 25 --rf-stranded reference.fasta sample.bam (rf-stranded)

What surprised me is the total count or effective count for sample_fr and sample_rf are only half of the unstranded one The result indicate that there are 50% of reads are fr-stranded, and others are rf-stranded, which suggest the simulation still produce unstranded reads.

I'm wondering if I use the program in a wrong way or do I misunderstand anything.

By the way, thanks for this amazing tools! it really helps me a lot on my research

Best, Ping-Han

alyssafrazee commented 8 years ago

Hi Ping-Han,

When you aligned your reads with bowtie (to create the sample.bam) files, did you provide a strand-specific argument there? From what I remember, you do have to tell bowtie that your reads are stranded (it is not the default and it doesn't infer it from the fasta input files). That's the only thing I can think of that might be going wrong outside of polyester.

@mikelove, do you have a sense of whether something might be going on with the strand-specific simulation? I can dig in when I get a few more spare moments, but thought I'd see if anything came to mind.

mikelove commented 8 years ago

@dn070017 Can you provide sessionInfo()? Also, my first check would be to just look at the aligned reads with IGV. You can color by strand.

dn070017 commented 8 years ago

The sessionInfo() are described as follow: R version 3.3.1 (2016-06-21) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base

other attached packages: [1] polyester_1.8.3 Biostrings_2.40.2 XVector_0.12.1 [4] IRanges_2.6.1 S4Vectors_0.10.3 BiocGenerics_0.18.0

loaded via a namespace (and not attached): [1] zlibbioc_1.18.0

I used bwa (which estimate the orientation and insert size on its own) to do the alignment as well. By the way, in my own experience, I think the mapper such as bwa or bowtie2 do not take the library type into account. They seems to consider only the orientation of the reads to be whether inward (PE) or outward (MP). After converting the result to sam format and sorting appropriately, I passed the result to IGV. Here's the command I used: bwa mem -t 20 ../../yeast_cdna/Saccharomyces_cerevisiae_cdna_500_bwa ./sample_01_1.fastq ./sample_01_2.fastq > sample_01.sam samtools view -@ 20 -b sample_01.sam | samtools sort - -o sample_01.bam samtools index sample_01.bam sample_01.bai

The result are shown as follows: image

I've sorted the alignments by which strand they were aligned. So the alignments above the red line were aligned to the negative strand of cdna, those below the red line were aligned to the positive strand. And I also colored the alignments by the first-of-pair strand. For the alignments mapped to the negative strand of cdna, the purple ones suggest they are the first read in the pair (w/ negative insert size), and the red ones the second read in the pair. However, when it comes to the alignments mapped on the positive strand of cdna, the red ones suggest they are the first read in the pair, and the purple ones the second. The result that I expect is the reads which are first read in the pair should always aligned to specific strand of cdna, and the second read in the pair to another. (e.g for library like fr-firststrand, the first reads in the pair should always mapped on the negative strand of cdna or transcript)

I'm still new to bioinformatics, so if I misunderstand anything, just feel free to let me know. Thanks for the reply.

mikelove commented 8 years ago

The argument 'strand_specific' is in the devel branch (Polyester 1.9), this is release.

Check the help pages in R, and you should see that 'stand_specific' is not listed in the Details:

?simulate_experiment