alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
89 stars 51 forks source link

Simulated strand-specific reads were weird #61

Closed yh549848 closed 5 years ago

yh549848 commented 5 years ago

Hi,

Now, I'm trying to generate simulation reads with the strand-specific parameter.

The code I executed is below:

#
# Generate simulation data from kallisto's estimated abundance using polyester
#

# installation
# source("https://bioconductor.org/biocLite.R")
# biocLite('Biostrings')
# biocLite('ballgown')
# biocLite('polyester')

# requirements
library(tidyverse)
library(readr)
library(data.table)
library(tools)

library(Biostrings)
library(ballgown)
library(polyester)

# load transcript seqs
fasta_path <- './references/gencode.v29.transcripts.fa'
fasta <- fasta_path %>% readDNAStringSet
transcript_ids_ordered <- fasta %>% names

# load counts and shape to matrix
readmat <- matrix(nrow = length(transcript_ids_ordered), ncol = length(input_paths))
for (i in 1:length(input_paths)) {
  path <- input_paths[i]
  abundances <- path %>% fread(sep = '\t')
  readmat[, i] <- abundances[abundances$target_id == transcript_ids_ordered] %>% .$est_counts
}
readmat <- readmat %>% round

# generate reads
## params
PAIRED          <- TRUE
READLEN         <- 100 # default
ERROR_RATE      <- 0.005 # default
BIAS            <- 'rnaf'
STRAND_SPECIFIC <- TRUE # CAUTION: this parameter will generate FR directed reads (NOT RF)
SEED            <- 12345
OUTDIR          <- './reads_simulated'

## simulate reads by count matrix
simulate_experiment_countmat(fasta = fasta_path,
                             readmat = readmat, paired = PAIRED,
                             readlen = READLEN, error_rate = ERROR_RATE,
                             bias = BIAS, strand_specific = STRAND_SPECIFIC,
                             seed = SEED, outdir = OUTDIR)

And then I performed alignment to reference sequence. The result (IGV) is below:

screen shot 2019-01-30 at 17 40 22

Periodic mountains (high depth) and valleys (low depth) can be seen on that result. Is this normal?

Best, YH

alyssafrazee commented 5 years ago

Hi there, this is a place to report problems or bugs with the ballgown software. This looks like you're looking for a review of your research (or possibly your code). We recommend asking a collaborator at your institution for this type of thing, or you might get a good answer on support.bioconductor.org using the "polyester" tag (for other folks that might have seen this).

If you do find a specific software bug, where the reads are generated incorrectly, please feel free to open an issue. Thanks and good luck!

yh549848 commented 5 years ago

Hi, thanks for your reply. Sorry to confuse, I do not require to be reviewed. I think this problem caused by generate_fragments() with bias option. At long transcript (e.g. NEAT1; > 20 kbp), start_pos becames discrete because cdna/rna_pos_bias configuration file have only 100 rows. Therefore, reads which have peak-peak distance approximately 200 bp will obtain. This is so differenced from real read data.

generate_fragments.R (extracted)

    if(bias == 'none'){
        start_pos = floor(runif(n, min=rep(1,n), max=L[s]-fraglens[s]+2))
    }else if(bias == 'rnaf'){
        data(rnaf)
        starts_pct = sample(rnaf$pospct, size=n, prob=rnaf$prob, replace=TRUE)
        starts_pct[starts_pct==1] = 0.999
        start_pos = floor(starts_pct * (L[s]-fraglens[s]+2))
        start_pos[start_pos==0] = 1
    }else{
        # bias == 'cdnaf'
        data(cdnaf)
        starts_pct = sample(cdnaf$pospct, size=n, prob=cdnaf$prob, replace=TRUE)
        starts_pct[starts_pct==1] = 0.999
        start_pos = floor(starts_pct * (L[s]-fraglens[s]+2))
        start_pos[start_pos==0] = 1
    }

Thanks!