BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License

At which step of the RS_IsoSeq pipeline should proovread be used for correction? #51

Closed raechin closed 8 years ago

raechin commented 8 years ago

Apologies if you have already received my emails (I'm not sure whether they were sent successfully or not).

I have six SMRT cells and some HiSeq NGS data from transcriptome sequencing of a species without a reference genome. I am planning to use the RS_IsoSeq pipeline to get polished full-length reads by:

  1. ConsensusTools.sh CircularConsensus ... to get reads of insert (CCS reads)
  2. pbtranscript.py classify ... to get full-length reads.
  3. pbtranscript.py cluster ... to do isoform-level clustering and get consensus reads.
  4. ice_polish.py ... to run Quiver and get polished consensus reads.

After Quiver polishing, I will get low-quality and high-quality consensus sequences. I learned from https://github.com/BioInf-Wuerzburg/proovread/issues/41 that the user there tried to correct the low-quality consensus sequences from Quiver with proovread, but I don't know how it went or whether it was successful.

My question is: can I use proovread to correct both the high-quality and, especially, the low-quality consensus sequences? I'm also a bit confused: if I correct the PacBio reads from step 1 (CCS reads; this is the usual proovread input, right?), can I still use the RS_IsoSeq pipeline for the subsequent analysis, or should I use other tools, and if so, which?

I'm really new to PacBio analysis. I sincerely hope you can give me some suggestions.

Thank you!

thackl commented 8 years ago

You can use proovread to try and further correct high- as well as low-quality consensus sequences. However, your high-quality data should already be pretty error-free. Use --no-sampling for transcriptome data runs.
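For a transcriptome run the call could look roughly like this (just a sketch with placeholder file names; -l takes the long reads to correct, -s the short reads and can be given more than once, --pre sets the output prefix):

    # placeholder file names; --no-sampling skips the short-read sampling step,
    # which assumes the roughly even coverage of genomic data
    proovread -l lq_isoforms.fasta \
              -s hiseq_R1.fastq -s hiseq_R2.fastq \
              --no-sampling --pre lq_isoforms-corrected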

The usual input for proovread are "reads of insert" (also referred to as subreads). Whether or not the subreads from a template have already been merged into a single consensus (CCS) does not matter, as proovread will do that if necessary.

If you run proovread correction after step 1, I think steps 2, 3 and 4 should still be possible. However, I have not used the pipeline myself. In particular, I do not know how full-length reads are determined, and whether that might be affected by prior correction with proovread. Clustering should still work, although the differences between reads will be much smaller, since most errors have already been removed.

thackl commented 8 years ago

The most interesting reads to correct are probably those deemed full-length but which did not get a lot of hits during clustering, i.e. rare transcripts with low PacBio coverage. I'm not sure, though, whether you can get that information from your data.

raechin commented 8 years ago

Thank you very much, thackl! I'll try correction on both the CCS subreads and the Quiver low-quality sequences if I can. I tested proovread on one small piece of a file and it seems all right. But before running proovread on the whole data set, could you help me estimate how many short reads I need and how long proovread will take to run?
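(In case it is useful: a small test piece like that can be cut out with SeqChunker, which comes bundled with proovread; something like the following, with made-up file names, if I got the options right:)

    # made-up file names; -s is the approximate chunk size,
    # -o the printf-style pattern for the output chunk files
    SeqChunker -s 20M -o ccs-%03d.fastq ccs_cell1.fastq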

I have six reads-of-insert (CCS) files, each ~140 Mb (for example, one file has 67,000 seqs at ~2,100 bp). The final Quiver LQ sequences for the six pooled files are also ~140 Mb (58,260 seqs at ~2,658 bp).

My species has no reference genome, and the estimated genome size is ~3 Gb. What is the minimum amount of short reads required for correction? Do I need 50x coverage of the ~3 Gb genome in short reads? How long will it take if I use that much (~3 Gb x 50) to correct the 140 Mb of long reads?

Thank you!

thackl commented 8 years ago

Just to clarify - are you planning to correct your transcriptome PacBio reads with genomic Illumina reads?

raechin commented 8 years ago

Both my PacBio and my NGS data are from transcriptome sequencing.