Magdoll / Cogent

Coding Genome Reconstruction using Iso-Seq data
BSD 3-Clause Clear License

Confusion about input file of Cogent #7

Closed linglingtingfei closed 7 years ago

linglingtingfei commented 7 years ago

Hi Liz, I just want to confirm whether the input file for Cogent should contain only full-length (FL) transcripts, or both FL and non-FL reads. After running the collapsing script, representative sequences for each unique isoform are generated. Should I use that file as the input for Cogent? Also, I'm always confused about whether all sequences in the representative FASTA file are full-length transcripts. If not, how do I filter out the FL transcripts?

I'm looking forward to your reply. Thank you so much in advance.

-Ling

Magdoll commented 7 years ago

In short: neither FL nor nFL CCS reads.

The input to Cogent should be the high-quality (HQ) consensus isoform sequences produced by Iso-Seq clustering. The filename is usually something like all_sizes.quivered_hq.fasta.

Cogent operates on the assumption that the input sequences have an error rate below 1%. Neither the FL nor the nFL CCS reads fulfill that assumption, because the default Iso-Seq classify criteria accept any CCS read above 85% accuracy.

fantastycrane commented 6 years ago

Hi @Magdoll, while I was running Cogent, I used all_sizes.quivered_hq.fasta along with all_sizes.quivered_lq.fasta corrected using Illumina RNA-Seq data as input. It seems proovread replaced some bases in the original sequences with 'N', and the script exited with an error while reconstructing contigs:

"""
Processed 3 queries in 0.17 seconds (17.65 queries/sec)
sequence T02.PB33997 contains non A/T/C/G characters! Not OK! Please fix the offending sequences first. Abort.
"""

Is there any way to solve this issue without omitting corrected lq sequences? Thanks.

Magdoll commented 6 years ago

Hi @fantastycrane ,

Cogent does not allow non-ATCG characters. If proovread inserted "N" bases, my suggestion is to first convert them all to some base (one of A, T, C, or G) so that the sequences are accepted as input.
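A minimal sketch of that conversion, in plain Python with no Cogent or Biopython dependency. The function names `replace_non_acgt` and `clean_fasta` are hypothetical, and the choice of "A" as the default replacement base is arbitrary. Lowercase a/c/g/t are left untouched here, which assumes they are acceptable downstream; that has not been checked against Cogent.

```python
import re

# Anything that is not A/C/G/T (either case) counts as an offending character.
NON_ACGT = re.compile(r"[^ACGTacgt]")


def replace_non_acgt(seq, base="A"):
    """Return seq with every non-A/C/G/T character replaced by `base`."""
    if base not in ("A", "C", "G", "T"):
        raise ValueError("base must be one of A, C, G, T")
    return NON_ACGT.sub(base, seq)


def clean_fasta(lines, base="A"):
    """Yield FASTA lines with sequence lines cleaned.

    Header lines (starting with '>') and empty lines pass through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">") or not line:
            yield line
        else:
            yield replace_non_acgt(line, base)


# Example with an in-memory record; a real file would be opened and
# passed in line by line instead.
example = [">T02.PB33997\n", "ACGNNT\n"]
cleaned = list(clean_fasta(example))
```

Note that every substituted position is a guess, so it may be worth recording how many bases were changed before trusting the reconstructed contigs.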

On the other hand, I'm curious as to why proovread would introduce N bases.

--Liz

fantastycrane commented 6 years ago

@Magdoll Thanks. Proovread substitutes non-ATGC with N. I have checked quivered output and there is no such problem. Something else must be going wrong.
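One way to narrow down where the offending characters come from is to count the non-A/C/G/T characters per record in each FASTA file (the quivered output and the proovread-corrected output) and compare the results. The sketch below is plain Python; `non_acgt_report` is a hypothetical helper, not part of Cogent or proovread, and it treats lowercase bases as valid by uppercasing before the check.

```python
from collections import Counter


def non_acgt_report(lines):
    """Map each FASTA record ID to a Counter of its non-A/C/G/T characters.

    Records that contain only A/C/G/T do not appear in the result.
    """
    report = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            continue
        if current is None:
            continue  # sequence text before any header; ignore it
        bad = Counter(ch for ch in line.upper() if ch not in "ACGT")
        if bad:
            report.setdefault(current, Counter()).update(bad)
    return report


# Example with in-memory lines; a real file would be opened and passed in.
report = non_acgt_report([">a", "ACGT", ">b desc", "ACNNR", "nT"])
```

If the quivered file yields an empty report and the corrected file does not, the characters were introduced during or after correction.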