Detection of FLNC reads with Isoseq3

wyzhangMPI commented 5 years ago

Hello, we are trying to use Iso-Seq to detect new isoforms and would like to understand how full-length non-chimeric (FLNC) reads are detected. To be FLNC, a read needs to: 1) contain both the 5' primer and the 3' primer; 2) contain a polyA signature; 3) not be chimeric, i.e. not consist of more than one transcript joined together. "lima" can be used to detect the first signature, with proper 5p--3p orientation, but this step may not check for the chimeric signature. Would "--require-polya" in the cluster step look for multiple polyAs in a read and remove such reads? I also found quite a lot of reads (~5k) that were not correctly classified as FLNC reads, some of them containing a sequencing error in the middle of the polyA (e.g., AAAAAAAAGAAAAAAAAAAAAAAAAAAAAA). But I also found some CCS reads that were not classified as FLNC even though they contain no sequencing errors.

For example (the first 10 bases are UMI): m54124_181029_181314/10027310/ccs 20 0 255 * 0 0 GGTGAGAAAGGTTTTTTTTTTTTTTTTTTTTTTTTTAGGGCAGGCCAAACCCTAGTTTATTTCAGTGTCAGCAACAGCTTAGCCATCAAAAAAATAACTCTACCCAGGCGACAGAAGTCTCTACAGCGAGGCTAAGGGTCAGCCGCCAGGCGGCAAACATCAAGGATGCATGGCCGGCACGCCCGGGTAATAAGTTAGGAAGCGGCAGCCTGATGGTGGTGAGGGCCAGGCTTCACTTCTGGGCCGGCATGAGGTCATCGATTGCCTGACCCTGCTCGAGCCGTATTGCTCCATCTCAATGAGTAGTTTCACTCCGTCCACCACCATCTGCACCAGTTCCACCTCCGAGAAGCCCAGGCGGTCAGCGTTAGAGACATCAAACACCCACCGACAGCAGCGGTGTCCACGCCACCTGTGCCTCGCTTCTGAAGCCGCAGCCGCTTGAGCACCTCCGAGAACTTCTCGTGCTTCCCCAGGTGGGGCAGCTTGTGTGCACACCTGCCCGCAGTCCGGTGCTCAGGTTGGATGGGCAGTGTGAGGATGTAGCCCAGGTGAGGATTCCACATGAACTCATAGTCTTGGACTTGAAGAGAGTTTCGATCTGAGTGAGGCCGGTGCAGAATCGGGGAACACTTCCTTCATGTTGCCCCCCTTCTGCATGGAGATGACTCGCAGGTGGTCCTCCTCTTTAATCCACACCAGGGAAAGTCTTATTGTCATTGTGCCATATGCACGAGCATCCGGCCATTCGCTTGGCCATGCCGGAGGCCAGCAGCAGAGGCGACACAGGCTTATCGAAGAGGAAGTGGTCGTCAATGAGCTGCTGCTGCTCCGCCTCAGTCATGCTCGAGCGCGTAGTACCTGCCAACAGGTCGCCATCTAGGCTGGACGAGCTTCTACTGCCACTTCTCGATGGCGCGGCGCTCCCCGCGGCTGGCAGTGCGGGGGAGACAGAAGCCGCGGATGCTGCGGCCTGTGCGCACTCGCGAGCTCAGCACGTAGTTGGGGTCCAGGTCTCGCCACCCTGCAGGTTGTCTGGGTTGAGGTCGGTCTTGTGCTCATCATGGGCTGGTAGCCGCCGTGCCGCTCCTCAATAATGGGGTCCGAAGAGGTCCTTGAATACGTCGTAACTCTCCTCGTCGCCCGCCACTGCACCCACAGTCATGACTGTATCGGGTGGCCCGGATTGTCTACGCCAGTCTGAATGGCGTCGTCCAAAGTAAAGCCGCCGGCGTGCACTTGGCACGGAGCTCGGCGTACAGCTCGGGGTCAGCACCTTGGCATATGGTTGTTGTGGCTGCTCAGATCAGGGAACTCTCCTCGGCCGGAAGCGCAGCTTCTGCGTATTATGGCTGTTGGAGAAGGGCATGGCGGCGGCAGCGGGCGGATGCCGGACGAAGCAGGGCGATGGATGCCTGTGCTTGCAGCTCCTGGCGCAGCGCAGAAGAAAC 
55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555 RG:Z:e175cfed bx:B:i,25,29 np:i:3 rq:f:0 rs:B:i,5,0,0,0,0,0 sn:B:f,6.00709,11.2122,5.56258,9.58404 za:f:0 zm:i:10027310 zs:B:f,0 qs:i:25 qe:i:1476 bc:B:S,1,0 bq:i:95 cx:i:12

Does anyone know how to fix this issue? Thanks!

armintoepfer commented 5 years ago

1) The cluster step trims polyA tails. It also identifies concatemers by searching for the SMRT Bell hairpin sequence. 2) UMIs are not supported. You must trim them before going into the isoseq pipeline. Clustering with UMIs is beyond this bioconda support channel.
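For anyone who needs to remove UMIs before running the isoseq pipeline, a minimal Python sketch of the trimming step follows. It assumes a fixed-length UMI at the very start of each read (10 bases, as in the example read above). `trim_umi` and its `umi_len` parameter are hypothetical names, not part of any PacBio tool, and reading and writing BAM files is left out.

```python
def trim_umi(seq, qual, umi_len=10):
    """Split a read into (umi, trimmed_seq, trimmed_qual).

    Assumes the UMI occupies the first `umi_len` bases of the read.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) <= umi_len:
        raise ValueError("read is not longer than the UMI")
    return seq[:umi_len], seq[umi_len:], qual[umi_len:]


# Toy read: the first 10 bases are the UMI, the rest is the insert.
read_seq = "GGTGAGAAAG" + "G" + "T" * 12
read_qual = "5" * len(read_seq)

umi, seq, qual = trim_umi(read_seq, read_qual)
print(umi, seq, qual)
```

Keeping the removed UMI (for example in the read name or a custom tag) is a design choice that is up to you; the sketch only returns it.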

wyzhangMPI commented 5 years ago

Thanks for the explanation. Just a few more related questions:

1) Would it be possible for the cluster step to trim polyA tails that contain occasional sequencing errors, e.g. AAAAAAAAAAAAAGAAAAAAAAAAAAA?
2) The search for concatemers (SMRT Bell hairpin sequence) is done in the middle of CCS reads, is that correct? The purpose of this option is to detect and remove chimeric reads, right? And is the detection and removal of wrongly oriented reads (3p--3p or 5p--5p) in the lima step an alternative way of removing chimeric reads?
3) If I understood correctly, polyA detection requires at least 20 As, and no more than 10 overhanging bases are allowed. For instance, AAAAAAAAAAAAAAAAAAAAAAAAAAAAAATCGCGATTGT can still be considered polyA. Most of the UMIs seem to be good, but I will also trim the UMIs on my own.
4) For the initial CCS step, a "no-polish" option is recommended. Does that mean a subread is randomly chosen to represent the CCS, or is some error correction done at this point? If the "polish" option is applied, what additional procedures are performed?
5) isoseq3 cluster only outputs transcripts with at least 2 supporting CCS reads and may thus introduce false negatives (i.e., transcripts with only one supporting CCS read). We would therefore prefer to use the FLNC.bam data for further analysis to get more results. Would it be possible to run "isoseq3 polish" on the FLNC.bam data (all full-length non-chimeric reads) instead of the clustered BAM data?


armintoepfer commented 5 years ago

1] We trained an HMM for polyA detection; residual errors are allowed.
2a] SMRT Bell hairpin sequence detection is performed across the whole read.
2b] Wrong orientations are not written to the output file by lima, so they don't even make it into the clustering step.
3] Yes, at least 20 As, but the read has to start with a polyA. The Viterbi may allow a few residuals, but likely not your full UMI.
4] no-polish takes the subreads, creates a partial-order alignment, and calls a consensus sequence. This is noisier than polished CCS, which takes much longer, but of sufficient quality. Motivation is purely speed. You can also take fully polished CCS as input.
5] I don't trust a single molecule. I'd rather be concerned by a ton of FPs than your FNs. To counter the FN argument, sequence more. If you use polished CCS as input, there is no need to polish the FLNCs. In the upcoming version 3.1, we will introduce a pre-processing step that has CCS as input and generates FLNCs that will be used for clustering.
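To illustrate what tolerating residual errors in a polyA tail can look like, here is a small Python sketch. It is not the HMM/Viterbi approach isoseq3 uses; it is a much simpler score-based scan from the 3' end (+1 per A, a penalty per non-A, stop once the score has dropped too far). The function name, the penalty, and the drop threshold are assumptions chosen for illustration, and the read is assumed to be oriented so that the tail sits at its 3' end.

```python
def find_polya_tail(seq, min_as=20, mismatch_penalty=4, max_drop=8):
    """Return the start index of a 3' polyA tail, or None if none is found.

    Scans backwards from the end of the read. Each A adds 1 to the score,
    every other base subtracts `mismatch_penalty`. The scan stops once the
    score falls more than `max_drop` below the best score seen so far.
    """
    score = best_score = 0
    a_count = best_a_count = 0
    best_start = len(seq)
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] == "A":
            score += 1
            a_count += 1
        else:
            score -= mismatch_penalty
        if score > best_score:
            # Extending the tail to position i improves the overall score.
            best_score, best_start, best_a_count = score, i, a_count
        elif best_score - score > max_drop:
            break
    return best_start if best_a_count >= min_as else None


# Toy transcript followed by a tail with one sequencing error (a single G).
body = "CGTACGTTGC"
tail = "A" * 8 + "G" + "A" * 21
start = find_polya_tail(body + tail)
print(start, (body + tail)[:start])
```

With these toy parameters a single miscalled base inside a long tail is absorbed, while a read that does not end in a run of As yields `None`.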

wyzhangMPI commented 5 years ago

Thanks for the clear response. Just one more question for now: regarding the removal of chimeric transcripts, the lima step only searches for the 5' and 3' primers at both ends, right? I mean, if there are chimeric reads (not biologically meaningful), there might be multiple primers or polyAs in the middle of the reads (from before the SMRT Bell hairpin adapters were added). Is there any way to remove such chimeric reads?


armintoepfer commented 5 years ago

That's exactly the reason I don't trust a single molecule. What are the chances that you detect the same chimeric read twice? The human genome contains homopolymer A stretches >20 bp, so looking for polyA in the middle of the read is not the best approach. One could look for primers, but I don't, because, as I said initially, those molecules are unlikely to form clusters. My goal is to create meaningful clustering results. If your goal is to refine FLNCs without running clustering, then you are on your own for now.
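For those who do want to screen FLNC reads for internal primers on their own, a rough Python sketch is below. This is not something isoseq3 or lima does; it simply slides each primer (and its reverse complement) across the interior of a read and reports positions within a small Hamming distance. The primer sequence, the `flank` size, and the mismatch threshold are placeholders, so substitute the primers from your own library prep.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def hamming_hits(seq, pattern, max_mismatches):
    """Start positions in `seq` where `pattern` matches within `max_mismatches`."""
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def internal_primer_hits(read, primers, flank=50, max_mismatches=2):
    """Report primer matches in the interior of a read.

    The first and last `flank` bases are skipped, because primers are
    expected at the read ends. Returns (primer_name, strand, position)
    tuples with positions relative to the full read.
    """
    interior = read[flank:len(read) - flank]
    hits = []
    for name, primer in primers.items():
        for strand, pattern in (("+", primer), ("-", revcomp(primer))):
            for pos in hamming_hits(interior, pattern, max_mismatches):
                hits.append((name, strand, pos + flank))
    return hits


# Hypothetical primer, placed in the middle of a toy read.
primers = {"hypothetical_primer": "ACGTTGCAGGTCAGTCCATG"}
read = "C" * 60 + primers["hypothetical_primer"] + "G" * 60
print(internal_primer_hits(read, primers))
```

Hamming distance ignores insertions and deletions, so a real screen would probably need an alignment-based search; the sketch only shows the overall shape of the check.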

wyzhangMPI commented 5 years ago

Yes, I understand your logic. Thanks!


armintoepfer commented 5 years ago

If you happen to find a massive amount of chimeric reads, go and find your sample prep person :)
