Half of the HQ isoforms are classified as antisense and NIC

Magdoll / SQANTI2

SQANTI2 is now replaced by SQANTI3. Please go to: https://github.com/ConesaLab/SQANTI3


Half of the HQ isoforms are classified as antisense and NIC #46

Closed by qoiopipq 4 years ago

qoiopipq commented 5 years ago

I ran SQANTI2 on PacBio Sequel high-quality collapsed isoforms from Iso-Seq + ToFU. More than 50% of the isoforms are classified as "antisense" or "NIC". I used GMAP to map against the Ensembl mouse genome.

I also tried proovread to correct the high-quality isoforms with short reads. In the SQANTI2 report, the number of "NIC" isoforms dropped and "ISM" increased, but "antisense" is still around 25%.

Is it normal to see half of the isoforms as "antisense" and "NIC"? Or is it because of insufficient coverage?

Thanks!

Magdoll commented 5 years ago

Hi @qoiopipq ,

A very high proportion of anti-sense, especially if they are mostly mono-exonic, is not expected. It could indicate a library issue or some really unusual biological phenomenon. Are these fetal mouse samples? Brain?

I do not believe error correction is at issue here. If you are using HQ transcripts, the expected accuracy should be > 99%. NIC also means these are novel isoforms using known, canonical junctions, so mapping to the junctions appears to be clean as well. If errors were an issue, you would be seeing a lot of non-canonical junctions (likely in the NNC category).

ISM means incomplete splice matches - 5' degraded products, likely. So again, nothing to do with sequencing/consensus errors.

I would suggest running sqanti_filter2.py to see if most of the anti-sense are filtered out.

--Liz

qoiopipq commented 5 years ago

Hi @Magdoll

These are mouse immune cell samples. I noticed that the % polyA intra-priming for these antisense mono-exonic reads is very high (>80%). I looked at some of them and found that the possible polyA intra-priming sites are within introns, so I think a library preparation issue is more likely. I then ran sqanti_filter2.py with the intra-priming parameter at its default of 0.8, and most of these antisense reads were removed. However, around 100 antisense reads with intra-priming > 80% were left after sqanti_filter2.py. Is there any reason these were not filtered out?

Thanks!

Magdoll commented 5 years ago

Hi @qoiopipq ,

Good to know that the filtering step removes most of them.

For anti-sense, even if there is a >80% "A" stretch, the transcript will not be filtered out if a polyA motif is detected (did you supply a list with the --polyA_motif_list option?).

--Liz
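The filtering rule described in the reply above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this thread, not the actual code of sqanti_filter2.py. It assumes the classification table has columns named perc_A_downstream_TTS (stored as a percentage) and polyA_motif (with "NA" meaning no motif found), and that a value exactly at the threshold counts as filtered. The isoform names and values are made up.

```python
import csv
import io

THRESHOLD = 0.8  # same value as the default intra-priming cutoff mentioned in the thread


def flag_intrapriming(row, threshold=THRESHOLD):
    """Return True if a row would be dropped under the rule sketched above.

    Assumptions (not confirmed by the thread): the A content is stored as a
    percentage in 'perc_A_downstream_TTS', and 'polyA_motif' holds "NA" when
    no motif was detected.
    """
    frac_a = float(row["perc_A_downstream_TTS"]) / 100.0
    has_motif = row["polyA_motif"] not in ("", "NA")
    # A detected polyA motif rescues the transcript even if the A stretch is long.
    return frac_a >= threshold and not has_motif


# Tiny made-up table, for illustration only.
example = io.StringIO(
    "isoform\tstructural_category\tperc_A_downstream_TTS\tpolyA_motif\n"
    "tx1\tantisense\t85\tNA\n"
    "tx2\tantisense\t90\tAATAAA\n"
    "tx3\tnovel_in_catalog\t20\tNA\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
dropped = [r["isoform"] for r in rows if flag_intrapriming(r)]
kept = [r["isoform"] for r in rows if not flag_intrapriming(r)]
print("dropped:", dropped)
print("kept:", kept)
```

Under this sketch, tx2 is kept despite its 90% A content because a motif is present, which matches the behaviour described for the roughly 100 remaining antisense reads.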

qoiopipq commented 4 years ago

Hi @Magdoll

I didn't specify the --polyA_motif_list option. After running sqanti_filter2.py, a bit less than 3% of the transcripts had polyA intra-priming > 80%. Is that a reasonable fraction of transcripts to remain in the data set?

Cheers!

Magdoll commented 4 years ago

Hi @qoiopipq , 3% sounds pretty reasonable. If this is human you can supply the polyA list given here

qoiopipq commented 4 years ago

Hi @Magdoll

Thanks for your help. I have another question about sqanti_filter2.py. It's not related to what I asked before, but I'll leave it here. In the output lite.gtf file, the gene_id attribute is set to "PB" for all exons and transcripts. Not sure if it's just a bug:

1 PacBio transcript 5083743 5086778 . + . gene_id "PB"; transcript_id "PB.4.1";
1 PacBio exon 5083743 5086778 . + . gene_id "PB"; transcript_id "PB.4.1";
1 PacBio transcript 5148898 5162443 . + . gene_id "PB"; transcript_id "PB.5.1";
1 PacBio exon 5148898 5150061 . + . gene_id "PB"; transcript_id "PB.5.1";
1 PacBio exon 5162105 5162443 . + . gene_id "PB"; transcript_id "PB.5.1";

I just used awk and sed to fix it.

Cheers!
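The awk and sed commands used for the workaround are not shown in the thread. As a rough equivalent, here is a minimal Python sketch that rewrites gene_id from transcript_id. It assumes the gene ID should be the transcript ID without its final ".N" suffix (so "PB.5.1" gives "PB.5"); the function name fix_gene_id is hypothetical and not part of SQANTI2 or Cupcake.

```python
import re

TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')
GENE_RE = re.compile(r'gene_id "[^"]*"')


def fix_gene_id(line):
    """Replace gene_id with the transcript_id minus its last '.N' component.

    Comment lines and lines without a transcript_id are returned unchanged.
    """
    match = TRANSCRIPT_RE.search(line)
    if line.startswith("#") or match is None:
        return line
    gene = match.group(1).rsplit(".", 1)[0]  # "PB.5.1" -> "PB.5" (assumed convention)
    # A callable replacement avoids any backslash interpretation in the ID.
    return GENE_RE.sub(lambda _: f'gene_id "{gene}"', line, count=1)


# One record from the thread, written with tab-separated GTF columns.
sample = '1\tPacBio\texon\t5162105\t5162443\t.\t+\t.\tgene_id "PB"; transcript_id "PB.5.1";'
fixed = fix_gene_id(sample)
print(fixed)
```

To process a whole file, apply fix_gene_id to each line and write the results to a new GTF. Upgrading Cupcake, as suggested in the reply that follows, is the proper fix.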

Magdoll commented 4 years ago

Hi @qoiopipq , This GFF gene_id issue should now be fixed in the latest version of Cupcake (v9.0.3).

--Liz