isoseq collapse results in redundant isoforms

christine-liu commented 6 months ago

Operating system Which operating system and version are you using?

Ubuntu 20.04.6

Package name Which package / tool is causing the problem? Which version are you using, use tool --version. Have you updated to the latest version conda update package? Have you updated the complete env by running conda update --all? Have you ensured that your channel priorities are set up according to the bioconda recommendations at https://bioconda.github.io/#set-up-channels?

isoseq 4.0.0

Conda environment What is the result of conda list? (Try to paste that between triple backticks.)

Name                    Version                   Build  Channel
isoseq                    4.0.0                h9ee0642_0    bioconda

Describe the bug A clear and concise description of what the bug is.

I'm a longtime user of cDNA_Cupcake and its collapse_isoforms_by_sam.py function that has been since replaced by the isoseq collapse function. In incorporating isoseq collapse into my workflow, I've noticed that it doesn't reproduce the same results as cDNA_Cupcake -- my expectation based on the cDNA_Cupcake function is that the output should not include two isoforms that have the exact same start coordinate, end coordinate, and intron chain (exon junction coordinates). isoseq collapse has resulted in a number of redundant isoforms each time that I've run it.

Error message

No error message

To Reproduce

I've been able to reproduce this problem using the PacBio-provided Kinnex single-cell datasets from DATA-SQ2-PBMC_10kcells and DATA-SQ2-PBMC_5kcells, starting the analysis from downloading the tagged.refined.corrected.sorted.dedup.bam file, mapping it to the reference genome using pbmm2, and running isoseq collapse. The easiest way to see the redundant isoforms is to convert the resulting gff to a gtf to a genePred file (gffread to output gtf, gtfToGenePred to generate the genePred file) and then use awk -F $'\t' '{print $2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$13,$14,$15}' OFS=$'\t' collapsed.genePred | sort - | uniq -c | sort -rnk1,1 | less -S to see which isoform structures are exactly the same (start, end, intron chain).

Expected behavior A clear and concise description of what you expected to happen.

I expect isoseq collapse to output a gff with non-redundant isoforms based on start/end coordinate and intron chain.

Let me know if there's anything more information I can provide! I'd love to switch over to using isoseq collapse instead of cDNA_Cupcake as it is significantly faster :)

armintoepfer commented 5 months ago

Adding @Magdoll to triage.

TinyTasy commented 4 months ago

Hello!

Even though this was posted some while ago, I would like to add that I also observed this with my own data. In total, I find 660 redundant isoforms after using isoseq collapse. Most of them are actually Full-Splice Matches. Is there any explanation why this can happen? Thankful for any information.

Best wishes, Anastasiya

JiangyanYu commented 3 months ago

Hello,

I am facing exactly the same problem now. After the isoseq collapse step, too many redundant isoforms are formed. If I put these sequence in to UCSC browser, it is clear that they are exactly the same isoform. As a result, after make-seurat to generate a count matrix, I found more than half of the isoforms belongs to one gene.

UPDATE at 23-May-2024: I checked the parameters used in cDNA_cupcake and isoseq collapse, and realise isoseq collapse decreased the maximum allowed 5' difference if on same exon to 50bp, while cDNA_cupcake is 1000bp. Is this change the major cause of the difference?

Best, Jiangyan

Magdoll commented 2 months ago

Hi @JiangyanYu @TinyTasy @christine-liu -

The change in more redundant isoforms is likely contributed by the change in the --max-5-diff parameter default in isoseq collapse. We had changed to that parameter to be able to distinguish better isoforms with 5' differences (IMHO the original default of 1000bp was way too high). For now, can you re-run isoseq collapse with --max-5-diff set to a different default and let us know if that changes anything?

Thanks! -Liz

JiangyanYu commented 2 months ago

Dear @Magdoll ,

Thanks for the clarifications. I have tried to set the --max-5-diff as 1000bp, as the cDNA-cupcake vignette. Unfortunately, it did not help much. There are still excessive isoforms left after the whole iso-seq pipeline.

Later I realise actually in the iso-seq pipeline for scRNA data, before generating the Seurat count matrix, pigeon filter didn't filter out any isoforms. I have tried both default command: pigeon filter, and more specific setting: pigeon filter-a 0.6 -r 6-m 50 -c 3 -mono-exon. Both settings do not filter out isoforms 3'-downstream A percentage. This is super weird. Nevertheless, now after obtained the Seurat count matrix, I took the _filtered.classification.txt file that has the detailed information for each isoform, and performed filtering on the 3'-downstream A percentage and other parameters. After this, I am able to generate rather short list of isoforms.

Maybe you can comment a bit on the pigeon step, whether any low quality isoforms have been filtered out? Thanks a lot!

Best, Jiangyan

christine-liu commented 2 months ago

Hey @Magdoll Liz,

I’m positive it’s not an issue with the difference in the max-5-diff parameter because the redundant isoforms I’m observing have the exact same start, end, and intron chain coordinates. They’re just not being collapsed at all. I’d expect that changing the max-5-diff parameter to be smaller would result in more isoforms compared to the cDNA_cupcake output, but I’d still expect to see isoforms with distinct start/end/intron chain coordinates.

@JiangyanYu can you open a separate issue for your question about pigeon instead of piggybacking off of this one? The pigeon output isn’t related to the original issue here. Thanks!

On Wed, Jun 26, 2024 at 10:49 AM Jiangyan Yu @.***> wrote:

Dear @Magdoll https://github.com/Magdoll ,

Thanks for the clarifications. I have tried to set the --max-5-diff as 1000bp, as the cDNA-cupcake vignette. Unfortunately, it did not help much. There are still excessive isoforms left after the whole iso-seq pipeline.

Later I realise actually in the iso-seq pipeline for scRNA data, before generating the Seurat count matrix, pigeon filter didn't filter out any isoforms. I have tried both default command: pigeon filter, and more specific setting: pigeon filter-a 0.6 -r 6-m 50 -c 3 -mono-exon. Both settings do not filter out isoforms 3'-downstream A percentage. This is super weird. Nevertheless, now after obtained the Seurat count matrix, I took the _filtered.classification.txt file that has the detailed information for each isoform, and performed filtering on the 3'-downstream A percentage and other parameters. After this, I am able to generate rather short list of isoforms.

Maybe you can comment a bit on the pigeon step, whether any low quality isoforms have been filtered out? Thanks a lot!

Best, Jiangyan

— Reply to this email directly, view it on GitHub https://github.com/PacificBiosciences/pbbioconda/issues/651#issuecomment-2192297088, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD3FDK6V3J36K6SZT6Z76TLZJL5LZAVCNFSM6AAAAABDHIAIKOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJSGI4TOMBYHA . You are receiving this because you were mentioned.Message ID: @.***>

omarelgarwany commented 1 month ago

I'm running into exactly the same issue. Same 5' and 3' ends and identical intron chains. I run isoseq collapse with the default parameters. This is an excerpt of my GTF file:

chr4 PacBio transcript 70655543 70666508 . - . gene_id "PB.61453"; transcript_id "PB.61453.709"; chr4 PacBio exon 70655543 70656539 . - . gene_id "PB.61453"; transcript_id "PB.61453.709"; chr4 PacBio exon 70657211 70657291 . - . gene_id "PB.61453"; transcript_id "PB.61453.709"; chr4 PacBio exon 70662092 70662215 . - . gene_id "PB.61453"; transcript_id "PB.61453.709"; chr4 PacBio exon 70666427 70666508 . - . gene_id "PB.61453"; transcript_id "PB.61453.709"; chr4 PacBio transcript 70655543 70666508 . - . gene_id "PB.61453"; transcript_id "PB.61453.710"; chr4 PacBio exon 70655543 70656539 . - . gene_id "PB.61453"; transcript_id "PB.61453.710"; chr4 PacBio exon 70657211 70657291 . - . gene_id "PB.61453"; transcript_id "PB.61453.710"; chr4 PacBio exon 70662092 70662215 . - . gene_id "PB.61453"; transcript_id "PB.61453.710"; chr4 PacBio exon 70666427 70666508 . - . gene_id "PB.61453"; transcript_id "PB.61453.710";

JiangyanYu commented 1 month ago

@omarelgarwany , following steps solved my problem:

1) since I have multiple scRNA datasets that I need to merge, I followed an old protocol from @Magdoll: https://github.com/Magdoll/cDNA_Cupcake/wiki/Handling-multiple-samples:-Creating-a-master-transcriptome

note 1.1) in this protocol, TALON was used to generate a universal isoform ID from all samples, thus allowing the merging. note 1.2) I do notice after TALON, these redundant isoforms are merged into one unique isoform.

2) next step is manual filtering of the isoforms, such as based on per_A_downstream_TSS, number of cells with this isoform, number of reads supporting this isoform, etc.

Hope it helps!

Jiangyan

omarelgarwany commented 1 month ago

Hi @JiangyanYu

Thanks a lot! That's definitely reassuring. I ended up following a very similar logic, but using TAMA instead. It takes care of all those redundant isoforms as well as isoforms with 5'/3' ends that are similar enough. However, I still find it a bit puzzling that isoseq collapse does not take care of those isoforms. It's still part of my pipeline, because it gives me abundance files with I find useful to run make-seurat, so I hope this issue can be resolved soon.

Kind regards Omar

PacificBiosciences / pbbioconda

isoseq collapse results in redundant isoforms #651