Magdoll / cDNA_Cupcake

Miscellaneous collection of Python and R scripts for processing Iso-Seq data
BSD 3-Clause Clear License
257 stars 104 forks source link

duplicate multi-exon isoforms in chain_samples.py output #55

Open chenzler opened 5 years ago

chenzler commented 5 years ago

I'm having similar problems to an issue from 2017 (duplicate isoforms in chain_samples.py output), where I'm seeing duplicate isoforms in the output of chain_samples.py, however the isoforms are all multi-exon (unlike in the previous issue). It seems to stem from a similar issue, i.e., the last exon is subdivided into two exons in some transcripts/samples but not others, and that seems to cause problems with the chaining. Any ideas what I can do to force it to chain properly? Does the order of samples for the chaining (i.e. order in the config file) make a difference in this type of situation? In some small tests I've done, it seems to make a difference. I'm running the latest version of cDNA_Cupcake.

Here's a small example. Counts for 3 chained transcripts for the same gene superPBID samp1 samp2 samp3 PB.184.2 280.0 3.0 2.0 PB.184.16 280.0 3.0 1987.0 PB.184.59 280.0 5744.0

Transcript IDs from each sample assigned to chained superPBIDs superPBID samp1 samp2 samp3 PB.184.2 PB.34.2 PB.48.43 PB.25.1 PB.184.16 PB.34.2 PB.48.43 PB.25.2 PB.184.59 PB.34.2 PB.48.26

GFF data for each sample. I've removed interior exons from the gff data that are identical across all samples for brevity. Sample 1 chrX PacBio transcript 66766497 66944207 . + . gene_id "PB.34"; transcript_id "PB.34.2"; chrX PacBio exon 66766497 66766604 . + . gene_id "PB.34"; transcript_id "PB.34.2"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.34"; transcript_id "PB.34.2"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.34"; transcript_id "PB.34.2";

Sample 2 chrX PacBio transcript 66766497 66944207 . + . gene_id "PB.48"; transcript_id "PB.48.26"; chrX PacBio exon 66766497 66766604 . + . gene_id "PB.48"; transcript_id "PB.48.26"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.48"; transcript_id "PB.48.26"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.48"; transcript_id "PB.48.26"; chrX PacBio transcript 66766501 66944203 . + . gene_id "PB.48"; transcript_id "PB.48.43"; chrX PacBio exon 66766501 66766604 . + . gene_id "PB.48"; transcript_id "PB.48.43"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.48"; transcript_id "PB.48.43"; chrX PacBio exon 66943528 66943732 . + . gene_id "PB.48"; transcript_id "PB.48.43"; chrX PacBio exon 66944085 66944203 . + . gene_id "PB.48"; transcript_id "PB.48.43";

Sample 3 chrX PacBio transcript 66766495 66944195 . + . gene_id "PB.25"; transcript_id "PB.25.1"; chrX PacBio exon 66766495 66766604 . + . gene_id "PB.25"; transcript_id "PB.25.1"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.25"; transcript_id "PB.25.1"; chrX PacBio exon 66943528 66944119 . + . gene_id "PB.25"; transcript_id "PB.25.1"; chrX PacBio exon 66944122 66944195 . + . gene_id "PB.25"; transcript_id "PB.25.1"; chrX PacBio transcript 66766495 66944207 . + . gene_id "PB.25"; transcript_id "PB.25.2"; chrX PacBio exon 66766495 66766604 . + . gene_id "PB.25"; transcript_id "PB.25.2"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.25"; transcript_id "PB.25.2"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.25"; transcript_id "PB.25.2";

GFF data for the chained data: chrX PacBio transcript 66766496 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.2"; chrX PacBio exon 66766496 66766604 . + . gene_id "PB.184"; transcript_id "PB.184.2"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.184"; transcript_id "PB.184.2"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.2"; chrX

chrX PacBio transcript 66766495 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.16"; chrX PacBio exon 66766495 66766604 . + . gene_id "PB.184"; transcript_id "PB.184.16"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.184"; transcript_id "PB.184.16"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.16";

chrX PacBio transcript 66766496 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.59"; chrX PacBio exon 66766496 66766604 . + . gene_id "PB.184"; transcript_id "PB.184.59"; chrX PacBio exon 66942669 66942826 . + . gene_id "PB.184"; transcript_id "PB.184.59"; chrX PacBio exon 66943528 66944207 . + . gene_id "PB.184"; transcript_id "PB.184.59";

Magdoll commented 5 years ago

Hi @chenzler , I will be able to look at this example in more detail after the weekend. But briefly, you can use the --dun-merge-5-shorter option in chain_samples.py so the number of exons must match.

--Liz

qoiopipq commented 4 years ago

@Magdoll @chenzler I've seen duplicate multi-exon isoforms in the output, which like what you described before. These isoforms share identical internal junctions but number of exons do not match. This makes sense to me.

However, I found that identical transcripts (same coordinates, same number of exons) in my 2 samples (one in each sample) got chained into 2 separate superPBIDs (PB.1008.1 and PB.1008.3 are identical). I think these identical transcripts in 2 samples should be chained as 1 transcript in the output. But, here is the output: In chained ID file: superPBID sample 1 sample 2 PB.1008.1 NA PB.759.1 PB.1008.2 PB.659.1 NA PB.1008.3 PB.659.2 NA

In chained gff file of PB.1008.1: $ grep PB.1008.1 ./SQANTI2/all_samples.chained_classification.filtered_lite_renamed.gtf 10 PacBio transcript 127095654 127100104 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127095654 127096372 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127096630 127096742 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127097932 127098121 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127098341 127098637 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127098831 127098916 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127099403 127099457 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127099613 127099707 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127099890 127099991 . - . gene_id "PB.1008"; transcript_id "PB.1008.1"; 10 PacBio exon 127100085 127100104 . - . gene_id "PB.1008"; transcript_id "PB.1008.1";

In chained gff file of PB.1008.3: $ grep PB.1008.3 ./SQANTI2/all_samples.chained_classification.filtered_lite_renamed.gtf 10 PacBio transcript 127095654 127100104 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127095654 127096372 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127096630 127096742 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127097932 127098121 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127098341 127098637 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127098831 127098916 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127099403 127099457 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127099613 127099707 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127099890 127099991 . - . gene_id "PB.1008"; transcript_id "PB.1008.3"; 10 PacBio exon 127100085 127100104 . - . gene_id "PB.1008"; transcript_id "PB.1008.3";

In sample1 gff file of PB.659.2: $ grep PB.659.2 ./sample1/hqIsoforms.gff 10 PacBio transcript 127095654 127100104 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127095654 127096372 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127096630 127096742 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127097932 127098121 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127098341 127098637 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127098831 127098916 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127099403 127099457 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127099613 127099707 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127099890 127099991 . - . gene_id "PB.659"; transcript_id "PB.659.2"; 10 PacBio exon 127100085 127100104 . - . gene_id "PB.659"; transcript_id "PB.659.2";

In sample 2 gff file of PB.759.1: $ grep PB.759.1 ./sample2/hqIsoforms.gff 10 PacBio transcript 127095654 127100104 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127095654 127096372 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127096630 127096742 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127097932 127098121 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127098341 127098637 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127098831 127098916 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127099403 127099457 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127099613 127099707 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127099890 127099991 . - . gene_id "PB.759"; transcript_id "PB.759.1"; 10 PacBio exon 127100085 127100104 . - . gene_id "PB.759"; transcript_id "PB.759.1";

Magdoll commented 4 years ago

Hi @ qoiopipq,

There was a chaining bug that I recently fixed in the latest Cupcake version v10.0.0. Can you try that and let me know if you find any more bugs?

Thanks, -Liz

qoiopipq commented 4 years ago

HI @Magdoll

I tried using the latest Cupcake version v10.0.0, but still getting the same results.

Here is the superPBID list: $ grep PB.1008 all_samples.chained_ids.txt superPBID sample1 sample2 PB.1008.1 PB.659.1 NA PB.1008.2 PB.659.2 NA PB.1008.3 NA PB.759.1

PB.1008.3 and PB..1008.2 are identical, showing in the chained GFF file: $ grep PB.1008.3 all_samples.chained.gff chr10 PacBio transcript 127095654 127100104 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127095654 127096372 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127096630 127096742 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127097932 127098121 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127098341 127098637 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127098831 127098916 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127099403 127099457 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127099613 127099707 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127099890 127099991 . - . gene_id "PB.759"; transcript_id "PB.1008.3"; chr10 PacBio exon 127100085 127100104 . - . gene_id "PB.759"; transcript_id "PB.1008.3";

$ grep PB.1008.2 all_samples.chained.gff chr10 PacBio transcript 127095654 127100104 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127095654 127096372 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127096630 127096742 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127097932 127098121 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127098341 127098637 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127098831 127098916 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127099403 127099457 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127099613 127099707 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127099890 127099991 . - . gene_id "PB.659"; transcript_id "PB.1008.2"; chr10 PacBio exon 127100085 127100104 . - . gene_id "PB.659"; transcript_id "PB.1008.2";

I also noticed that some identical transcripts in two different samples were chained properly, which are single-exon transcripts or transcripts with small number of exons.

Cheers!

qoiopipq commented 4 years ago

Hi Liz @Magdoll

One extra point I want to make. I've been using Ensembl annotation for this. I noticed that chain_samply.py doesn't work if GFF files in different samples do not start with 'chr' when using Cupcake version 10.0.0. Adding 'chr' at the start of each line seems to fix the problem, but still having the chaining bug I discussed before. Just want to know is it only compatible with RefSeq annotation?

Cheers!