Chaining generating redundant isoforms?

SziKayLeung commented 3 years ago

Hello Liz,

Hope you are well. I was trying out the chaining on 11 of my samples, and the script ran successfully. However, I noticed some seemingly redundant transcripts that appear to be the same but are labelled as different isoforms. For one particular gene (same first part PacBio ID - "PB.2141"), there were over 23,000 isoforms annotated to it....

I have attached an example of the fasta sequence of two isoforms from this gene and a screenshot from UCSC (blasting the two sequences also gave 100% identity after accounting for small letters/capitals). All the other files generated from chaining can be found here: https://www.dropbox.com/s/6nfwsrn4nxjh96g/Chain_Query.tar.gz?dl=0

cupcake version: v17.0 (run in python 3) chain_samples.py Chain_Configuration.txt count_fl --dun-merge-5-shorter 2> mm10_Chained_Configuration.log

chain_redundant.txt redundant_transcript

Wondering what your thoughts are on this and if I'm missing anything since the new update - any guidance on this would be great. Thank you for all your help,

Best, Szi Kay

P.S one of my samples failed to chain and kept crashing with the split gff files being generated inside the folder rather than outside. Removing that sample from the configuration file worked fine though...

SziKayLeung commented 3 years ago

Hello Liz,

It appears that all the overinflated isoforms are from single-exon genes (mono-exon transcripts) and I know that this is an inherent problem of chaining with regards to this (looking back on the previous issues).

Is there a way to overcome this - (perhaps removing the monoexonic transcripts prior to chaining). I've played around with the -max_3_diff parameter, but this hasn't made much of a difference.

Thanks, Szi Kay

Magdoll commented 3 years ago

Hi @SziKayLeung did you look at the --max_5_diff parameter (since most monoexonic diff I expect is on the 5' end)?

SziKayLeung commented 3 years ago

Hi Liz, If I'm not mistaken and I'm using the chain_samples.py script, there isn't a --max_5_diff parameter?

usage: chain_samples.py [-h] [--fuzzy_junction FUZZY_JUNCTION] [--dun-merge-5-shorter] [--max_3_diff MAX_3_DIFF] [--cpus CPUS] config_file {norm_fl,count_fl}

positional arguments: config_file {norm_fl,count_fl} Which count field to use for chained sample (default: count_fl)

optional arguments: -h, --help show this help message and exit --fuzzy_junction FUZZY_JUNCTION Max allowed distance in junction to be considered identical (default: 0 bp) --dun-merge-5-shorter Don't collapse shorter 5' transcripts (default: off) --max_3_diff MAX_3_DIFF Maximum 3' difference allowed (default: 30bp) --cpus CPUS Number of CPUs to use for multi-threading (default: 8)

Magdoll commented 3 years ago

Hi @SziKayLeung , sorry i was thinking about the collapse_isoforms_by_sam.py step where there is a --max_5_diff parameter which comes before running chain_samples.py. Did you look at the mono-exonic transcripts and see if within the same sample there are multiple mono-exonic transcripts that only differ by some amount of 5' differences, and how big were those 5' diffs? -Liz

Magdoll / cDNA_Cupcake

Chaining generating redundant isoforms? #141