GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
125 stars 24 forks source link

Read Support/Saturation Curve issue; tama_sampling_saturation_curve.py --> "Error with read support multiple genes!" [non-unique transcript names] #110

Closed mrendleman closed 11 months ago

mrendleman commented 12 months ago

Primary issue

Hi! I'm trying to combine two library preps (not 5' selected) into a single transcriptome and create a gene saturation curve. For the individual runs, I mapped their HQ isoform clusters .bam output from IsoSeq3 cluster, and then used TAMA collapse, merge, and read_support_levels following the wiki's instructions. I think TAMA collapse and merge worked correctly. The read support script doesn't throw any error, but the read support .txt gives tama_sampling_saturation_curve.py some trouble:

$ tama_sampling_saturation_curve.py -r samples_rsup_read_support.txt -b 10000 -o samples_curve_10kbins.txt
opening report file
Error with read support multiple genes!
transcript/29179        G58     G139

A quick grep through the read_support file shows that the transcript does provide support for more than one gene:

$ grep 'transcript/29179' samples_rsup_read_support.txt : image

Here's the read support call: tama_read_support_levels.py -f samples_rsup_files.txt -m samples_merged_merge.txt -o samples_rsup, where samples_rsup_files.txt is:

sample1       sample1_trans_read.bed        trans_read
sample2       sample2_trans_read.bed        trans_read

Any idea what might be going wrong?

Extra details

Here are the commands I used:

Mapped with: pbmm2 align --preset ISOSEQ -j 12 --sort sample1.clustered.hq.bam /path/to/my/ref/genome.fa.isoseq.mmi sample1.mapped.bam

Collapsed with: tama_collapse.py -b BAM -s sample1.mapped.bam -f /path/to/my/ref/genome.fa -p sample1 -x no_cap

Merged with: tama_merge.py -f samples_files.txt -p samples_merged, where samples_files.txt is:

sample1_collapsed.bed no_cap  1,1,1   sample1
sample2_collapsed.bed no_cap  1,1,1   sample2

While troubleshooting, I wondered if I should be giving tama_read_support_levels.py the cluster_report.csv files instead of the trans_read.bed since the collapsed models came from isoseq cluster output, but that causes a different error and seems incorrect based on the wiki docs.

mrendleman commented 11 months ago

Update: It turns out that the transcripts aren't actually supporting more than one gene, it's that the transcripts don't have unique transcript IDs. Sample 1's transcript/29179 supported G139, and sample 2's transcript/29179 supported G58.

To fix the issue, I manually find/replaced the transcript IDs in the individual samples' trans_read.bed files so that they're unique, e.g. replacing "transcript" with "transS1" for sample1. I hope this saves someone else some trouble!