GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
129 stars 25 forks source link

key error 'UnnamedSample_HQ_transcript' read support #24

Closed BeatrizdeToledo closed 4 years ago

BeatrizdeToledo commented 4 years ago

Good afternoon! I have some 5’-cap selected Iso-Seq data. I aligned with gmap (and still gonna alignment with desalt) and after used tama collapse. However, occurs a key error 'UnnamedSample_HQ_transcript'. You can see bellow the command, the beginning of my collapsed bed file, my hq_transcript file I used for aligning and my cluster file. Just as an extra information, my file contains 3 samples that after this I will demultiplex. Do you know why this error is occurring?

-bash-4.1$ python tama_read_support_collapse_cluster.py highquality_gmapn1_collapse_capped_trans_read.bed unpolished.cluster_report.csv output_file

USAGE: python tama_collapse_trans_support_cluster.py prefix_trans_read.bed cluster_file output_file opening collapse opening cluster Interpretting as Iso-Seq1 cluster Going through cluster Traceback (most recent call last): File "tama_read_support_collapse_cluster.py", line 148, in cluster_read_list = list(cluster_read_dict[cluster_id].keys()) KeyError: 'UnnamedSample_HQ_transcript'

-bash-4.1$ head highquality_gmapn1_collapse_capped_trans_read.bed 1 3136605 3136668 G1.1;UnnamedSample_HQ_transcript/213864 40 - 3136605 3136668 255,0,0 1 63 0 1 3199735 3202443 G2.1;UnnamedSample_HQ_transcript/132038 40 - 3199735 3202443 255,0,0 1 2708 0 1 3201364 3205200 G2.2;UnnamedSample_HQ_transcript/72631 40 - 3201364 3205200 255,0,0 1 3836 0 1 3417436 3418825 G3.1;UnnamedSample_HQ_transcript/204620 40 - 3417436 3418825 255,0,0 1 1389 0 1 4744318 4748209 G4.1;UnnamedSample_HQ_transcript/70487 40 + 4744318 4748209 255,0,0 1 3891 0 1 4758097 4758256 G5.1;UnnamedSample_HQ_transcript/213693 40 + 4758097 4758256 255,0,0 1 159 0 1 4761262 4785718 G6.1;UnnamedSample_HQ_transcript/54434 40 - 4761262 4785718 255,0,0 6 3480,194,124,166,155,146 0,13060,16262,21305,22688,24310 1 4769792 4785731 G6.2;UnnamedSample_HQ_transcript/23568 40 - 4769792 4785731 255,0,0 5 4724,124,166,155,159 0,7732,12775,14158,15780 1 4773210 4785709 G6.3;UnnamedSample_HQ_transcript/184765 40 - 4773210 4785709 255,0,0 5 1306,124,166,155,137 0,4314,9357,10740,12362 1 4773210 4785709 G6.4;UnnamedSample_HQ_transcript/55166 40 - 4773210 4785709 255,0,0 5 3591,124,166,155,137 0,4314,9357,10740,12362

-bash-4.1$ head hq_transcripts.fasta

UnnamedSample_HQ_transcript/0 GCGCCCGAGCTTAGTTCTGGTGGGCGGACGCCGCGGCTTTAAGCCGGGGCGGGACTGAGC GAGGAGCAGCAGCGGCGTCGGCGGCGAGAGTTCTTGGGGACTCGGGCCCGGATGCTGGGG GCTAGTCGGCAGGTCCCGGCTGGATAAACTCATGGACTTGGCAGCTCAGGCTTGAGAAAA ACACAAACTCAGAAGAAAGATGGCGTTCGTTGCAACCCAGGGGGCCACGGTGGTCGACCA AACCACTCTGATGAAGAAATACCTTCAGTTTGTGGCAGCTCTCACGGATGTGAATACACC TGATGAAACGAAGTTGAAAATGATGCAGGAAGTCAGCGAGAACTTTGAGAATGTCACGTC ATCTCCTCAGTATTCTACATTCCTGGAACACATCATCCCTCGATTTCTTACCTTTCTCCA AGATGGAGAAGTCCAGTTTCTTCAGGAGAAGCCAGCCCAGCAACTACGGAAGCTGGTACT TGAGATACTCCACAGGATTCCAACCAATGAGCATCTGCGTCCTCATACGAAAAACGTTCT

-bash-4.1$ head unpolished.cluster_report.csv cluster_id,read_id,read_type transcript/0,m54345U_200310_133837/123864620/ccs,FL transcript/0,m54345U_200310_133837/13501881/ccs,FL transcript/0,m54345U_200310_133837/4654108/ccs,FL transcript/0,m54345U_200310_133837/1706014/ccs,FL transcript/0,m54345U_200310_133837/159711271/ccs,FL transcript/1,m54345U_200310_133837/18219976/ccs,FL transcript/1,m54345U_200310_133837/24512758/ccs,FL transcript/1,m54345U_200310_133837/134808826/ccs,FL transcript/2,m54345U_200310_133837/20250751/ccs,FL

GenomeRIK commented 4 years ago

Hello Beatriz,

Are you trying to get read support? If so, I suggest using "tama_read_support_levels.py": https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Read-Support

So you would run it like this: python tama_read_support_levels.py -f filelist.txt -o readsupport_all -m trans_red.bed -mt tama

And the "filelist.txt" file would look like this (tab separated): sample1 cluster_report.csv cluster

However, I noticed that the ID's in your "highquality_gmapn1_collapse_capped_trans_read.bed" file do not match the ID's in your "unpolished.cluster_report.csv". So the inputs you used will not work since they need to have ID relationships.

What processing did you do prior to mapping with GMAP (step by step)?

Thank you, Richard

BeatrizdeToledo commented 4 years ago

Good afternoon! Thanks for your quick reply and really amazing tools! I also noticed that the ID's in "highquality_gmapn1_collapse_capped_trans_read.bed" file do not match the ID's in "unpolished.cluster_report.csv". Good to know that because of it the inputs I used will not work since they need to have ID relationships.

The first steps were done by our sequencing facility. Since I can find these "unmatching" IDs already in the files the facility sent me (hq_transcripts.fasta is the high quality transcript fasta i used to aling), I will contact them to find out why this happened. Have you ever seen this type of problem happening?

They used the SMRT link version 8.0.0.80529. the facility performed the CCS run. then with the CCS data set the Isoseq Analysis was done. Then the PACBIO bioinformatic support advised me to align and collapse the data and only after demultiplex for my 3 cell types (I knows many people do not advise to run the cluster/polish step without first separating the samples, but I wanted to try it anyway).

Thank you Beatriz

GenomeRIK commented 4 years ago

Hi Beatriz,

Thank you for using TAMA!

I am not sure I understand what exactly was done with the data prior to mapping and who did each step of the analysis.

From my understanding you are describing this: Your sequencing centre: Processed subreads to CCS reads and then sent you the CCS reads in bam format?

You: You converted the BAM format into fasta and then mapped to the genome?

Or did they send you the fasta file?

I need to know exactly what was done prior to mapping to be able to help you. Could you list it out in detail? Including all the files that you received from the sequencing centre and all the files that you created.

Cheers, Richard

BeatrizdeToledo commented 4 years ago

Sorry if it was not clear enough, That's the email the facility sent me,

"the first step of the analysis was the CCS run You got the CCS report and the m5434U_200310_133837.Q20.fastq file

There additional files from the CCS run available: File Size Type m54345U_200310_133837.Q10.fasta 13 GB Fasta m54345U_200310_133837.Q20.fasta 12 GB Fasta m54345U_200310_133837.ccs.bam 9 GB ConsensusReadBamFile L49711-Cell1 (CCS) 34 KB ConsensusReadSet m54345U_200310_133837.Q10.fastq 26 GB Fastq m54345U_200310_133837.Q20.fastq 23 GB Fastq CCS Per-Read Details 25 MB zip Analysis Log 181 KB log SMRT Link Log 8 KB log

With the CCS data set the Isoseq Analysis was done. You got the Isoseq analysis report, Full length non concatemer reads bam file and the High quality isoforms fasta file.

Further files are available: File Size Type Full-length Non-Concatemer Report 231 MB csv Low-Quality Isoforms 4 MB Fasta Cluster Report 149 MB csv Full-Length Non-Concatemer Reads 7 GB bam Isoform Counts by Barcode 5 MB csv High-Quality Isoforms 708 MB Fasta Analysis Log 36 KB log SMRT Link Log 6 KB log "

Then I High quality isoforms fasta file was mapped to the genome


From: GenomeRIK notifications@github.com Sent: Sunday, April 26, 2020 16:30 To: GenomeRIK/tama Cc: Cardoso de Toledo, Beatriz; Author Subject: Re: [GenomeRIK/tama] key error 'UnnamedSample_HQ_transcript' read support (#24)

Hi Beatriz,

Thank you for using TAMA!

I am not sure I understand what exactly was done with the data prior to mapping and who did each step of the analysis.

From my understanding you are describing this: Your sequencing centre: Processed subreads to CCS reads and then sent you the CCS reads in bam format?

You: You converted the BAM format into fasta and then mapped to the genome?

Or did they send you the fasta file?

I need to know exactly what was done prior to mapping to be able to help you. Could you list it out in detail? Including all the files that you received from the sequencing centre and all the files that you created.

Cheers, Richard

- You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/GenomeRIK/tama/issues/24#issuecomment-619560965, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AOZRVKRYQONFTY75WKTLJOLRORAR5ANCNFSM4MQFCYPA.

GenomeRIK commented 4 years ago

Hi Beatriz,

Ok it looks like the sequencing centre might have done something non-standard with the data processing after CCS which will make it hard to use the post CCS files for downstream processing.

I suggest you take the CCS bam files and process like so: LIMA (with sample demultiplexing) Refine Minimap2 TAMA Collapse

I would skip the Cluster/Polish step.

As for demultiplexing after cluster/polish, how would you do this? And why do you want to try it out?

Cheers, Richard

BeatrizdeToledo commented 4 years ago

Good afternoon Richard! I finally got a reply explaining the "unmatching ids". So I will just write here and then close the issue.

Basically the hq_transcripts.fasta from a SMRT Link v8.0 Iso-Seq job (using multiple primers for a barcoded library, but no reference genome) has the prefix “>UnnamedSampleHQ” for each isoform. So this is expected to be like this and UnnamedSample_HQ_transcript/0 would be equivalent to transcript/0. The simplest way to solve it is delete UnnamedSampleHQ from the fasta file.

If this doesn't work I will try to do it from the ccs files.

Thanks a lot for quickly replying my questions!