GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
128 stars 25 forks source link

Could tama_merge merge three files? #76

Closed zmz1988 closed 2 years ago

zmz1988 commented 2 years ago

Hi Richard, sorry it's me again. I'm trying to collect all the transcripts from short-read prediction, long-read transcripts and the reference (the liftoff gff3 file). But when I input three files, tama_merge did not finished and only outputted 35 genes from chr1. But then I tried to use only two files (the first two in below bed_lst.txt file), and tama_merge completed the work. So I'm wondering whether it's the setting of my bed_list is not right, or tama_merge can't deal with merge fo three files.

Here is my code for the three merging files

$ python /tama/tama_merge.py -f bed_lst.txt -p Sample_liftoff_tsebra_sqanti_tama -cds SQ,TS,LO -s LO,TS,SQ &> tama_merge.log

# the bed_lst.txt file
$ head bed_lst.txt
sample_braker_tsebra.gff3.bed   no_cap  1,1,1   TS
sample_classification.filtered_lite.gtf.bed capped  1,1,1   SQ
sample_liftoff.gff3_polished.bed    no_cap  2,1,1   LO

$ tail tama_merge.log
Chr1_RagTag_p   120384  122119  G36;G36.tmp.2   40  +   120384  122119  200,0,255   1   1735    0
a###########################################
Chr1_RagTag_p   120415  121873  TS_g30.t1;TS_g30.t1 40  +   120415  121873  255,0,0 1   1735    -31
Chr1_RagTag_p   120293  122115  LO_AT1G01300.1;LO_AT1G01300.1   40  +   120293  122115  255,0,0 1   1822    0
b###########################################
Chr1_RagTag_p   120415  121873  TS_g30.t1;TS_g30.t1 40  +   120415  121873  255,0,0 1   1735    -31
Chr1_RagTag_p   120384  122119  SQ_PB.6.1;SQ_PB.6.1 40  +   120384  122119  255,0,0 1   1735    0
By default TAMA merge does not allow merging of duplicate transcript groups.
Duplicate transcript groups occur when different groupings of transcripts results in the same collapsed model.
If you would like to merge duplicate transcript groups please add -d merge_dup to the arguments.
GenomeRIK commented 2 years ago

Hello,

The answer is in the log output.

Use "-d merge_dup" when running TAMA Merge.

The meaning behind this behavior and what it says about your models is a bit complicated. I am happy to explain but it requires visuals and would be easier to explain over voice communication.

Cheers, Richard

zmz1988 commented 2 years ago

Hi Richard, I'm really sorry to bother you again. I tried with "-d merge_dup" for merging of three files, and tama_merge runs fine. But when I run bedtools getfasta to produce transcripts for the ORF annotation step, I got 660 more lines with Ns in the transcript fasta file. I don't know how this happened. But when I tried using the tama_merge result of two input files, no lines with Ns in the transcript fasta file.

I followed exact commands as suggested in TAMA GO: ORF and NMD predictions.

GenomeRIK commented 2 years ago

Hello,

I am not exactly sure what you mean here. Could you show me some examples?

Cheers, Richard

zmz1988 commented 2 years ago

Thanks for the super fast reply!

(cupcake) $ python /tama/tama_merge.py -f bed_lst.txt -p Col_liftoff_tsebra_sqanti_tama1 -cds PB,Tair,BR -s Tair,BR,PB -d merge_dup &> tama_merge1.log 

(cupcake) $ tail tama_merge1.log
1
hunter id unsearched_count: Tair_ATMG01160.1
['Tair_ATMG01160.1']
Tair_ATMG01160.1
running hunter_prey_nocap
1
hunter id unsearched_count: Tair_ATMG01170.1
['Tair_ATMG01170.1']
Tair_ATMG01170.1
TAMA Merge has completed successfully!

(cupcake) $ head sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p   6982    9251    G1;G1.1;AT1G01010;AT1G01010.1   40  +   7111    8982    255,200,0   6   283,281,120,390,153,461 0,365,855,1075,1543,1808
Chr1_RagTag_p   7111    8982    G1;G1.2;g1;g1.t1    40  +   7111    8982    255,0,0 6   154,281,120,390,153,192 0,236,726,946,1414,1679
Chr1_RagTag_p   9279    12090   G2;G2.1;AT1G01020;AT1G01020.1   40  -   10267   12019   255,200,0   10  336,633,76,67,86,74,46,90,48,167    0,510,1230,1457,1637,1835,2015,2309,2490,2644
Chr1_RagTag_p   9780    10008   G2;G2.2;g2;g2.t1    40  -   9780    10008   255,0,0 1   228 0
Chr1_RagTag_p   10142   12090   G2;G2.3;AT1G01020;AT1G01020.2   40  -   10667   12019   255,200,0   8   280,294,86,74,46,90,48,167  0,367,774,972,1152,1446,1627,1781
Chr1_RagTag_p   10267   12019   G2;G2.4;g3;g3.t1    40  -   10267   12019   255,0,0 9   155,76,67,86,74,46,90,48,96 0,242,469,649,847,1027,1321,1502,1656
Chr1_RagTag_p   15001   17065   G3;G3.1;AT1G01030;AT1G01030.1   40  -   15216   16293   255,200,0   2   1523,380    0,1684

(cupcake) $ bedtools getfasta -name -split -s -fi ../../EDTA_RepbaseLib/sample.fasta.mod.MAKER.masked -bed sample_liftoff_tsebra_sqanti_tama1.bed -fo sample_liftoff_tsebra_sqanti_tama1.transcript.fasta

(cupcake) $ grep "N" sample_liftoff_tsebra_sqanti_tama1.transcript.fasta | head
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGAGGTTTCTTCTTTCTCCACTGTACCCAGATACTGTAACTCACGCTCCTTTGTTCCAGAATTGTCTCGTTTCAGGGGATTCAAAGTTCACTTATGGGACCAATCTCTTGTTCCTCTTCATTTCAGCATTCGAAAACGAAAGAACACCCCTCGTTATTTGATTTCTTGCAGTCAGAAGAAAGATGTCACTGTTGTTGATGGAAGTTGTATGGATGAGATCTATGATAAGTTGGCGGAACGTCTTGTCCCTACAGCAGCGGCAATGTTTAGCCCCAATCTTAAACGTTTGGTCGGTTTGGCTGGTCCTCCTGGTGCCGGGAAAAGTACAGTAGCAAATGAAGTTGTGCGTCGTGTAAATAAGCTATGGCCTCAGAAAGCTGCTTCTTTTGATGCTGAGGTTAATCCACCGGATGTTGCTATAGTACTTCCCATGGATGGTTTCCATTTGTATCGTTCGCAACTCGATGCTATGGAGGATCCCAAAGAAGCCCATGCGAGAAGAGGGGCTCCGTGGACTTTTGATCCTGCACTACTGCTTAACTGTTTAAAGA

(cupcake) $ head sample_liftoff_tsebra_sqanti_tama1.transcript.fasta
>G1;G1.1;AT1G01010;AT1G01010.1::Chr1_RagTag_p:6982-9251(+)
AAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTCAACATCTGTAGCTACGATCCTTGGAACTTGCGCTTCCAGTCAAAGTACAAATCGAGAGATGCTATGTGGTACTTCTTCTCTCGTAGAGAAAACAACAAAGGGAATCGACAGAGCAGGACAACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTT
>G1;G1.2;g1;g1.t1::Chr1_RagTag_p:7111-8982(+)
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTCAACATCTGTAGCTACGATCCTTGGAACTTGCGCTTCCAGTCAAAGTACAAATCGAGAGATGCTATGTGGTACTTCTTCTCTCGTAGAGAAAACAACAAAGGGAATCGACAGAGCAGGACAACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTTGTAGTGAGGGCTTTCGTGGTAAGATTGGTCATAAAAGGGTTTTGGTGTTCCTCGATGGAAGATACCCTGACAAAACCAAATCTGATTGGGTTATCCACGAGTTCCACTACGACCTCTTACCAGAACATC
GenomeRIK commented 2 years ago

What is the transcript ID for the fasta sequence that has the NNNN's?

And can you show me the bed line corresponding to that transcript model?

It seems like you have models which overlap with N space which is why this is happening. Probably these are coming from your liftoff annotation.

Cheers, Richard

zmz1988 commented 2 years ago

I just checked again where are the Ns in my assembly, and it seems that I don't have any Ns in Chr1. The samples showed below are from Chr1. So I guess it's something else that causes the problem.

>G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1::Chr1_RagTag_p:701912-702221(-) ATGGAGAAGATATCAAATTTGTTAGAAGACAAGCCCGTGGTGATATTCAGCAAGACGTCCTGCTGTATGAGTCACTCGATCAAGTCGCTTATATCTGGTTACGGTGCGAATTCAACAGTGTATGAGCTAGACGAAATGTCTAATGGACCAGAGATCGAACGAGCACTTGTAGAGCTTGGGTGCAAACCGACTGTGCCAGCTGTCTTTATAGGGCAAGAGCTCGTAGGTGGTGCAAATCAACTTATGTCTCTTCAAGTCAGGAACCAACTAGCTTCGTTGCTCCGAAGAGCTGGAGCCATATGGATTTAA >G225;G225.1;AT1G03022;AT1G03022.1::Chr1_RagTag_p:702826-702952(+)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>G226;G226.1;AT1G03030;AT1G03030.1::Chr1_RagTag_p:705183-707570(+)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGAGGTTTCTTCTTTCTCCACTGTACCCAGATACTGTAACTCACGCTCCTTTGTTCCAGAATTGTCTCGTTTCAGGGGATTCAAAGTTCACTTATGGGACCAATCTCTTGTTCCTCTTCATTTCAGCATTCGAAAACGAAAGAACACCCCTCGTTATTTGATTTCTTGCAGTCAGAAGAAAGATGTCACTGTTGTTGATGGAAGTTGTATGGATGAGATCTATGATAAGTTGGCGGAACGTCTTGTCCCTACAGCAGCGGCAATGTTTAGCCCCAATCTTAAACGTTTGGTCGGTTTGGCTGGTCCTCCTGGTGCCGGGAAAAGTACAGTAGCAAATGAAGTTGTGCGTCGTGTAAATAAGCTATGGCCTCAGAAAGCTGCTTCTTTTGATGCTGAGGTTAATCCACCGGATGTTGCTATAGTACTTCCCATGGATGGTTTCCATTTGTATCGTTCGCAACTCGATGCTATGGAGGATCCCAAAGAAGCCCATGCGAGAAGAGGGGCTCCGTGGACTTTTGATCCTGCACTACTGCTTAACTGTTTAAAGAAGCTGAAAAACGAGGGCTCAGTTTATGTGCCATCATTTGATCATGGAGTTGGGGATCCAGTTGAAGATGATATCTTTGTTAGCCTCCAGCATAAAGTGGTAATTGTAGAAGGAAACTACATCTTGTTAGAAGAAGGGTCTTGGAAAGACATCTCAGATATGTTTGATGAGAAATGGTTCATCGATGTGAATCTCGATACAGCAATGCAACGTGTTGAAAACCGGCATATCTCAACTGGTAAACCCCCAGATGTTGCTAAATGGCGGGTGGATTATAATGACCGACCCAACGCAGAGCTGATAATAAAGTCAAAGACAAACGCTGATCTACTGATCAGATCTATGAACATTTGAATCTACTTTCAAGGGGGCTATTGGACGAAACAGAATAAGATGAATAACAAAAGACAATGTTGCGCTTCAAGGGAAGATTGCCTTTGGTCTACAATGTCTTTTTCTTGCAACGGAAACTTCCAATAAGCAATTTGTTCGCAGTAAACTGTGTATGCTAGTGTTCTCTGCATACTCACGGATTTTGTTCAGACAATGTGTTTTATACTTCTAAATTAGAAGTTCTAATGCATAAAGTGTTTGAATGCGCGGGAAATTACCTAATTATAATCAGTCCAGCTCCCAACTATATCATTTTG

(cupcake) $ grep "G224.1" sample_liftoff_tsebra_sqanti_tama1.bed 
Chr1_RagTag_p   701912  702221  G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1  40  -   701912  702221  200,0,255   1   309 0

(cupcake) $ grep "G225.1" sample_liftoff_tsebra_sqanti_tama1.bed 
Chr1_RagTag_p   702826  702952  G225;G225.1;AT1G03022;AT1G03022.1   40  +   702826  702952  255,200,0   1   126 0

(cupcake) $ grep "G226.1" sample_liftoff_tsebra_sqanti_tama1.bed 
Chr1_RagTag_p   705183  707570  G226;G226.1;AT1G03030;AT1G03030.1   40  +   705326  707274  255,200,0   12  204,101,44,78,193,31,59,74,87,62,29,383 0,297,498,628,798,1089,1201,1350,1534,1717,1875,2004
GenomeRIK commented 2 years ago

If you run the bed to fasta on just the liftoff bed file do you get the same thing?

This doesn't seem like a TAMA issue but rather something weird going on with the matching of the coordinates to the genome assembly. It may be possible that these models are mapping off of the contigs in your assembly? What is the length of "Chr1_RagTag_p"?

zmz1988 commented 2 years ago

If it's only the liftoff bed file, no Ns is found. Chr1 is 32 Mbp.

GenomeRIK commented 2 years ago

Are the bed line coordinates the same for the same transcript between the liftoff only and the merge file? Can you show me some comparisons?

zmz1988 commented 2 years ago

Here it is. Use the above G224.1 and G225.1 as examples.

#G224.1 in the liftoff bed file
(cupcake) $ grep "AT1G03020.1" sample_liftoff_Namecorrected.gff3_polished.bed 
Chr1_RagTag_p   701912  702221  AT1G03020;AT1G03020.1;AT1G03020.1;mRNA  40  -   701912  702221  255,0,0 1   309 0

(cupcake) $ grep "G224.1" sample_liftoff_tsebra_sqanti_tama1.bed 
Chr1_RagTag_p   701912  702221  G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1  40  -   701912  702221  200,0,255   1   309 0

#G225.1 in the liftoff bed file
(cupcake) $ grep "AT1G03022.1" sample_liftoff_Namecorrected.gff3_polished.bed 
Chr1_RagTag_p   702826  702952  AT1G03022;AT1G03022.1;AT1G03022.1;mRNA  40  +   702826  702952  255,0,0 1   126 0

(cupcake) $ grep "G225.1" sample_liftoff_tsebra_sqanti_tama1.bed 
Chr1_RagTag_p   702826  702952  G225;G225.1;AT1G03022;AT1G03022.1   40  +   702826  702952  255,200,0   1   126 0
GenomeRIK commented 2 years ago

Ok so the coordinates are not changing at all. This means the error is coming from somewhere else.

You can see that the models haven't changed so this has nothing to do with TAMA Merge.

Where there any differences between the bed to fasta runs? Did you do them all at the same time?

What happens if you re-run all the bed to fasta in a new folder and make sure to use new instances of the genome assembly?

I will try to help you a bit but this is clearly something weird with the bed to fasta runs and not the TAMA Merge output (at least from the information I am seeing in this thread).

zmz1988 commented 2 years ago

Thanks, Richard! I get your point, and this is also what confused me. The environment, the bedtools getfasta options, the input genome, and the working directory are exactly the same for all of above runs mentioned. As you suggested, I also re-run all the bed to fasta in a new folder, I get the same fasta files with Ns...

At the beginning I thought it's probably related to -d merge_dup, as it only happened when I have to use this option. But you're right that all the intervals are the same as in the original bed file, and this is what bedtools only considers when extracting sequences. I guess it's better I don't occupy you for this matter. I will try to do some tests around and see what the problem is. Thanks a lot!

GenomeRIK commented 2 years ago

Ok I am closing this thread now but feel free to start a new now if you have other issues.