Closed zmz1988 closed 2 years ago
Hello,
The answer is in the log output.
Use "-d merge_dup" when running TAMA Merge.
The meaning behind this behavior and what it says about your models is a bit complicated. I am happy to explain but it requires visuals and would be easier to explain over voice communication.
Cheers, Richard
Hi Richard, I'm really sorry to bother you again. I tried with "-d merge_dup" for merging of three files, and tama_merge runs fine. But when I run bedtools getfasta
to produce transcripts for the ORF annotation step, I got 660 more lines with Ns in the transcript fasta file. I don't know how this happened. But when I tried using the tama_merge result of two input files, no lines with Ns in the transcript fasta file.
I followed exact commands as suggested in TAMA GO: ORF and NMD predictions.
Hello,
I am not exactly sure what you mean here. Could you show me some examples?
Cheers, Richard
Thanks for the super fast reply!
(cupcake) $ python /tama/tama_merge.py -f bed_lst.txt -p Col_liftoff_tsebra_sqanti_tama1 -cds PB,Tair,BR -s Tair,BR,PB -d merge_dup &> tama_merge1.log
(cupcake) $ tail tama_merge1.log
1
hunter id unsearched_count: Tair_ATMG01160.1
['Tair_ATMG01160.1']
Tair_ATMG01160.1
running hunter_prey_nocap
1
hunter id unsearched_count: Tair_ATMG01170.1
['Tair_ATMG01170.1']
Tair_ATMG01170.1
TAMA Merge has completed successfully!
(cupcake) $ head sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 6982 9251 G1;G1.1;AT1G01010;AT1G01010.1 40 + 7111 8982 255,200,0 6 283,281,120,390,153,461 0,365,855,1075,1543,1808
Chr1_RagTag_p 7111 8982 G1;G1.2;g1;g1.t1 40 + 7111 8982 255,0,0 6 154,281,120,390,153,192 0,236,726,946,1414,1679
Chr1_RagTag_p 9279 12090 G2;G2.1;AT1G01020;AT1G01020.1 40 - 10267 12019 255,200,0 10 336,633,76,67,86,74,46,90,48,167 0,510,1230,1457,1637,1835,2015,2309,2490,2644
Chr1_RagTag_p 9780 10008 G2;G2.2;g2;g2.t1 40 - 9780 10008 255,0,0 1 228 0
Chr1_RagTag_p 10142 12090 G2;G2.3;AT1G01020;AT1G01020.2 40 - 10667 12019 255,200,0 8 280,294,86,74,46,90,48,167 0,367,774,972,1152,1446,1627,1781
Chr1_RagTag_p 10267 12019 G2;G2.4;g3;g3.t1 40 - 10267 12019 255,0,0 9 155,76,67,86,74,46,90,48,96 0,242,469,649,847,1027,1321,1502,1656
Chr1_RagTag_p 15001 17065 G3;G3.1;AT1G01030;AT1G01030.1 40 - 15216 16293 255,200,0 2 1523,380 0,1684
(cupcake) $ bedtools getfasta -name -split -s -fi ../../EDTA_RepbaseLib/sample.fasta.mod.MAKER.masked -bed sample_liftoff_tsebra_sqanti_tama1.bed -fo sample_liftoff_tsebra_sqanti_tama1.transcript.fasta
(cupcake) $ grep "N" sample_liftoff_tsebra_sqanti_tama1.transcript.fasta | head
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGAGGTTTCTTCTTTCTCCACTGTACCCAGATACTGTAACTCACGCTCCTTTGTTCCAGAATTGTCTCGTTTCAGGGGATTCAAAGTTCACTTATGGGACCAATCTCTTGTTCCTCTTCATTTCAGCATTCGAAAACGAAAGAACACCCCTCGTTATTTGATTTCTTGCAGTCAGAAGAAAGATGTCACTGTTGTTGATGGAAGTTGTATGGATGAGATCTATGATAAGTTGGCGGAACGTCTTGTCCCTACAGCAGCGGCAATGTTTAGCCCCAATCTTAAACGTTTGGTCGGTTTGGCTGGTCCTCCTGGTGCCGGGAAAAGTACAGTAGCAAATGAAGTTGTGCGTCGTGTAAATAAGCTATGGCCTCAGAAAGCTGCTTCTTTTGATGCTGAGGTTAATCCACCGGATGTTGCTATAGTACTTCCCATGGATGGTTTCCATTTGTATCGTTCGCAACTCGATGCTATGGAGGATCCCAAAGAAGCCCATGCGAGAAGAGGGGCTCCGTGGACTTTTGATCCTGCACTACTGCTTAACTGTTTAAAGA
(cupcake) $ head sample_liftoff_tsebra_sqanti_tama1.transcript.fasta
>G1;G1.1;AT1G01010;AT1G01010.1::Chr1_RagTag_p:6982-9251(+)
AAATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACGGTTTAGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTCAACATCTGTAGCTACGATCCTTGGAACTTGCGCTTCCAGTCAAAGTACAAATCGAGAGATGCTATGTGGTACTTCTTCTCTCGTAGAGAAAACAACAAAGGGAATCGACAGAGCAGGACAACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTT
>G1;G1.2;g1;g1.t1::Chr1_RagTag_p:7111-8982(+)
ATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAGCCATCAGCGAGGTCAACATCTGTAGCTACGATCCTTGGAACTTGCGCTTCCAGTCAAAGTACAAATCGAGAGATGCTATGTGGTACTTCTTCTCTCGTAGAGAAAACAACAAAGGGAATCGACAGAGCAGGACAACGGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGGACCAGTGGGGATTTTGTAGTGAGGGCTTTCGTGGTAAGATTGGTCATAAAAGGGTTTTGGTGTTCCTCGATGGAAGATACCCTGACAAAACCAAATCTGATTGGGTTATCCACGAGTTCCACTACGACCTCTTACCAGAACATC
What is the transcript ID for the fasta sequence that has the NNNN's?
And can you show me the bed line corresponding to that transcript model?
It seems like you have models which overlap with N space which is why this is happening. Probably these are coming from your liftoff annotation.
Cheers, Richard
I just checked again where are the Ns in my assembly, and it seems that I don't have any Ns in Chr1. The samples showed below are from Chr1. So I guess it's something else that causes the problem.
>G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1::Chr1_RagTag_p:701912-702221(-) ATGGAGAAGATATCAAATTTGTTAGAAGACAAGCCCGTGGTGATATTCAGCAAGACGTCCTGCTGTATGAGTCACTCGATCAAGTCGCTTATATCTGGTTACGGTGCGAATTCAACAGTGTATGAGCTAGACGAAATGTCTAATGGACCAGAGATCGAACGAGCACTTGTAGAGCTTGGGTGCAAACCGACTGTGCCAGCTGTCTTTATAGGGCAAGAGCTCGTAGGTGGTGCAAATCAACTTATGTCTCTTCAAGTCAGGAACCAACTAGCTTCGTTGCTCCGAAGAGCTGGAGCCATATGGATTTAA >G225;G225.1;AT1G03022;AT1G03022.1::Chr1_RagTag_p:702826-702952(+)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>G226;G226.1;AT1G03030;AT1G03030.1::Chr1_RagTag_p:705183-707570(+)
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGGAGGTTTCTTCTTTCTCCACTGTACCCAGATACTGTAACTCACGCTCCTTTGTTCCAGAATTGTCTCGTTTCAGGGGATTCAAAGTTCACTTATGGGACCAATCTCTTGTTCCTCTTCATTTCAGCATTCGAAAACGAAAGAACACCCCTCGTTATTTGATTTCTTGCAGTCAGAAGAAAGATGTCACTGTTGTTGATGGAAGTTGTATGGATGAGATCTATGATAAGTTGGCGGAACGTCTTGTCCCTACAGCAGCGGCAATGTTTAGCCCCAATCTTAAACGTTTGGTCGGTTTGGCTGGTCCTCCTGGTGCCGGGAAAAGTACAGTAGCAAATGAAGTTGTGCGTCGTGTAAATAAGCTATGGCCTCAGAAAGCTGCTTCTTTTGATGCTGAGGTTAATCCACCGGATGTTGCTATAGTACTTCCCATGGATGGTTTCCATTTGTATCGTTCGCAACTCGATGCTATGGAGGATCCCAAAGAAGCCCATGCGAGAAGAGGGGCTCCGTGGACTTTTGATCCTGCACTACTGCTTAACTGTTTAAAGAAGCTGAAAAACGAGGGCTCAGTTTATGTGCCATCATTTGATCATGGAGTTGGGGATCCAGTTGAAGATGATATCTTTGTTAGCCTCCAGCATAAAGTGGTAATTGTAGAAGGAAACTACATCTTGTTAGAAGAAGGGTCTTGGAAAGACATCTCAGATATGTTTGATGAGAAATGGTTCATCGATGTGAATCTCGATACAGCAATGCAACGTGTTGAAAACCGGCATATCTCAACTGGTAAACCCCCAGATGTTGCTAAATGGCGGGTGGATTATAATGACCGACCCAACGCAGAGCTGATAATAAAGTCAAAGACAAACGCTGATCTACTGATCAGATCTATGAACATTTGAATCTACTTTCAAGGGGGCTATTGGACGAAACAGAATAAGATGAATAACAAAAGACAATGTTGCGCTTCAAGGGAAGATTGCCTTTGGTCTACAATGTCTTTTTCTTGCAACGGAAACTTCCAATAAGCAATTTGTTCGCAGTAAACTGTGTATGCTAGTGTTCTCTGCATACTCACGGATTTTGTTCAGACAATGTGTTTTATACTTCTAAATTAGAAGTTCTAATGCATAAAGTGTTTGAATGCGCGGGAAATTACCTAATTATAATCAGTCCAGCTCCCAACTATATCATTTTG
(cupcake) $ grep "G224.1" sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 701912 702221 G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1 40 - 701912 702221 200,0,255 1 309 0
(cupcake) $ grep "G225.1" sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 702826 702952 G225;G225.1;AT1G03022;AT1G03022.1 40 + 702826 702952 255,200,0 1 126 0
(cupcake) $ grep "G226.1" sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 705183 707570 G226;G226.1;AT1G03030;AT1G03030.1 40 + 705326 707274 255,200,0 12 204,101,44,78,193,31,59,74,87,62,29,383 0,297,498,628,798,1089,1201,1350,1534,1717,1875,2004
If you run the bed to fasta on just the liftoff bed file do you get the same thing?
This doesn't seem like a TAMA issue but rather something weird going on with the matching of the coordinates to the genome assembly. It may be possible that these models are mapping off of the contigs in your assembly? What is the length of "Chr1_RagTag_p"?
If it's only the liftoff bed file, no Ns is found. Chr1 is 32 Mbp.
Are the bed line coordinates the same for the same transcript between the liftoff only and the merge file? Can you show me some comparisons?
Here it is. Use the above G224.1 and G225.1 as examples.
#G224.1 in the liftoff bed file
(cupcake) $ grep "AT1G03020.1" sample_liftoff_Namecorrected.gff3_polished.bed
Chr1_RagTag_p 701912 702221 AT1G03020;AT1G03020.1;AT1G03020.1;mRNA 40 - 701912 702221 255,0,0 1 309 0
(cupcake) $ grep "G224.1" sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 701912 702221 G224;G224.1;AT1G03020;AT1G03020.1;g203;g203.t1 40 - 701912 702221 200,0,255 1 309 0
#G225.1 in the liftoff bed file
(cupcake) $ grep "AT1G03022.1" sample_liftoff_Namecorrected.gff3_polished.bed
Chr1_RagTag_p 702826 702952 AT1G03022;AT1G03022.1;AT1G03022.1;mRNA 40 + 702826 702952 255,0,0 1 126 0
(cupcake) $ grep "G225.1" sample_liftoff_tsebra_sqanti_tama1.bed
Chr1_RagTag_p 702826 702952 G225;G225.1;AT1G03022;AT1G03022.1 40 + 702826 702952 255,200,0 1 126 0
Ok so the coordinates are not changing at all. This means the error is coming from somewhere else.
You can see that the models haven't changed so this has nothing to do with TAMA Merge.
Where there any differences between the bed to fasta runs? Did you do them all at the same time?
What happens if you re-run all the bed to fasta in a new folder and make sure to use new instances of the genome assembly?
I will try to help you a bit but this is clearly something weird with the bed to fasta runs and not the TAMA Merge output (at least from the information I am seeing in this thread).
Thanks, Richard! I get your point, and this is also what confused me. The environment, the bedtools getfasta
options, the input genome, and the working directory are exactly the same for all of above runs mentioned. As you suggested, I also re-run all the bed to fasta in a new folder, I get the same fasta files with Ns...
At the beginning I thought it's probably related to -d merge_dup
, as it only happened when I have to use this option. But you're right that all the intervals are the same as in the original bed file, and this is what bedtools only considers when extracting sequences. I guess it's better I don't occupy you for this matter. I will try to do some tests around and see what the problem is. Thanks a lot!
Ok I am closing this thread now but feel free to start a new now if you have other issues.
Hi Richard, sorry it's me again. I'm trying to collect all the transcripts from short-read prediction, long-read transcripts and the reference (the liftoff gff3 file). But when I input three files, tama_merge did not finished and only outputted 35 genes from chr1. But then I tried to use only two files (the first two in below
bed_lst.txt
file), and tama_merge completed the work. So I'm wondering whether it's the setting of my bed_list is not right, or tama_merge can't deal with merge fo three files.Here is my code for the three merging files