Closed by spoonbender76 9 months ago
The stars are in-frame stop codons. I never checked whether there are any in OrthoDB...
I recommend removing the affected entries from the fasta file.
When I find the time, I will do this systematically for all the ODB11 files that we provide. It might take a while until I get to it.
It shouldn't cause harm in the output quality if BRAKER still runs, despite the warning. It's more a "not so pretty" issue.
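For anyone who wants to remove the affected entries right away, here is a minimal sketch. It assumes a plain protein FASTA file, treats a single trailing `*` as a harmless terminal stop, and writes each kept sequence on one line. The function name `filter_internal_stops` and the output file name in the usage note are made up for illustration; this is not the script used to build the provided ODB files.

```shell
# Drop every FASTA record whose sequence contains '*' before its last residue.
# Assumption: one trailing '*' (terminal stop) is acceptable and is kept.
filter_internal_stops() {
    awk '
        function emit() {
            if (header == "") return
            body = seq
            sub(/\*$/, "", body)            # ignore a single terminal stop
            if (index(body, "*") == 0) {    # no in-frame stop remains
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped lines
        END { emit() }                                 # handle last record
    ' "$1"
}
```

A possible call would be `filter_internal_stops Arthropoda.fa > Arthropoda.no_internal_stops.fa`, after which the filtered file could be passed to `--prot_seq`.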
This should be easy to add to https://github.com/tomasbruna/orthodb-clades. I will take a look when I find the time.
I fixed it in the current ODB clade files that we provide.
I downloaded the ODB data from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz today, 29 Feb 2024. This file still has in-frame '*' characters.
cat Arthropoda.fa | grep "*" -B 1 | head
>1507135_0:00000e
NLLINEKQIRCKSYSRFLDYSKVPKLQENDLEEQHVKGSGPGGQATNKTCNAVVLKHKPTGLIVKCHETRSLFQNRKIARETLLKKK*LLRYNKACLHY
>1507135_0:000034
MMASTNEFGPDSGGRVKGVTIVKPIIYGNVARYFGKKREEDGHTHQWTVYVKPYHNEDMST*VKKVHFKLHESYNNPNRIMTKPPYELTETGWGEFEIVIKIYFHDPNERPVTIYHILKLFQTTPEIQLGKKSLVSEFYEEIVFQDPTALMQHLLNSSRPITLGAWRHNTDFEAKKESTMKAVIEARNKIRVEVIDLKEKLTLAKETIAKFKDEIAKISKAGGSLSVA
cat Arthropoda.fa | grep "*" -B 1 | grep ">" | wc -l
1414
I found 1414 sequences in Arthropoda.fa that contain an in-frame stop codon. Does the '*' need to be fixed in some way, or can I still use the file for BRAKER3 analyses?
Thanks,
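The grep pipeline above relies on each sequence occupying exactly one line, and it also counts a `*` at the very end of a sequence. A hedged alternative that joins wrapped sequence lines and counts only records with a `*` before the final residue could look like this (`count_internal_stops` is a hypothetical helper name, not part of BRAKER):

```shell
# Count FASTA records that contain '*' anywhere except as the last character.
# Assumption: a single trailing '*' is a terminal stop and is not counted.
count_internal_stops() {
    awk '
        function tally() {
            if (header == "") return
            sub(/\*$/, "", seq)              # drop one terminal stop, if any
            if (index(seq, "*") > 0) n++     # in-frame stop found
        }
        /^>/ { tally(); header = $0; seq = ""; next }
        { seq = seq $0 }                     # join wrapped sequence lines
        END { tally(); print n + 0 }         # n + 0 prints 0 when nothing matched
    ' "$1"
}
```

It would be called as `count_internal_stops Arthropoda.fa` and prints a single number.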
You can just use them. What I fixed was not the stop codons: I removed the whitespace from the headers, which also caused issues.
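How the headers were cleaned is not described in this thread. As an illustration only, one way to remove whitespace from FASTA headers is to trim trailing blanks and replace the remaining runs of spaces or tabs with underscores. The underscore replacement and the name `strip_header_whitespace` are assumptions, not the maintainers' method.

```shell
# Remove whitespace from FASTA header lines; sequence lines pass through unchanged.
# Assumption: replacing blanks with '_' is acceptable for downstream tools.
strip_header_whitespace() {
    awk '
        /^>/ {
            sub(/[ \t]+$/, "")      # trim trailing blanks in the header
            gsub(/[ \t]+/, "_")     # replace inner runs of blanks
        }
        { print }
    ' "$1"
}
```

For example, a header line `>id 1	foo ` would come out as `>id_1_foo`.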
Thanks for the clarification.
Hi,
I used Arthropoda.fa (from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz) for BRAKER (--prot_seq=Arthropoda.fa) and got the following warnings: "Assuming this is not a DNA fasta file" and "Assuming this is not a protein fasta file".
I checked Arthropoda.fa and found 1414 protein sequences that have a * in the middle of them. Do these sequences have a negative impact on gene prediction? If so, should I remove the full sequences or just remove the *?
I tried running with these sequences excluded and still got "Assuming this is not a DNA fasta file" and "Assuming this is not a protein fasta file".
seqs_with_stop_codons_in_the_middle.fasta.gz