Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

* in Arthropoda.fa sequences and "Assuming this is not a protein fasta file" #721

Closed spoonbender76 closed 9 months ago

spoonbender76 commented 11 months ago

Hi,

I used Arthropoda.fa (from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz) as the protein input for BRAKER (--prot_seq=Arthropoda.fa) and got the following warnings: "Assuming this is not a DNA fasta file" and "Assuming this is not a protein fasta file".

I checked Arthropoda.fa and found 1414 protein sequences that have a * in the middle of them. Do these sequences have a negative impact on gene prediction? If so, should I remove the full sequences or just remove the *?

I tried running with these sequences excluded and still got "Assuming this is not a DNA fasta file" and "Assuming this is not a protein fasta file".

seqs_with_stop_codons_in_the_middle.fasta.gz

KatharinaHoff commented 11 months ago

The stars are in-frame stop codons. I never checked whether there are any in OrthoDB...

I recommend removing these entries from the fasta file.

When I find the time, I will do this systematically for all the ODB11 files that we provide. It might take a while until I get to it.

It shouldn't cause harm in the output quality if BRAKER still runs, despite the warning. It's more a "not so pretty" issue.
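
For anyone who wants to do the removal locally in the meantime, here is a minimal sketch of dropping whole entries that contain an in-frame '*'. It is not part of BRAKER; the function names (read_fasta, has_internal_stop, drop_internal_stops) are made up for illustration, and it assumes that a trailing '*' marks a normal end of translation and should be kept.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_internal_stop(seq: str) -> bool:
    """True if '*' occurs anywhere other than at the very end of the sequence."""
    # Assumption: trailing '*' characters are ordinary terminal stops.
    return "*" in seq.rstrip("*")


def drop_internal_stops(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep only the records without an in-frame '*'."""
    for header, seq in read_fasta(lines):
        if not has_internal_stop(seq):
            yield header, seq


def filter_file(in_path: str, out_path: str) -> None:
    """Write the surviving records to out_path, one sequence line per record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, seq in drop_internal_stops(src):
            dst.write(f"{header}\n{seq}\n")
```

Calling filter_file("Arthropoda.fa", "Arthropoda.no_internal_stops.fa") would write a filtered copy; the output file name is only an example.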

tomasbruna commented 11 months ago

This should be easy to add to https://github.com/tomasbruna/orthodb-clades. I will take a look when I find the time.

KatharinaHoff commented 9 months ago

I fixed it in the current ODB clade files that we provide.

sahoo-rk commented 8 months ago

I downloaded the ODB data from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz today, 29 Feb 2024. This file still has in-frame '*' characters.

cat Arthropoda.fa | grep "*" -B 1 | head

>1507135_0:00000e
NLLINEKQIRCKSYSRFLDYSKVPKLQENDLEEQHVKGSGPGGQATNKTCNAVVLKHKPTGLIVKCHETRSLFQNRKIARETLLKKK*LLRYNKACLHY
>1507135_0:000034
MMASTNEFGPDSGGRVKGVTIVKPIIYGNVARYFGKKREEDGHTHQWTVYVKPYHNEDMST*VKKVHFKLHESYNNPNRIMTKPPYELTETGWGEFEIVIKIYFHDPNERPVTIYHILKLFQTTPEIQLGKKSLVSEFYEEIVFQDPTALMQHLLNSSRPITLGAWRHNTDFEAKKESTMKAVIEARNKIRVEVIDLKEKLTLAKETIAKFKDEIAKISKAGGSLSVA

cat Arthropoda.fa | grep "*" -B 1 | grep ">" | wc -l
1414
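
A note on the count above: the grep pipeline matches any sequence line containing '*', so it would also count a '*' at the very end of a sequence, and -B 1 only finds the header if each sequence sits on a single line. A rough Python sketch that separates internal from terminal-only stops and tolerates wrapped sequences could look like this (summarize_stops is a hypothetical helper name, not a BRAKER tool):

```python
from typing import Dict, Iterable


def summarize_stops(lines: Iterable[str]) -> Dict[str, int]:
    """Count records with an internal '*' and records with only a trailing '*'."""
    counts = {"total": 0, "internal": 0, "terminal_only": 0}
    seq_parts = None  # stays None until the first header line is seen

    def tally(seq: str) -> None:
        counts["total"] += 1
        if "*" in seq.rstrip("*"):
            counts["internal"] += 1       # '*' somewhere before the end
        elif seq.endswith("*"):
            counts["terminal_only"] += 1  # '*' only at the end

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_parts is not None:
                tally("".join(seq_parts))
            seq_parts = []
        elif seq_parts is not None and line:
            seq_parts.append(line)
    if seq_parts is not None:
        tally("".join(seq_parts))
    return counts
```

It can be run on an open file handle, e.g. summarize_stops(open("Arthropoda.fa")).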

I see 1414 sequences in Arthropoda.fa that contain the stop codon. Does the '*' have to be fixed in some way, or can I still use the file for BRAKER3 analyses?

Thanks,

KatharinaHoff commented 8 months ago

You can just use them. What I fixed was not the stop codons. I removed the whitespaces from headers, which also caused issues.
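
To check a custom protein file for the header whitespace problem mentioned above, something like the following sketch may help. It only illustrates the idea and is not the procedure used for the provided ODB files; the helper names and the choice of '_' as a replacement character are assumptions.

```python
import re
from typing import Iterable, Iterator, List


def headers_with_whitespace(lines: Iterable[str]) -> List[str]:
    """Return the FASTA header lines that contain whitespace inside them."""
    hits = []
    for line in lines:
        header = line.strip()
        if header.startswith(">") and re.search(r"\s", header):
            hits.append(header)
    return hits


def sanitize_headers(lines: Iterable[str], replacement: str = "_") -> Iterator[str]:
    """Yield the lines with whitespace runs in headers replaced.

    Assumption: collapsing whitespace to '_' is acceptable for your IDs.
    Sequence lines are passed through unchanged (minus the line ending).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield ">" + re.sub(r"\s+", replacement, line[1:].strip())
        else:
            yield line
```

Running headers_with_whitespace first shows whether any rewriting is needed at all.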

sahoo-rk commented 8 months ago

Thanks for the clarification.