gatech-genemark / ProtHint

Protein hint generation pipeline for gene finding in eukaryotic genomes

Protein database preparation #37

Closed minhasbushra closed 2 years ago

minhasbushra commented 2 years ago

Hi,

With the newer OrthoDB version 10.1, the command "wget https://v100.orthodb.org/download/odb10_vertebrata_fasta.tar.gz" is not working for preparing the protein database. I have tried a modified link, but it is not working either. Any suggestions?

Thanks Bushra

tomasbruna commented 2 years ago

Hi Bushra,

do you mean that the command

wget https://v100.orthodb.org/download/odb10_vertebrata_fasta.tar.gz

is not working or that it is not working when you modify 10 to 10.1?

I've checked the difference between OrthoDB 10 and 10.1 before and my conclusion was that the eukaryotic protein sequences are identical between v10 and v10.1 (prokaryotic proteins did change). Some of the eukaryotic proteins are formatted differently in v10.1, but this should not affect ProtHint. For this reason, it should be fine to keep using the v10 link. Obviously, I could be wrong in my analysis, so please double-check.

Best, Tomas
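For reference, the v10 download and database preparation discussed above can be sketched as two small shell functions. This is a minimal sketch, not the project's official procedure: the function names are made up for illustration, and the `vertebrate/Rawdata` directory layout inside the archive is an assumption that should be checked after unpacking.

```shell
#!/usr/bin/env bash

# OrthoDB v10 link from this thread (vertebrate clade).
ODB_URL="https://v100.orthodb.org/download/odb10_vertebrata_fasta.tar.gz"

# Download and unpack the archive into the current directory.
# (Hypothetical helper name.)
fetch_orthodb() {
    wget "$ODB_URL" || return 1
    tar xzf "$(basename "$ODB_URL")"
}

# Concatenate every per-species FASTA file in a directory into one file.
#   $1: directory with the unpacked FASTA files
#       (assumed to be vertebrate/Rawdata; verify after unpacking)
#   $2: output FASTA file; must be outside $1
concat_fasta() {
    local dir="$1" out="$2"
    cat "$dir"/* > "$out"
}
```

Possible usage, under the same assumptions: `fetch_orthodb && concat_fasta vertebrate/Rawdata proteins.fasta`.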

minhasbushra commented 2 years ago

Thanks for your reply. The command that I wrote worked, but it did not work with the new version 10.1. I also have a question: I am working on fish, so I am taking "Vertebrata" from OrthoDB. Would it be good to add additional close relative fish protein sequences along with the orthoDB vertebrate? (The closely related species is also in the OrthoDB list.)

Thanks

tomasbruna commented 2 years ago

> Would it be good to add additional close relative fish protein sequences along with the orthoDB vertebrate?

Yes, that would make sense, if they are not already covered by OrthoDB.
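One way to combine extra proteomes with the OrthoDB set while skipping proteins that are already present is sketched below. It is only an illustration: the function name and file names are hypothetical, and "already covered" is interpreted narrowly as an exact, case-insensitive sequence match. It does not check whether a species is represented in OrthoDB under slightly different sequences.

```shell
#!/usr/bin/env bash

# Merge FASTA files, keeping only the first record for each distinct sequence.
# Sequences are written unwrapped (one line per record) and in upper case.
# Usage: merge_unique_fasta out.fasta in1.fasta [in2.fasta ...]
# (Hypothetical helper name.)
merge_unique_fasta() {
    local out="$1"
    shift
    awk '
        # Print the buffered record unless its sequence was seen before.
        function emit() {
            if (header != "" && seq != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { emit() }
    ' "$@" > "$out"
}
```

Listing the OrthoDB file first, for example `merge_unique_fasta proteins_merged.fasta proteins.fasta extra_fish.fasta`, keeps the OrthoDB headers for any shared sequences, since the first occurrence wins.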

tomasbruna commented 2 years ago

I just saw a paper that used BRAKER2 and a similar strategy of protein preparation for a fish annotation (with good results):

For protein evidence, manually annotated and reviewed protein records from UniProtKB/Swiss-Prot (UniProt Consortium, 2021) as of January 11, 2021 (563,972 sequences) in addition to the proteomes of the false clownfish (A. ocellaris: 48,668), zebrafish (Danio rerio: 88,631), spiny chromis damselfish (Acanthochromis polyacanthus: 36,648), Nile tilapia (Oreochromis niloticus: 63,760), Japanese rice fish (Oryzias latipes: 47,623), rainbow fish (Poecilia reticulata: 45,692), bicolor damselfish (Stegastes partitus: 31,760), tiger puffer (Takifugu rubripes: 49,529), and Atlantic salmon (Salmo salar: 112,302) from the NCBI protein database (https://www.ncbi.nlm.nih.gov/protein) were used.

https://www.biorxiv.org/content/10.1101/2022.01.16.476524v1

They used UniProtKB/Swiss-Prot as the large protein source, but OrthoDB should work just as well, if not better.

Tomas

minhasbushra commented 2 years ago

Thanks !!