gatech-genemark / ProtHint

Protein hint generation pipeline for gene finding in eukaryotic genomes
Other
56 stars 13 forks source link

Preparation of database from OrthoDB #39

Closed dariober closed 1 year ago

dariober commented 2 years ago

Hello - My question is similar to the #37. I'm using ProtHint via braker2 and I wanted to ask about the preparation of the protein database starting from OrthoDB. The documentation says:

We recommend to use a relevant portion of OrthoDB protein database as the source of reference protein sequences.

Could you explain how to extract the relevant portion? For example, if I wanted to extract the protozoa proteins, how should I proceed? (Apologies if this question is more suited to OrthoDB).

More in general, how broad or specific should the database be? For example, if I wanted to annotate Plasmodium, would it be better to use all the protozoa proteins or better be more specific? Thanks!

tomasbruna commented 2 years ago

Hi @dariober,

take a look at this section in the readme: https://github.com/gatech-genemark/ProtHint#protein-database-preparation

So in your case, you'd want to use

wget https://v100.orthodb.org/download/odb10_protozoa_fasta.tar.gz

More in general, how broad or specific should the database be? For example, if I wanted to annotate Plasmodium, would it be better to use all the protozoa proteins or better be more specific? Thanks!

In this case, I'd select just proteins from the Apicomplexa phylum. Adding more would probably just increase runtime and maybe add a little noise. However, if you don't want to go through the filtering effort, using all protozoa proteins will work fine as well.

Best, Tomas

dariober commented 2 years ago

@tomasbruna Thanks for getting back to me. I saw that the protozoa database is available for download. My question is how to create it in the first place. In fact, if I wanted to parse OrthoDB to get the Apicomplexa proteins, how would I proceed?

tomasbruna commented 2 years ago

To prepare the full protozoa database, you just concatenate all fasta files into one big file, like so:

cat protozoa/Rawdata/* > proteins.fasta

That's the "database" you give to BRAKER2.

if I wanted to parse OrthoDB to get the Apicomplexa proteins, how would I proceed

In this case, you would only concatenate protein files of Apicomplexa species. There is one file per species in the protozoa/Rawdata/ folder and the files are named by the species ID. The species name to species ID mapping is available here.

dariober commented 2 years ago

Ok, I think I haven't made myself clear but I think I get it... I understand that the database is a fasta file of proteins related to the genome to be annotated. What wasn't clear to me is how you can compile such file in a relatively straightforward way. Your comment:

The species name to species ID mapping is available here.

tells me that one needs to collect all the taxonomy IDs for the species of interest (e.g. all the Apicomplexa) and then query the file odb10v1_all_fasta.tab.gz to get those species (by the way, I didn't realize that the OrthoDB files are named after the taxonomy ID). If that's the way to go (or at least a way to go) that's fine, I was wondering if there was a more immediate solution. Thanks-!

tomasbruna commented 2 years ago

Yes, that's the way to go. The only tedious step is to collect species names/IDs of all Apicomplexa in OrthoDB. You can use OrthoDB's advanced search for that. It will give you a hierarchy looking like this: image

dariober commented 2 years ago

If any useful, here are a few commands to query the data files from OrthoDB to extract the protein sequences belonging to a given taxonomic group. For example, we want all the proteins under the "Apicomplexa" group.

First query odb10v1_level2species.tab.gz to get the taxonomy ID of Apicomplexa:

GROUP="Apicomplexa"
level=`zcat odb10v1_levels.tab.gz | awk -v grp=$GROUP '$2 == grp'`
taxid=`echo $level | cut -d ' ' -f 1`

Then query odb10v1_level2species.tab.gz to get all the species under the Apicomplexa group. The awk command may be more convoluted than necessary:

species=`zcat odb10v1_level2species.tab.gz \
| awk -v taxid=$taxid '$4 ~ "{"$taxid"}" || 
                       $4 ~ "{"taxid"," || 
                       $4 ~ ","taxid"," || 
                       $4 ~ ","taxid"}" {print $2}'`

You may want to check that the number of species retrieved is the same as one reported in odb10v1_levels.tab.gz, 54 in this case:

echo $species | wc -w
echo $level

Prepare a regex of species IDs and query the (big) file odb10v1_all_fasta.tab. Conveniently, odb10v1_all_fasta.tab has sequences on a single line instead of being wrapped so we can use grep -A 1 ... (you may want to check this format doesn't change in the future).

regex=`echo $species | tr ' ' '|'`
grep --no-group-separator -A 1 -P -w "$regex" odb10v1_all_fasta.tab > "$GROUP.fasta"

(Not fully tested - suggestions welcome)

reubwn commented 2 years ago

@dariober thanks this was very helpful!

tomasbruna commented 1 year ago

This should now be way easier thanks to https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/ and https://github.com/tomasbruna/orthodb-clades

Xacfran commented 1 year ago

I used the links suggested by Thomas Bruna to download the vertebrate database and subset a fasta file for mammals only. This worked as a charm for me. Thanks a lot for the code @dariober !