B-UMMI / chewBBACA

BSR-Based Allele Calling Algorithm
GNU General Public License v3.0

UniprotFinder: running chewBBACA without internet access #121

Open fedex88 opened 2 years ago

fedex88 commented 2 years ago

Hello @rfm-targa,

While running chewBBACA 2.8.5 (installed via pip install --user) on a cluster whose compute nodes have no internet access, I got the following error:

chewBBACA version: 2.8.5
Authors: Mickael Silva, Pedro Cerqueira, Rafael Mamede
Github: https://github.com/B-UMMI/chewBBACA
Wiki: https://github.com/B-UMMI/chewBBACA/wiki
Tutorial: https://github.com/B-UMMI/chewBBACA_tutorial
Contacts: imm-bioinfo@medicina.ulisboa.pt

=============================
  chewBBACA - UniprotFinder
=============================
Started at: 2022-03-09T09:59:47

Schema: scheme_104/schema_seed
Number of loci: 17686
Translating representative sequences...done.
Downloading list of reference proteomes...<urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
<urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
<urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
<urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
done.

Would it be possible to use local files instead of downloaded ones (maybe through a specific option that points to them)? Also, which list of reference proteomes from UniProt is required?

Thank you so much for your help

Best, Federica Palma

rfm-targa commented 2 years ago

Hello @fedex88,

The UniprotFinder process needs internet access to download the reference proteomes for the taxon or taxa passed to the --taxa parameter. It downloads the README file at ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ and selects the reference proteomes whose Species Name contains any of the terms passed to the --taxa parameter. The process also sends requests to UniProt's SPARQL endpoint to get annotation terms based on exact protein matches.

We can add an option to run it in offline mode, where it accepts local files to annotate the loci. The offline mode would have to skip both the annotation based on reference proteomes and the exact-match queries to the SPARQL endpoint, so it would be a simple BLAST against a set of local FASTA files (or GenBank files). It is not difficult to implement and would be important for use cases such as yours.

I will add it to the list of functionalities that we have to implement, but for now the process can only be run with internet access. I will update this issue when it has been implemented. Any suggestions on how the offline mode should work are welcome.

Best,

Rafael
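
For readers following the thread, the proteome selection step described in the reply (keep the README entries whose Species Name contains any of the --taxa terms) can be sketched roughly as below. This is a minimal illustration and not chewBBACA's actual code. It assumes the README contains a tab-separated table whose header row starts with "Proteome_ID" and includes a "Species Name" column. The function name `select_proteomes`, the case-insensitive matching, and the sample rows (with made-up identifiers) are all hypothetical.

```python
def select_proteomes(readme_text, taxa):
    """Return table rows whose species name contains any of the taxa terms.

    Assumption: the table header starts with 'Proteome_ID' and has a
    'Species Name' column; fields are tab-separated.
    """
    lines = readme_text.splitlines()
    # Find the header row of the proteome table, if there is one.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("Proteome_ID")),
        None,
    )
    if header_idx is None:
        return []
    header = lines[header_idx].split("\t")
    name_col = header.index("Species Name")

    # Assumption: matching is case-insensitive substring matching.
    terms = [term.lower() for term in taxa]
    selected = []
    for line in lines[header_idx + 1:]:
        fields = line.split("\t")
        if len(fields) <= name_col:
            continue  # skip blank lines and text that is not a table row
        species = fields[name_col].lower()
        if any(term in species for term in terms):
            selected.append(dict(zip(header, fields)))
    return selected


# Made-up sample data that only mimics the assumed layout.
SAMPLE = "\n".join([
    "Introductory text that is not part of the table.",
    "Proteome_ID\tTax_ID\tSpecies Name",
    "UP_EXAMPLE_1\t0001\tStreptococcus agalactiae (example strain)",
    "UP_EXAMPLE_2\t0002\tEscherichia coli (example strain)",
    "",
    "UP_EXAMPLE_3\t0003\tStreptococcus pyogenes (example strain)",
])

matches = select_proteomes(SAMPLE, ["streptococcus agalactiae"])
for row in matches:
    print(row["Proteome_ID"], row["Species Name"])
```

In an offline setting, the same filter could be applied to a copy of the README fetched on a machine that does have internet access, although the process as described in the reply would still need the proteome files themselves and the SPARQL queries.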