HobnobMancer / cazy_webscraper

Web scraper to retrieve protein data catalogued by the CAZy, UniProt, NCBI, GTDB and PDB websites/databases.
https://hobnobmancer.github.io/cazy_webscraper/
MIT License

Failing to retrieve UniProt data #100

Closed HobnobMancer closed 1 year ago

HobnobMancer commented 1 year ago

Describe the bug

When using cw_get_uniprot_data to retrieve data from UniProt, no data is retrieved or added to the local CAZyme database.

To Reproduce

  1. Build a local CAZyme database: cazy_webscraper <email> -o cazy.db
  2. cw_get_uniprot_data cazy.db --families 20 --pdb
    Built output directory: .cazy_webscraper_2022-11-18_20-03-08/uniprot_data_retrieval
    Using default CAZy class synonyms
    Retrieving GenBank accessions for selected CAZy classes: 0it [00:00, ?it/s]
    Applying CAZy family filter(s)
    Retrieving GenBank accessions for selected CAZy families:   0%|                                                           | 0/1 [00:00<?, ?it/s]Retrieving CAZymes for CAZy family PL20
    Retrieving GenBank accessions for selected CAZy families: 100%|███████████████████████████████████████████████████| 1/1 [00:00<00:00, 20.02it/s]
    Applying no taxonomic filters
    Retrieving UniProt data for 76
    Batch retrieving UniProt IDs: 11it [00:00, 15.03it/s]                                                                                           
    Batch retrieving protein data from UniProt: 0it [00:00, ?it/s]
    Adding data to the local CAZyme database
    Retrieving existing UniProt records from db: 0it [00:00, ?it/s]
    Separating new and existing records: 0it [00:00, ?it/s]
    Loading existing PDB db records: 0it [00:00, ?it/s]
    Identifying new PDBs to add to db: 0it [00:00, ?it/s]
    Loading existing Genbank_Pdbs db records: 0it [00:00, ?it/s]
    Identifying new protein-PDB relationships to add to db: 0it [00:00, ?it/s]

    No data is retrieved from UniProt.

Expected behavior

Data is retrieved from UniProt and added to the local CAZyme database.

mherold1 commented 1 year ago

Hi, while testing the tool I ran into problems at this step. After some searching, I noticed that the requests made via get_uniprot_accessions() from https://github.com/HobnobMancer/saintBioutils/blob/master/saintBioutils/uniprot/__init__.py were failing. Apparently the UniProt API has recently changed. Maybe this is helpful for replacing the queries: https://github.com/multimeric/Unipressed

HobnobMancer commented 1 year ago

Hi,

Thanks for using cazy_webscraper!

I found the cause of the issue a couple of weeks back. It wasn't saintBioutils; the minimum required version of bioservices needed to be updated. I forgot to document this here, so my bad!

The Fix

If you install the latest version of bioservices then cazy_webscraper will be able to communicate with the new UniProt API.

The required bioservices version will be updated shortly.
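
To confirm which bioservices version is actually installed in the environment that runs cazy_webscraper, something like the following minimal sketch may help. It uses only the Python standard library. The helper name `installed_version` is hypothetical and not part of cazy_webscraper, and no minimum version is checked because this thread does not state one.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # `installed_version` is an illustrative helper, not a cazy_webscraper function.
    found = installed_version("bioservices")
    if found is None:
        print("bioservices is not installed in this environment")
    else:
        print(f"bioservices {found} is installed")
```

If the reported version is older than expected, upgrading is typically done with `pip install --upgrade bioservices`.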

In the next couple of weeks, we will also be altering how cazy_webscraper links NCBI protein version accessions to their corresponding records in UniProt. A more robust method for identifying related records (i.e. linking an NCBI protein record to its corresponding UniProt record) is planned to be available in 2.2.4.

mherold1 commented 1 year ago

Thanks for the quick response. Are you sure that the issue is related to the bioservices version? I had 1.10.4 (the latest?) installed. I went through the script https://github.com/HobnobMancer/cazy_webscraper/blob/master/cazy_webscraper/expand/uniprot/get_uniprot_data.py, and for me it fails at the EMBL-to-UniProt accession mapping step, which goes through saintBioutils. The returned uniprot_gkb_dict is empty in: https://github.com/HobnobMancer/cazy_webscraper/blob/master/cazy_webscraper/expand/uniprot/get_uniprot_data.py#L180 When testing the other script, it was failing at the request (or L94): https://github.com/HobnobMancer/saintBioutils/blob/master/saintBioutils/uniprot/__init__.py#L98
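
For anyone debugging the same symptom, a guard like the one below makes an empty mapping result visible instead of letting the later steps run silently on zero records. This is only a sketch: the function name `mapping_is_usable` is hypothetical, and the shape of `uniprot_gkb_dict` (a plain dict with one entry per mapped accession) is an assumption, not taken from the cazy_webscraper source.

```python
import logging

logger = logging.getLogger(__name__)


def mapping_is_usable(uniprot_gkb_dict: dict, n_requested: int) -> bool:
    """Return True if the accession mapping holds at least one entry.

    Hypothetical helper: logs a warning when the mapping is empty, which
    may indicate that the request to UniProt failed.
    """
    if not uniprot_gkb_dict:
        logger.warning(
            "No UniProt accessions were mapped for %d GenBank accessions; "
            "the mapping request may have failed",
            n_requested,
        )
        return False
    logger.info(
        "Mapped %d entries for %d GenBank accessions",
        len(uniprot_gkb_dict),
        n_requested,
    )
    return True
```

Calling it with an empty dict, as in the run reported above (76 accessions requested, nothing mapped), returns False and logs the warning.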

HobnobMancer commented 1 year ago

This issue should now be fixed in v2.2.3 - PR #103