NPLinker / nplinker-webapp

Apache License 2.0

KeyError: 'cluster' when running local dataset using docker image on windows. #13

Closed ialas closed 1 year ago

ialas commented 1 year ago

I've observed the error "KeyError('cluster')" when running the Docker image of nplinker-webapp on a local dataset. I've reproduced the error in both Windows Subsystem for Linux and Windows.

Specifics: Upon running docker run --name webapp -p 5006:5006 -v your_shared_folder:/data:rw nlesc/nplinker with a local data folder set up following the information written here, and "your_shared_folder" replaced with the path to my shared folder, I receive the error "Error in server loaded hook KeyError('cluster')".

Attached is the output of the run, including the bigscape run. Here is the output of specifically the error:

19:01:55 [INFO] runbigscape.py:63, BiG-SCAPE completed with return code 0
19:01:55 [INFO] loader.py:95, Trying to discover correct bigscape directory under /data/dataset_1/bigscape
19:01:55 [INFO] loader.py:99, Found network files directory: /data/dataset_1/bigscape/network_files/2023-01-02_18-17-10_hybrids_glocal
19:01:55 [INFO] genomics.py:538, Found 1817 MiBIG json files
2023-01-02 19:01:58,633 Error in server loaded hook KeyError('cluster')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bokeh/server/contexts.py", line 193, in run_load_hook
    self._application.on_server_loaded(self.server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/application.py", line 209, in on_server_loaded
    h.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/directory.py", line 262, in on_server_loaded
    return self._lifecycle_handler.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/lifecycle.py", line 92, in on_server_loaded
    return self._on_server_loaded(server_context)
  File "/app/nplinker/server_lifecycle.py", line 62, in on_server_loaded
    nh.load()
  File "/app/nplinker/server_lifecycle.py", line 20, in load
    self.load_nplinker()
  File "/app/nplinker/server_lifecycle.py", line 43, in load_nplinker
    if not self.nplinker.load_data():
  File "/usr/local/lib/python3.10/site-packages/nplinker/nplinker.py", line 272, in load_data
    if not self._loader.load(met_only=met_only):
  File "/usr/local/lib/python3.10/site-packages/nplinker/loader.py", line 337, in load
    if not met_only and not self._load_genomics():
  File "/usr/local/lib/python3.10/site-packages/nplinker/loader.py", line 599, in _load_genomics
    self.mibig_bgc_dict = make_mibig_bgc_dict(self.strains,
  File "/usr/local/lib/python3.10/site-packages/nplinker/genomics.py", line 575, in make_mibig_bgc_dict
    accession, biosyn_class = extract_mibig_json_data(data)
  File "/usr/local/lib/python3.10/site-packages/nplinker/genomics.py", line 551, in extract_mibig_json_data
    accession = data['cluster']['mibig_accession']
KeyError: 'cluster'

The full run output text can be seen here: nplinkerError.txt

Additional information:

Windows version: 10.0, Build 19044, Windows Education

Windows Subsystem for Linux:

Docker version 20.10.17, build 100c701

GNPS was run using METABOLOMICS-SNETS-v2 version_release_30

The setup of dataset_1 is seen here: image

antiSMASH files were generated using a local installation (v6.0.0). The antiSMASH folder contains only the *.gbk files for every strain, with the prefix indicating the strain (e.g. a2056.tig0000001.region001.gbk).

The docker image pulled was sha256:4092f31590227942823b785627aa4d9488a546fe73fe6bc52f07dd22d0ef2fb6

andrewramsay commented 1 year ago

Looks like it's having a problem parsing the MiBIG JSON files from the mibig_json folder. I can see from the screenshot you're using v1.4, which should be fine; that version has always been supported. The JSON structure changed between v1.x and v2.x, and NPLinker should handle both. According to the error message, though, it's not finding the expected JSON attribute for v1.x, so it tries the v2.x one and doesn't find that either.
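To illustrate the fallback described above, here is a minimal sketch of a version-tolerant lookup that reports the keys it actually found instead of raising a bare KeyError. This is not NPLinker's own code: the function name extract_mibig_accession is hypothetical, the only layout confirmed by the traceback is data['cluster']['mibig_accession'], and the v1.x key "general_params" and the field "biosyn_class" are assumptions.

```python
def extract_mibig_accession(data):
    """Return (accession, biosyn_class) from a parsed MiBIG JSON document.

    "cluster" is the key seen in the traceback; "general_params" is an
    ASSUMED name for the v1.x layout and may not match the real files.
    """
    for top_key in ("general_params", "cluster"):
        section = data.get(top_key)
        if isinstance(section, dict) and "mibig_accession" in section:
            # "biosyn_class" is also an assumed field name; None if absent.
            return section["mibig_accession"], section.get("biosyn_class")
    # Neither layout matched: say what was found rather than KeyError('cluster').
    raise ValueError(
        f"Unrecognised MiBIG JSON layout; top-level keys: {sorted(data)}"
    )
```

With a lookup like this, a file in an unknown layout would fail with a message listing its top-level keys, which makes it easier to tell which file format is involved.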

If I download the current v1.4 database, it seems to have 1816 .json files but your output shows 1817:

19:01:55 [INFO] genomics.py:538, Found 1817 MiBIG json files

Might be worth checking if there's a stray file in there from a different source? Or you could try deleting the folder and let NPLinker create it again.
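One way to look for a stray file is to scan the folder and list every .json file that has none of the expected top-level keys. The sketch below is only an illustration: find_unrecognised_json is a hypothetical helper, "cluster" comes from the traceback, and "general_params" is an assumed v1.x key that you may need to adjust.

```python
import json
from pathlib import Path


def find_unrecognised_json(folder, expected_keys=("general_params", "cluster")):
    """List (file name, reason) for JSON files lacking every expected key.

    expected_keys is an assumption: only "cluster" is confirmed by the
    traceback; "general_params" is a guess at the v1.x layout.
    """
    suspects = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                data = json.load(fh)
        except (OSError, ValueError) as exc:
            # JSONDecodeError and UnicodeDecodeError are both ValueErrors.
            suspects.append((path.name, f"unreadable: {exc}"))
            continue
        if not isinstance(data, dict):
            suspects.append((path.name, "top level is not a JSON object"))
        elif not any(key in data for key in expected_keys):
            suspects.append((path.name, f"top-level keys: {sorted(data)}"))
    return suspects
```

Calling find_unrecognised_json on the dataset's mibig_json folder and printing the result should show which file, if any, differs from the rest.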

If that doesn't help you should probably open a copy of this issue over in the core NPLinker repo as this isn't a problem with the webapp itself.

ialas commented 1 year ago

Thank you for the information! After tinkering around with it, it turns out that when using MIBIG version 1.4, the program requests BGC0002000 from MIBIG. After downloading it, it runs into the error seen above when trying to perform some function with that BGC.

Deleting the MIBIG 1.4.tar.gz file and/or the mibig_json folder didn't help, as it simply downloads that BGC again.

The only solution I've found so far is to download mibig_json_2.0.tar.gz, and extract the contents into the mibig_json folder (replacing the previous 1.4 mibig files entirely).
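For anyone repeating this workaround, the replacement step can be scripted roughly as follows. This is a sketch under assumptions: replace_mibig_json is a hypothetical helper, and because the internal layout of the archive is not described here, it simply copies every .json member into the target folder regardless of any sub-directory inside the tarball.

```python
import tarfile
from pathlib import Path


def replace_mibig_json(archive_path, mibig_json_dir):
    """Replace the .json files in mibig_json_dir with those in a .tar.gz.

    Returns the number of JSON files written. Sub-directories inside the
    archive are flattened (an assumption about the desired layout).
    """
    dest = Path(mibig_json_dir)
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only remove the old files once the archive has opened successfully.
        for old in dest.glob("*.json"):
            old.unlink()
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            source = tar.extractfile(member)
            if source is None:
                continue
            # Keep only the base name so nothing is written outside dest.
            (dest / Path(member.name).name).write_bytes(source.read())
            count += 1
    return count
```

For example, replace_mibig_json("mibig_json_2.0.tar.gz", "mibig_json") would be run from the dataset folder; both paths depend on where the files were downloaded.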

Thank you very much for your assistance!

andrewramsay commented 1 year ago

Good to hear it's working now! To avoid having to replace the files manually, you should be able to tell NPLinker to use v2.0 by setting mibig_version in your NPLinker config file:

[dataset]
root = "..."
mibig_version = "2.0"

(see here)

I think what's happening is:

So chances are the JSON format has been changed again in the v3.x database and the v3.x data in the downloaded file isn't recognised by NPLinker.

I'll create another issue in the NPLinker repo to note that this will need fixed at some point, thanks for reporting it!