NPLinker / nplinker-webapp

(KeyError: 79288) when running local dataset using docker image on windows #14

Closed ialas closed 1 month ago

ialas commented 1 year ago

I've observed the error "KeyError: 79288" when running the docker image of nplinker-webapp on a local dataset generated using feature-based molecular networking through GNPS.

Full error details: nplinkerError-A.txt

Relevant text:

19:23:40 [INFO] loader.py:803, Loaded global strain IDs (0 total)
19:23:40 [INFO] loader.py:817, Loaded dataset strain IDs (38 total)
19:23:42 [INFO] metabolomics.py:771, 6897 molecules parsed from MGF file
2023-01-18 19:23:43,244 Error in server loaded hook KeyError(79288)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bokeh/server/contexts.py", line 193, in run_load_hook
    self._application.on_server_loaded(self.server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/application.py", line 209, in on_server_loaded
    h.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/directory.py", line 262, in on_server_loaded
    return self._lifecycle_handler.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/lifecycle.py", line 92, in on_server_loaded
    return self._on_server_loaded(server_context)
  File "/app/nplinker/server_lifecycle.py", line 62, in on_server_loaded
    nh.load()
  File "/app/nplinker/server_lifecycle.py", line 20, in load
    self.load_nplinker()
  File "/app/nplinker/server_lifecycle.py", line 43, in load_nplinker
    if not self.nplinker.load_data():
  File "/usr/local/lib/python3.10/site-packages/nplinker/nplinker.py", line 272, in load_data
    if not self._loader.load(met_only=met_only):
  File "/usr/local/lib/python3.10/site-packages/nplinker/loader.py", line 334, in load
    if not self._load_metabolomics():
  File "/usr/local/lib/python3.10/site-packages/nplinker/loader.py", line 714, in _load_metabolomics
    spec_dict, self.spectra, self.molfams, unknown_strains = load_dataset(
  File "/usr/local/lib/python3.10/site-packages/nplinker/metabolomics.py", line 778, in load_dataset
    load_edges(edges_file, spec_dict)
  File "/usr/local/lib/python3.10/site-packages/nplinker/metabolomics.py", line 392, in load_edges
    spec1 = spec_dict[spec1_id]
KeyError: 79288

Specific information:

Windows version: 10.0, Build 19044, Windows Education
Docker version: 20.10.17, build 100c701
Docker image pulled: sha256:4092f31590227942823b785627aa4d9488a546fe73fe6bc52f07dd22d0ef2fb6
GNPS was run using FEATURE-BASED-MOLECULAR-NETWORKING (version release_28.2)
TOML file contained: image
Dataset_2 can be seen here: image

justinjjvanderhooft commented 1 year ago

It is hard to solve this type of error as it depends on your input data: it seems to be missing a specific spectrum with the ID 79288. You could manually try to find it in the GNPS nodes/edges files and in the MGF file to see if and where it is. It may have been filtered out in only one of the two files? @hechth may also have ideas, as he recently worked on the mass spec side of things (loading the mass spec data).
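The manual check suggested above can be sketched as a small script that lists spectrum IDs referenced in the edges file but absent from the MGF file. This is only a sketch under assumptions that are not confirmed in this thread: that the MGF stores each ID on a `SCANS=` line, and that the tab-separated edges file has ID columns named `CLUSTERID1` and `CLUSTERID2`. The function names are hypothetical.

```python
import csv
import io


def mgf_scan_ids(mgf_text):
    """Collect the integer IDs found on SCANS= lines (assumed MGF field)."""
    ids = set()
    for line in mgf_text.splitlines():
        line = line.strip()
        if line.upper().startswith("SCANS="):
            ids.add(int(line.split("=", 1)[1]))
    return ids


def edge_ids(edges_text, id_columns=("CLUSTERID1", "CLUSTERID2")):
    """Collect every spectrum ID referenced in a tab-separated edges file.

    The column names are an assumption; adjust them to match your file.
    """
    reader = csv.DictReader(io.StringIO(edges_text), delimiter="\t")
    ids = set()
    for row in reader:
        for col in id_columns:
            ids.add(int(row[col]))
    return ids


def ids_missing_from_mgf(edges_text, mgf_text):
    """IDs that appear in the edges file but not in the MGF, sorted."""
    return sorted(edge_ids(edges_text) - mgf_scan_ids(mgf_text))
```

To use it on real files, read each file's contents (for example with `pathlib.Path(...).read_text()`) and pass the two strings to `ids_missing_from_mgf`. Any ID it returns would trigger a lookup failure like the one in the traceback.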

ialas commented 1 year ago

Thank you, I wasn't sure if 79288 was referring to a specific type of error or if it was an issue with my input data. I will double check and see if I can resolve it on my end, then close the issue if I resolve it.

hechth commented 1 year ago

@ialas which version of NPLinker are you using? The development version from GitHub or a stable release?

ialas commented 1 year ago

I believe I'm using a stable release pulled from docker, accessed by using docker pull nlesc/nplinker:latest. This would be the version that was pushed 5 months ago by clgeng, according to docker, with the digest being 4092f3159022.

In the process of tracking down the issue, I've narrowed it down to an interplay between these elements. Here's the mgf file, specifically scan 79288. image

Here's the edges file that I believe is referenced during this process, after the software constructs a dictionary of spectra indexed by spectrum ID, specifically showing 79288. image

What I think is happening (very tentatively) is that when spec_dict[79288] is evaluated, it asks spec_dict for the entry for 79288, but there was no MS2 for 79288, so it raises an error. I don't know what "new_ms1" in the code of the LOADMGF class is supposed to represent, so I don't know how to resolve this.
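One way to test this tentative hypothesis is to list MGF entries that carry no MS2 peak lines. The sketch below assumes the usual MGF layout (`BEGIN IONS` / `END IONS` blocks, a `SCANS=` field, and peak lines that start with a digit). Whether NPLinker actually drops such entries when building spec_dict is not confirmed in this thread, and the function name is hypothetical.

```python
def scans_without_peaks(mgf_text):
    """Return the SCANS IDs of MGF blocks that contain no peak lines."""
    empty = []
    scan, n_peaks, in_block = None, 0, False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            # Start of a new spectrum block: reset the per-block state.
            scan, n_peaks, in_block = None, 0, True
        elif line == "END IONS":
            if in_block and scan is not None and n_peaks == 0:
                empty.append(scan)
            in_block = False
        elif in_block and line.upper().startswith("SCANS="):
            scan = int(line.split("=", 1)[1])
        elif in_block and line and line[0].isdigit():
            # Peak lines look like "<m/z> <intensity>" and start with a digit.
            n_peaks += 1
    return empty
```

If 79288 shows up in the result while also appearing in the edges file, that would be consistent with the explanation above, though it would not prove it.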

justinjjvanderhooft commented 1 year ago

@hechth could you comment on this? Thx!

ialas commented 1 year ago

We were unable to track down the issue, but we thought it might be an issue in data pre-processing.

So, we switched to utilizing Metaboscape, then generating GNPS files through GNPS FBMN.

We've run into a new error with regard to parsing the strain names on the metabolomics side. We've tried every possible form of strain name in strain_mappings.csv that should match the strains used for metabolomics, but can't resolve this issue.

Is there information on how specifically the metabolomics strain names are parsed from the GNPS files? Is it being pulled from the params.xml, or another file? Once we have exact clarification on how the names are being parsed, I believe we can resolve this issue and identify why this error has been occurring.

2023-01-25 15:20:39,427 Starting Bokeh server version 2.4.3 (running on Tornado 6.2)
2023-01-25 15:20:39,428 User authentication hooks NOT provided (default user enabled)
2023-01-25 15:20:39,430 Bokeh app running at: http://localhost:5006/nplinker
2023-01-25 15:20:39,430 Starting Bokeh server with process id: 1
on_server_loaded
DATAPATH: /data/nplinker.toml
15:20:39 [DEBUG] config.py:123, Parsing default config file: /data/.config/nplinker/nplinker.toml
15:20:39 [DEBUG] config.py:127, Loading user config /data/nplinker.toml
15:20:39 [INFO] config.py:164, Loading from local data in directory /data/dataset_6
15:20:39 [DEBUG] loader.py:191, DatasetLoader(/data/dataset_6, , False)
15:20:39 [DEBUG] nplinker.py:142, Enabled scoring method: metcalf
15:20:39 [DEBUG] nplinker.py:142, Enabled scoring method: testscore
15:20:39 [DEBUG] nplinker.py:142, Enabled scoring method: rosetta
15:20:39 [DEBUG] nplinker.py:142, Enabled scoring method: npclassscore
15:20:39 [DEBUG] nplinker.py:267, load_data (normal case, full load, met_only=False)
15:20:39 [WARNING] loader.py:53, WARNING: unable to find metadata_table_file in path "/data/dataset_6/metadata_table/metadata_table*.txt"
15:20:39 [INFO] loader.py:95, Trying to discover correct bigscape directory under /data/dataset_6/bigscape
15:20:39 [INFO] loader.py:99, Found network files directory: /data/dataset_6/bigscape/network_files/2023-01-24_19-31-03_hybrids_glocal
15:20:39 [INFO] loader.py:288, Updating bigscape_dir to discovered location /data/dataset_6/bigscape/network_files/2023-01-24_19-31-03_hybrids_glocal
15:20:39 [INFO] loader.py:803, Loaded global strain IDs (0 total)
15:20:39 [INFO] loader.py:817, Loaded dataset strain IDs (38 total)
15:20:43 [INFO] metabolomics.py:771, 16413 molecules parsed from MGF file
15:20:44 [DEBUG] metabolomics.py:371, loading edges file: /data/dataset_6/networkedges_selfloop/a2afb56e07d140a3ba4bcb2f059c7cd6..selfloop [16413 spectra from MGF]
15:20:44 [DEBUG] metabolomics.py:782, Nodes_file: /data/dataset_6/clusterinfo_summary/c09cf56f8ca64accb1a4fdbe747299b8.tsv, quant_table_exists?: False
15:20:44 [INFO] metabolomics.py:794, quantification table exists, new-style GNPS dataset
15:20:44 [INFO] metabolomics.py:688, Merged nodes data (new-style), total lines = 16413
15:20:45 [DEBUG] metabolomics.py:837, make_families: 2057 molams + 6068 singletons
15:20:45 [WARNING] loader.py:720, Writing unknown strains from METABOLOMICS data to /data/dataset_6/unknown_strains_met.csv
15:20:45 [INFO] loader.py:729, Loading provided annotation files (/data/dataset_6/DB_result)
15:20:45 [DEBUG] annotations.py:82, Parsed 0 annotations configuration entries
15:20:45 [DEBUG] annotations.py:90, Found 1 annotations .tsv files in /data/dataset_6/DB_result
15:20:45 [DEBUG] annotations.py:100, Parsing GNPS annotations from /data/dataset_6/DB_result/34a260c67dbf470093cf70438e0b0114.tsv
15:20:45 [DEBUG] loader.py:548, Collecting .gbk files (and possibly renaming)
15:20:45 [DEBUG] loader.py:557, Checking for spaces in antiSMASH folder names...
15:20:46 [DEBUG] loader.py:585, .gbk collection took 0.910s
15:20:46 [DEBUG] loader.py:598, make_mibig_bgc_dict(/data/dataset_6/mibig_json)
15:20:46 [INFO] genomics.py:538, Found 1910 MiBIG json files
15:20:48 [DEBUG] loader.py:602, mibig_bgc_dict has 1910 entries
15:20:48 [DEBUG] loader.py:676, Generating antiSMASH filename cache...
15:20:48 [DEBUG] loader.py:686, Cache generation took 0.002s
15:20:48 [DEBUG] loader.py:688, loadBGC_from_cluster_files(antismash_dir=/data/dataset_6/antismash, delimiters=['.', '_', '-'])
15:20:49 [INFO] genomics.py:266, Using antiSMASH filename delimiters ['.', '_', '-']
15:21:03 [INFO] genomics.py:431, # MiBIG BGCs = 60, non-MiBIG BGCS = 2643, total bgcs = 2703, GCFs = 399, strains=1948
15:21:03 [INFO] genomics.py:497, Filtering MiBIG BGCs: removing 6 GCFs and 19 BGCs
15:21:03 [INFO] genomics.py:443, # after filtering, total bgcs = 823, GCFs = 393, strains=68, unknown_strains=0
15:21:03 [DEBUG] genomics.py:450, Loading .network files
15:21:03 [WARNING] loader.py:704, Writing unknown strains from GENOMICS data to /data/dataset_6/unknown_strains_gen.csv
/usr/local/lib/python3.10/site-packages/nplinker/class_info/class_matches.py:248: FutureWarning: In a future version, passing float-dtype values containing NaN and an integer dtype will raise IntCastingNaNError (subclass of ValueError) instead of silently ignoring the passed dtype. To retain the old behavior, call Series(arr) or DataFrame(arr) without passing a dtype.
  counts_df = pd.DataFrame.from_dict(counts, dtype=int)
15:21:03 [INFO] class_matches.py:44, Loaded MIBiG classes, and class matching tables
15:21:03 [INFO] chem_classes.py:102, No CANOPUS results present at /data/dataset_6/canopus. (set run_canopus=true in the .toml to run CANOPUS)
15:21:03 [INFO] chem_classes.py:538, No MolNetEnhancer result present at /data/dataset_6/molnetenhancer. (run it on GNPS and download it here if you want to use it)
15:21:03 [DEBUG] loader.py:468, Loading params.xml
15:21:03 [DEBUG] loader.py:478, Parsed 39 GNPS params
15:21:03 [DEBUG] loader.py:441, Filtering strains: genomics count 68, metabolomics count: 0
15:21:03 [DEBUG] loader.py:444, Common strains found: 0
15:21:03 [INFO] loader.py:449, Writing common strain labels to /data/dataset_6/common_strains.csv
15:21:03 [INFO] loader.py:462, Strains filtered down to total of 0
15:21:03 [INFO] loader.py:371, No further strain filtering to apply
2023-01-25 15:21:03,382 Error in server loaded hook Exception('Failed to find *ANY* strains, missing strain_mappings.csv?')
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/bokeh/server/contexts.py", line 193, in run_load_hook
    self._application.on_server_loaded(self.server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/application.py", line 209, in on_server_loaded
    h.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/directory.py", line 262, in on_server_loaded
    return self._lifecycle_handler.on_server_loaded(server_context)
  File "/usr/local/lib/python3.10/site-packages/bokeh/application/handlers/lifecycle.py", line 92, in on_server_loaded
    return self._on_server_loaded(server_context)
  File "/app/nplinker/server_lifecycle.py", line 62, in on_server_loaded
    nh.load()
  File "/app/nplinker/server_lifecycle.py", line 20, in load
    self.load_nplinker()
  File "/app/nplinker/server_lifecycle.py", line 43, in load_nplinker
    if not self.nplinker.load_data():
  File "/usr/local/lib/python3.10/site-packages/nplinker/nplinker.py", line 272, in load_data
    if not self._loader.load(met_only=met_only):
  File "/usr/local/lib/python3.10/site-packages/nplinker/loader.py", line 358, in load
    raise Exception(
Exception: Failed to find *ANY* strains, missing strain_mappings.csv?

hechth commented 1 year ago

@ialas are you sure that the strain_mappings.csv file is in the correct location? Are you running the tool from Docker?

ialas commented 1 year ago

image

I am running the tool from Docker on Windows. My setup can be seen in the first comment, and nothing has changed about the software I've used since then.

andrewramsay commented 1 year ago

For FBMN datasets, it should be basing the strain names off the column headers in the quantification_table/<something>.tsv file. This should have a collection of columns named <some file> Peak area, like some_ID.mzXML Peak area. It should just take the first part of the filename as the strain ID, e.g. here it would be some_ID.

There are some extra steps needed for certain situations (the code is here), but that's basically what should be happening.

I'm not sure what's going on here as your log output seems to show it's parsing everything OK and there are no warnings about unknown strains that should be produced if it couldn't find matches in your strain_mappings.csv. But then when it goes to filter down to common strains, it shows there are already no metabolomic strains before the filtering is applied:

15:21:03 [DEBUG] loader.py:441, Filtering strains: genomics count 68, metabolomics count: 0

Also if there were strains being parsed but not recognised they should be written to unknown_strains_met.csv, but from your screenshot it looks like that file is probably empty?

For this to happen without logging any warnings could mean there's something about your dataset that isn't currently accounted for in NPLinker... maybe some column names are in a different format or something like that and so it's not actually managing to get as far as parsing strain IDs at all.

Could you post the first line of the .tsv file in quantification_table so we could see what the column naming looks like?

ialas commented 1 year ago

Here is the first line of the *.tsv file in quantification_table/:

SHARED_NAME,row ID,RT,PEPMASS,SIGMA_SCORE,NAME_METABOSCAPE,MOLECULAR_FORMULA,ADDUCT,KEGG,CAS,MaxIntensity,"WMMA1947_1-A,2_01_25959","WMMC273_1-A,3_01_25960","WMMA1949_1-A,4_01_25961","WMMD980_1-A,5_01_25963","WMMD714_1-A,6_01_25964","WMMD967_1-A,7_01_25965","WMMD1127_1-A,8_01_25967","WMMD718_1-B,1_01_25968","WMMD975_1-B,2_01_25969","WMMD712_1-B,3_01_25971","WMMD964_1-B,4_01_25972","WMMC250_1-B,5_01_25973","WMMC264_1-B,6_01_25975","WMMA1976_1-B,8_01_25977","WMMA2056_1-F,5_01_26016","WMMD573_1-C,3_01_25981","WMMA2059_1-F,6_01_26017","WMMC514_1-C,4_01_25983","WMMD956_1-C,5_01_25984","WMMD937_1-C,6_01_25985","WMMD1076_1-C,7_01_25987","WMMD998_1-C,8_01_25988","WMMD987_1-D,1_01_25989","WMMC241_1-D,2_01_25991","WMMD406_1-D,3_01_25992","WMMD792_1-D,4_01_25993","WMMD1128_1-D,6_01_25996","WMMD1120_1-D,8_01_25999","WMMD1129_1-E,1_01_26000","WMMD1102_1-E,5_01_26005","WMMD791_1-E,6_01_26007","WMMD1082_1-E,8_01_26009","WMMD1155_1-F,1_01_26011","WMMD882_1-F,2_01_26012","WMMD1047_1-F,3_01_26013","WMMA1998_1-C,1_01_25979","WMMD710_1-F,4_01_26015","WMMD961_1-C,2_01_25980"

Here is the first line of the *.tsv file in quantification_table_reformatted/:

row ID,row retention time,row m/z,"WMMA1947_1-A,2_01_25959.d Peak area","WMMC273_1-A,3_01_25960.d Peak area","WMMA1949_1-A,4_01_25961.d Peak area","WMMD980_1-A,5_01_25963.d Peak area","WMMD714_1-A,6_01_25964.d Peak area","WMMD967_1-A,7_01_25965.d Peak area","WMMD1127_1-A,8_01_25967.d Peak area","WMMD718_1-B,1_01_25968.d Peak area","WMMD975_1-B,2_01_25969.d Peak area","WMMD712_1-B,3_01_25971.d Peak area","WMMD964_1-B,4_01_25972.d Peak area","WMMC250_1-B,5_01_25973.d Peak area","WMMC264_1-B,6_01_25975.d Peak area","WMMA1976_1-B,8_01_25977.d Peak area","WMMA2056_1-F,5_01_26016.d Peak area","WMMD573_1-C,3_01_25981.d Peak area","WMMA2059_1-F,6_01_26017.d Peak area","WMMC514_1-C,4_01_25983.d Peak area","WMMD956_1-C,5_01_25984.d Peak area","WMMD937_1-C,6_01_25985.d Peak area","WMMD1076_1-C,7_01_25987.d Peak area","WMMD998_1-C,8_01_25988.d Peak area","WMMD987_1-D,1_01_25989.d Peak area","WMMC241_1-D,2_01_25991.d Peak area","WMMD406_1-D,3_01_25992.d Peak area","WMMD792_1-D,4_01_25993.d Peak area","WMMD1128_1-D,6_01_25996.d Peak area","WMMD1120_1-D,8_01_25999.d Peak area","WMMD1129_1-E,1_01_26000.d Peak area","WMMD1102_1-E,5_01_26005.d Peak area","WMMD791_1-E,6_01_26007.d Peak area","WMMD1082_1-E,8_01_26009.d Peak area","WMMD1155_1-F,1_01_26011.d Peak area","WMMD882_1-F,2_01_26012.d Peak area","WMMD1047_1-F,3_01_26013.d Peak area","WMMA1998_1-C,1_01_25979.d Peak area","WMMD710_1-F,4_01_26015.d Peak area","WMMD961_1-C,2_01_25980.d Peak area"

andrewramsay commented 1 year ago

Sorry, forgot it was the _reformatted folder it reads the file from!

At least part of the problem here is the current version of NPLinker relies on these Peak area column names referencing files with a .mzML or .mzXML extension. It searches for the characters ".mz" and takes the text up to that point as the initial strain name.

So unfortunately the columns here won't be getting handled correctly - I guess it should be using ".d" in place of ".mz" in this case. There isn't any way to change this through the config file, it's hardcoded.
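The behaviour described above can be illustrated with a minimal sketch. This is not NPLinker's actual code: the function name and the `marker` parameter are hypothetical, and the real loader has the ".mz" search hardcoded with extra steps for certain situations.

```python
def strain_from_column(column_name, marker=".mz"):
    """Return the text before `marker` in a 'Peak area' column name.

    Returns None when the column is not a 'Peak area' column or when
    the marker is absent, which mimics a column being silently skipped.
    """
    if "Peak area" not in column_name:
        return None
    end = column_name.find(marker)
    if end == -1:
        return None
    return column_name[:end]


# An .mzXML column yields a strain name ...
print(strain_from_column("some_ID.mzXML Peak area"))
# ... while a Bruker ".d" column yields nothing with the ".mz" marker.
print(strain_from_column("WMMA1947_1-A,2_01_25959.d Peak area"))
# Searching for ".d" instead would recover the name.
print(strain_from_column("WMMA1947_1-A,2_01_25959.d Peak area", marker=".d"))
```

With every column skipped in this way, no metabolomics strain names are produced at all, which would match the "metabolomics count: 0" line in the log.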

As a temporary fix you could try a search and replace on that first line to swap all the ".d" parts to ".mz".
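A sketch of that search and replace, restricted to the header line so that data rows are left alone. It replaces the whole ".d Peak area" suffix with ".mzML Peak area" (which contains ".mz") rather than every ".d", to avoid touching other text; that choice, the function names, and the file paths are assumptions made for illustration.

```python
from pathlib import Path


def rewrite_header(table_text, old=".d Peak area", new=".mzML Peak area"):
    """Replace `old` with `new` on the first line only."""
    first, sep, rest = table_text.partition("\n")
    return first.replace(old, new) + sep + rest


def patch_file(src, dst):
    """Write a patched copy of `src` to `dst`, leaving the original intact."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(rewrite_header(text), encoding="utf-8")
```

Writing to a separate file keeps the original GNPS download available in case the rewritten header causes further issues.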

I'm not sure if that will end up causing further issues but it should at least allow it to actually start parsing the names instead of ignoring everything.

justinjjvanderhooft commented 1 year ago

Thanks @ialas for pointing this out - FYI: with @hechth and the Netherlands eScience Center we are working on an updated code base that will hopefully be able to handle more different inputs in a neater way. Thanks for being enthusiastic about NPLinker and apologies for the struggle to get it to run. Bear with us....

ialas commented 1 year ago

No problem! I'm excited to start leveraging this tool to get cool insights into our genomes. I'll report back if there are any other issues.

ialas commented 1 year ago

Hello! I've tried several things to resolve this issue.

A

  1. I edited the qiime2_output\qiime2_manifest.tsv and qiime2_output\qiime2_metadata.tsv to end with file extension *.mzML instead of *.d.
  2. I edited the quantification_table_reformatted\*.csv to replace the column names with *.mzML Peak area instead of *.d Peak area.
  3. I changed the strain_mappings.csv to only include information prior to the *.d, so we currently have two columns. One contains the genome names noted in the antiSMASH folders, and one contains the metabolomics-related strain names. Example: WMMA1947_1-A,2_01_25959 instead of WMMA1947_1-A,2_01_25959.d or *.mzML or WMMA1947_1-A\,2_01_25959 and various other permutations we had tried.

Upon running this and iterating through potential strain_mappings names to see if we could cover every possibility, the same error still occurred.

B

  1. We took the original *.d files before any processing. I edited the folder names, and any file names, to replace the commas (as seen in WMMA1947_1-A,2_01_25959) with underscores (WMMA1947_1-A_2_01_25959).
  2. We ran the files through Metaboscape, converted the output into a GNPS-compatible input, and ran it through GNPS FBMN.
  3. We downloaded everything under "Download Cytoscape Data", extracted it, and transferred it to the correct folders/directories as per the wiki.
  4. I then performed step A2, as this error should only arise when the software is referencing the quantification tables, not the qiime2 output. To clarify: even after changing the folder and file names prior to data processing, GNPS still wrote the quantification/qiime2 tables with the original strain names (commas included). This indicated to us that somewhere in the *.d raw folders/files there is a reference to the original name, so simply changing the file names is not enough, and we would probably have to make more extensive changes to remove the commas from our metabolomics runs, if this is indeed a comma-related issue with the parser somewhere.
  5. I then performed step A3 (strain_mappings.csv), but changed the metabolomics-related strain name to WMMA1947_1-A_2_01_25959.

Upon running this, the same error still remained. Exception: Failed to find *ANY* strains, missing strain_mappings.csv?

andrewramsay commented 1 year ago

The presence of commas in the IDs is definitely going to cause some problems setting up strain mappings here, I'd assumed there would never be commas in the names. So maybe that .csv file needs to be something like a .tsv instead since it obviously isn't ideal to have to edit your data to make it work.
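To make the comma problem concrete, here is a small sketch comparing a naive split, Python's `csv` module, and a tab-separated row. The row content (`genome_A`) is hypothetical, and how NPLinker itself reads strain_mappings.csv is not shown in this thread.

```python
import csv
import io

# Hypothetical mapping row: a metabolomics ID containing a comma, then a genome ID.
csv_line = '"WMMA1947_1-A,2_01_25959",genome_A'

# A naive split breaks the quoted ID into two fields.
naive = csv_line.split(",")

# The csv module honours the quotes and keeps the ID intact.
quoted = next(csv.reader(io.StringIO(csv_line)))

# With tabs as the delimiter, commas inside IDs need no quoting at all.
tsv_line = "WMMA1947_1-A,2_01_25959\tgenome_A"
tabbed = tsv_line.split("\t")

print(naive)
print(quoted)
print(tabbed)
```

A parser that splits on every comma sees three fields in the first case, so the ID can never match; quoting-aware parsing or a tab delimiter both avoid that.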

After editing the "Peak area" column names are you now seeing anything listed in "unknown_strains_met.csv"? There should be some content in that file if it's parsing anything out of the names, even if it's not matching anything in the mappings file (it should also be logging some unknown strain messages if this is happening).

It sounds like properly fixing this will involve a few different changes to NPLinker. But since I'm not involved in the development any more I'm not sure if some of this might already be fixed in the dev version? I haven't been keeping up with all the changes in that branch!

ialas commented 1 year ago

image There's nothing listed in the "unknown_strains_met.csv" file besides the column header.

andrewramsay commented 1 year ago

There's still something not working when it comes to the parsing of the columns then. Unfortunately there aren't a lot of logging messages in that area of the code so it's harder to figure out what it's actually doing in a situation like this.

I think we'd better wait to hear from the others about the best way forward, but if you're not seeing any content in the unknown strains file and you're still seeing the log message that shows 0 metabolomic strains:

Filtering strains: genomics count 68, metabolomics count: 0

then nothing you do with strain_mappings.csv is going to help. The error message references that file because normally if it ends up with no strains in common between genomics and metabolomics it's because the mappings are at fault, but in your case it's just not picking up any metabolomics strains so it can never find any common ones.
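The reasoning above can be sketched as a set intersection. This is an illustration only, with hypothetical names and a simplified alias table standing in for strain_mappings.csv; it is not how NPLinker is implemented.

```python
def common_strains(genomics, metabolomics, aliases):
    """Strains present on both sides after mapping aliases to one name."""

    def canonical(name):
        # Unmapped names are kept as they are.
        return aliases.get(name, name)

    return {canonical(g) for g in genomics} & {canonical(m) for m in metabolomics}


aliases = {"genome_1": "strain_1", "met_1": "strain_1"}

# With an empty metabolomics side, no mapping can produce a match.
print(common_strains({"genome_1"}, set(), aliases))
# Once a metabolomics strain is parsed, the mapping takes effect.
print(common_strains({"genome_1"}, {"met_1"}, aliases))
```

The first call returns an empty set regardless of what the alias table contains, which is why editing the mappings cannot help until metabolomics strains are actually parsed.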

Anyway hopefully this is something that can be sorted out with some relatively small changes!

justinjjvanderhooft commented 1 year ago

@hechth could you comment on this, as you have been working most recently on the metabolomics side of NPLinker....

hechth commented 1 year ago

I think trying to find the solution with the names of the files will become very difficult - I'd recommend using the development version of NPLinker.

@ialas do you have access to a linux platform or server?

gcroci2 commented 1 month ago

A new version of the webapp will be implemented using Plotly Dash. The older version is not maintained anymore, so I'll close this issue.