Loading FBMN-GNPS and Bigscape output data into NPLinker error.

jeep3 commented 3 years ago

Hi Andrew,

Recently I tried loading FBMN (GNPS) and Bigscape output data into NPLinker for analysis, and an error was reported during loading. Could you help take a look? BTW, loading classical molecular network data from GNPS works without any problem.

------error log------
2021-01-04 22:33:59,664 Error in server loaded hook KeyError('',)
Traceback (most recent call last):
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/server/contexts.py", line 174, in run_load_hook
    result = self._application.on_server_loaded(self.server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/application.py", line 193, in on_server_loaded
    h.on_server_loaded(server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/handlers/directory.py", line 203, in on_server_loaded
    return self._lifecycle_handler.on_server_loaded(server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/handlers/lifecycle.py", line 81, in on_server_loaded
    return self._on_server_loaded(server_context)
  File "/app/webapp/npapp/server_lifecycle.py", line 473, in on_server_loaded
    nh.load()
  File "/app/webapp/npapp/server_lifecycle.py", line 263, in load
    self.load_nplinker()
  File "/app/webapp/npapp/server_lifecycle.py", line 285, in load_nplinker
    if not self.nplinker.load_data():
  File "/app/webapp/npapp/../../prototype/nplinker/nplinker.py", line 243, in load_data
    if not self._loader.load(met_only=met_only):
  File "/app/webapp/npapp/../../prototype/nplinker/loader.py", line 229, in load
    self._filter_strains()
  File "/app/webapp/npapp/../../prototype/nplinker/loader.py", line 247, in _filter_strains
    self.strains.filter(common_strains)
  File "/app/webapp/npapp/../../prototype/nplinker/strains.py", line 64, in filter
    self.remove(strain)
  File "/app/webapp/npapp/../../prototype/nplinker/strains.py", line 56, in remove
    del self._lookup[alias]
KeyError: ''

Thanks.

andrewramsay commented 3 years ago

The problem seems to be that there is at least one strain with an empty name in the collection. I could add a fix for this particular error but it'd be good to understand why it's happening in the first place and make sure it's fixed properly.

Do you have any empty fields in your strain_mappings.csv file? For example:

abc,def,,

or

abc,,def

jeep3 commented 3 years ago

There don't seem to be any empty fields. It is the same sample and the same genomic data; the only difference is whether the GNPS data is FBMN or classical MN. Curiously, loading the classical MN data into NPLinker works without problems, but the FBMN data gives this error. This is the strain_mappings.csv file: strain_mappings.xlsx

andrewramsay commented 3 years ago

The strain_mappings.xlsx file you have looks OK itself, but are you exporting it to CSV from that format? If so Excel may be adding empty fields automatically because it assumes the number of columns should be the same on each line. For example the line with only 2 columns became:

CJ-3-1A,CJ-3-1.mzXML,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

when I tried exporting it as a CSV. Can you check if your .csv file has entries like this? If you have been using the same .csv file for the classical data I don't know why it would only be a problem for FBMN, but if the file is newly generated this might be the issue.
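A quick way to check a mappings file for rows like these is sketched below. This is a minimal Python sketch and not part of NPLinker; the function names `find_empty_fields` and `strip_empty_fields` are hypothetical, and the sample text just reuses the example rows from this thread.

```python
import csv
import io


def find_empty_fields(csv_text):
    """Return (line_number, row) pairs for rows with at least one empty field."""
    reader = csv.reader(io.StringIO(csv_text))
    return [
        (number, row)
        for number, row in enumerate(reader, start=1)
        if any(field.strip() == "" for field in row)
    ]


def strip_empty_fields(csv_text):
    """Return the CSV text with empty fields dropped from every row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        cleaned = [field.strip() for field in row if field.strip()]
        if cleaned:  # skip rows that were entirely empty
            writer.writerow(cleaned)
    return out.getvalue()


# Sample rows modelled on the examples in this thread.
sample = "abc,def,,\nCJ-3-1A,CJ-3-1.mzXML,,,,\nxyz,uvw\n"

for number, row in find_empty_fields(sample):
    print(f"line {number} has empty fields: {row}")
print(strip_empty_fields(sample))
```

To run this against a real file, read it first with something like `open("strain_mappings.csv", newline="").read()` and pass the resulting text to either function.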

The other thing that would help me here is a copy of the full log from nplinker as it loads the data. You can get that by running docker logs webapp > log.txt.

This is unrelated, but I think you could make your strain_mappings.xlsx file a lot smaller! Most of the columns are .gbk filenames and it shouldn't be necessary to list all of them individually like that. nplinker should be able to parse the strain names from these automatically because it defaults to trying to extract the text up to the first underscore (plus a couple of other characters) to use as a strain name.

So a name like "AnHHF-34A_c00001_ctg1_pi...region001.gbk" would become "AnHHF-34A", which would already match the text in the first column of that line. It looks like all the other strain names would match the same rule, so most lines of the mappings file would only need a couple of columns.
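The rule described above can be approximated in one line of Python. This is only a sketch: the function name `strain_from_gbk` is hypothetical, and it splits on the underscore alone, whereas the comment above says NPLinker also considers a couple of other characters.

```python
def strain_from_gbk(filename):
    """Guess a strain name as the text before the first underscore."""
    # Assumption: only "_" is treated as a delimiter here. NPLinker's own
    # rule reportedly handles a couple of other characters as well.
    return filename.split("_", 1)[0]


print(strain_from_gbk("AnHHF-34A_c00001_ctg1_pi...region001.gbk"))
```

A filename with no underscore at all comes back unchanged from this sketch, extension included, so such files would still need an explicit entry in the mappings file.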

jeep3 commented 3 years ago

Thanks, Andrew, I will try your suggestion. The following is the log file of this time: FBMN log.txt

jeep3 commented 3 years ago

Because GitHub does not support uploading .csv files, I just converted the .csv file to .xlsx format and uploaded it to you. But I found that I have a problem. There is an additional sample of CJ-3-1 in the metabolome data, and there is no such sample in the genome data. Does this have any effect?

andrewramsay commented 3 years ago

> Because GitHub does not support uploading .csv files, I just converted the .csv file to .xlsx format and uploaded it to you.

Oh OK, that should be fine then.

> the following is the log file of this time.

Is that the full output? The log.txt file it generates should contain all the output from the docker run ... command, that just looks like the last chunk instead of the whole thing.

> There is an additional sample of CJ-3-1 in the metabolome data, and there is no such sample in the genome data. Does this have any effect?

NPLinker is designed to filter the full list of strains it finds down to a common set that appear in both the genomics and metabolomics data. According to the snippet of log output you originally posted, the error is happening during this filtering process. It might be related somehow, but since that code is intended to handle cases where a strain doesn't appear on one side or the other it shouldn't be crashing if it finds one that doesn't exist in the genomics data.
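The idea of filtering down to a common set can be sketched as a set intersection over resolved strain names. This is an illustration only, not NPLinker's implementation: `build_alias_lookup` and `common_strains` are hypothetical names, and the mapping rows are made-up examples in the style of this dataset.

```python
def build_alias_lookup(mapping_rows):
    """Map every label on a mappings row to the row's first label."""
    lookup = {}
    for row in mapping_rows:
        canonical = row[0]
        for alias in row:
            lookup[alias] = canonical
    return lookup


def common_strains(lookup, genomic_ids, metabolomic_ids):
    """Return the canonical strain names that appear on both sides."""
    genomic = {lookup[i] for i in genomic_ids if i in lookup}
    metabolomic = {lookup[i] for i in metabolomic_ids if i in lookup}
    return genomic & metabolomic


# Illustrative rows only; these are not the contents of the real file.
rows = [["AnHHF-34A", "AnHHF-34"], ["AnHSZ-53A", "AnHSZ-53"], ["CJ-3-1A", "CJ-3-1"]]
lookup = build_alias_lookup(rows)

genomic_ids = ["AnHHF-34A", "AnHSZ-53A"]              # e.g. parsed from .gbk names
metabolomic_ids = ["AnHHF-34", "AnHSZ-53", "CJ-3-1"]  # e.g. parsed from column headers

print(sorted(common_strains(lookup, genomic_ids, metabolomic_ids)))
```

In this sketch CJ-3-1 is simply left out of the result because it has no genomic counterpart, which is the behaviour the filtering step is meant to have.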

The simplest way to fix this would probably be to have some extra debugging output added to this part of the code. Unfortunately it's not currently possible to have the Docker image use a local modified copy of the source code, so I've just pushed an updated image you can try instead. The only change you'll need to make is adding a new line at the beginning of your nplinker.toml file like this:

loglevel = "DEBUG"

This will cause the app to print out a lot more logging information as it goes through the loading process. At the end run the docker logs webapp > log.txt command again and upload the resulting file. Hopefully that will let us see what the problem actually is.

jeep3 commented 3 years ago

FBMN-GCF log.txt

andrewramsay commented 3 years ago

Did you run docker pull andrewramsay/nplinker:latest before saving the log file? I don't see the new log messages I added and the line numbers in the error message are still identical to the original output, so it seems this is still the same image version as before.

jeep3 commented 3 years ago

Sorry, I forgot to update; it's good now. BTW, should the log file output be added to this command line?

docker run --rm --name webapp -p 5006:5006 -v E:\FBMN-GCF:/data:rw andrewramsay/nplinker docker logs webapp > log.txt

andrewramsay commented 3 years ago

Nope, they're separate. If you try combining them you'll get some sort of error from Docker. You need to run the usual "docker run ..." command first, then either from the same terminal or a different one run the "docker logs ..." command.

(I will see if I can have the app generate a file with the log output to avoid having to do this manually in future)

jeep3 commented 3 years ago

Got it, thanks. Where can I find this log.txt file? The following is what I copied from the command box:

C:\Users\huali>docker logs webapp > log.txt
2021-01-05 15:14:58,773 Watching: /app/webapp/npapp/theme.yaml
2021-01-05 15:14:58,774 Watching: /app/webapp/npapp/templates/tmpl.basic.py.html
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/templates/tmpl.onclick.search.py.html
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/templates/tmpl.onclick.py.html
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/templates/tmpl.basic.search.py.html
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/templates/index.html
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/static/css/ChemDoodleWeb.css
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/static/css/npapp.css
2021-01-05 15:14:58,775 Watching: /app/webapp/npapp/static/css/bootstrap.min.css
2021-01-05 15:14:58,776 Starting Bokeh server version 1.3.5dev3+5.ge5c1c99e1.dirty (running on Tornado 6.0.4)
2021-01-05 15:14:58,781 User authentication hooks NOT provided (default user enabled)
2021-01-05 15:14:58,787 Bokeh app running at: http://localhost:5006/npapp
2021-01-05 15:14:58,787 Starting Bokeh server with process id: 1
2021-01-05 15:15:20,501 Error in server loaded hook KeyError('',)
Traceback (most recent call last):
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/server/contexts.py", line 174, in run_load_hook
    result = self._application.on_server_loaded(self.server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/application.py", line 193, in on_server_loaded
    h.on_server_loaded(server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/handlers/directory.py", line 203, in on_server_loaded
    return self._lifecycle_handler.on_server_loaded(server_context)
  File "/app/conda/envs/bigscape/lib/python3.6/site-packages/bokeh/application/handlers/lifecycle.py", line 81, in on_server_loaded
    return self._on_server_loaded(server_context)
  File "/app/webapp/npapp/server_lifecycle.py", line 473, in on_server_loaded
    nh.load()
  File "/app/webapp/npapp/server_lifecycle.py", line 263, in load
    self.load_nplinker()
  File "/app/webapp/npapp/server_lifecycle.py", line 285, in load_nplinker
    if not self.nplinker.load_data():
  File "/app/webapp/npapp/../../prototype/nplinker/nplinker.py", line 243, in load_data
    if not self._loader.load(met_only=met_only):
  File "/app/webapp/npapp/../../prototype/nplinker/loader.py", line 229, in load
    self._filter_strains()
  File "/app/webapp/npapp/../../prototype/nplinker/loader.py", line 249, in _filter_strains
    self.strains.filter(common_strains)
  File "/app/webapp/npapp/../../prototype/nplinker/strains.py", line 66, in filter
    self.remove(strain)
  File "/app/webapp/npapp/../../prototype/nplinker/strains.py", line 58, in remove
    del self._lookup[alias]
KeyError: ''

C:\Users\huali>

andrewramsay commented 3 years ago

It'll just be in the folder where you ran the command from - in this case you should find it at C:\Users\huali\log.txt

jeep3 commented 3 years ago

log.txt

jeep3 commented 3 years ago

It seems that the log.txt file is incomplete.

andrewramsay commented 3 years ago

No it's fine, this sheds a bit more light on things! There are a couple of different things going on here so I'll try to break it down a little.

Metabolomics strain names aren't being recognised correctly

There are lots of lines like this:

15:15:08 [WARNING] metabolomics.py:467, Unknown strain identifier: SDJY-95-1 (parsed from SDJY-95-1.mzXML Peak area)
15:15:08 [WARNING] metabolomics.py:467, Unknown strain identifier: SDPY-1 (parsed from SDPY-1.mzXML Peak area)
15:15:08 [WARNING] metabolomics.py:467, Unknown strain identifier: SichDY-5 (parsed from SichDY-5.mzXML Peak area)

The problem here is that your strain_mappings.csv file doesn't contain the identifiers like SDJY-95-1 which is what NPLinker is looking for. I can see you've added these identifiers but they're all in the form "SDJY-95-1.mzXML", and NPLinker strips the file extension off during parsing so they don't match up. If you do a search+replace on your strain mappings file to replace ".mzXML" with an empty string and then run the app again, you should see almost all of these "Unknown strain identifier" messages disappear.

At the moment this issue means it isn't recognising any metabolomic strains:

15:15:20 [DEBUG] loader.py:247, Filtering strains: genomics count 38, metabolomics count: 0
15:15:20 [DEBUG] loader.py:248, Common strains found: 0
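The suffix mismatch described in this section, and the suggested search+replace, can be illustrated with a short Python sketch. The parsing rule in `strain_from_column` is an assumption inferred from the warning messages above ("parsed from SDJY-95-1.mzXML Peak area"), and all function names here are hypothetical rather than NPLinker's own.

```python
SUFFIX = ".mzXML"


def strip_suffix(label, suffix=SUFFIX):
    """Remove the file extension from a label if it is present."""
    return label[: -len(suffix)] if label.endswith(suffix) else label


def strain_from_column(header):
    """Turn 'SDJY-95-1.mzXML Peak area' into 'SDJY-95-1' (assumed rule)."""
    filename = header.split(" ", 1)[0]
    return strip_suffix(filename)


def clean_mapping_line(line):
    """Apply the search+replace to every field of one mappings line."""
    return ",".join(strip_suffix(field) for field in line.split(","))


parsed = strain_from_column("SDJY-95-1.mzXML Peak area")

# Before the fix: the mappings file only knows the label with the extension.
print(parsed in {"SDJY-95-1.mzXML"})                 # False, so "Unknown strain identifier"
# After the fix: the cleaned label matches what was parsed from the header.
print(parsed in {strip_suffix("SDJY-95-1.mzXML")})   # True
print(clean_mapping_line("CJ-3-1A,CJ-3-1.mzXML"))
```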

Empty strain aliases

It still seems like there are empty labels for some strains, and this is what is causing the error you were seeing. Because there are no common strains found due to the above problem, it tries to start removing all strains during the filtering process I mentioned. In the log snippet here you can see it removes an empty alias for AnHHF-34A:

15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHHF-34A_c00007_ctg7_pi...region003.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: ""

then just after that it moves on to AnHSZ-53A which also apparently has an empty label:

15:15:20 [DEBUG] strains.py:50, Removing strain: Strain(AnHSZ-53A) [80 aliases]
15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHSZ-53A_c00001_ctg1_pi...region007.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHSZ-53A_c00005_ctg5_pi...region006.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHSZ-53A_c00004_ctg4_pi...region007.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHSZ-53A_c00011_ctg11_p...region001.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: "AnHSZ-53A_c00009_ctg9_pi...region004.gbk"
15:15:20 [DEBUG] strains.py:57, Removing strain alias: ""

Normally two strains wouldn't share a label, so the error is happening because the "empty" entry has already been removed at this point.
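The failure described in this section can be reproduced in a few lines. This is a sketch and not NPLinker's code: the `Strain` class and the helper functions are hypothetical stand-ins, assuming only what the traceback shows (a dictionary keyed by alias and a `del` on each alias during removal).

```python
class Strain:
    """Minimal stand-in for a strain with a set of aliases."""

    def __init__(self, name, aliases):
        self.name = name
        self.aliases = set(aliases)


def index_aliases(strains):
    """Build an alias -> strain dictionary; a shared alias is overwritten."""
    lookup = {}
    for strain in strains:
        for alias in strain.aliases:
            lookup[alias] = strain
    return lookup


def remove_strict(lookup, strain):
    """Delete every alias; raises KeyError if one is already gone."""
    for alias in strain.aliases:
        del lookup[alias]


def remove_tolerant(lookup, strain):
    """Delete every alias, ignoring ones that are already gone."""
    for alias in strain.aliases:
        lookup.pop(alias, None)


# Two strains that both picked up an empty alias from padded CSV fields.
strains = [Strain("AnHHF-34A", ["AnHHF-34A", ""]),
           Strain("AnHSZ-53A", ["AnHSZ-53A", ""])]

lookup = index_aliases(strains)
remove_strict(lookup, strains[0])          # removes "" along with the first strain
try:
    remove_strict(lookup, strains[1])      # "" is no longer in the dictionary
except KeyError as exc:
    print("KeyError:", repr(exc.args[0]))
```

Using `remove_tolerant` instead avoids the crash, but it only hides the symptom; keeping empty labels out of the lookup in the first place is the cleaner fix.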

Fixing things

I've updated the app to avoid it adding empty strain labels so you don't have to worry about that problem (just pull the image again!), but you will need to update your strain mappings to remove those .mzXML suffixes.

Also this probably isn't crucial but I would recommend replacing all the .gbk strain names in your mappings file. The first line would simply become:

AnHHF-34A,AnHHF-34

The first label would be matched to all the filenames like "AnHHF-34A_c00001_ctg1_pi...region002.gbk", and the second label would match the metabolomics data. The same process should work for the rest of the strains.

jeep3 commented 3 years ago

Many thanks, andrew, NPLinker server loading completed. log.txt

andrewramsay commented 3 years ago

Good, the log looks much better now!

Since it doesn't seem to be finding the files required for the "rosetta" scoring method, you could also disable this in the .toml file to speed up the loading process a little. If you want to do that just add a line at the beginning like this:

scoring_methods = ["metcalf"]

jeep3 commented 3 years ago

Thanks, Andrew, BTW, I saw from the wiki website that the current version of NPLinker is not available yet. When do you plan to update it? and can the "rosetta" scoring method be used?

andrewramsay commented 3 years ago

Was it this part of the wiki?

> Future versions of NPLinker will include other scoring methods, which can be combined in any desired combination by ticking/unticking the checkboxes under the "Selected scoring methods" heading.

That text isn't correct any more, I'll need to update it.

The app does already have support for multiple scoring methods. However at the moment there are only 2 available: Metcalf and the "rosetta" method. You should see checkboxes for both rosetta and metcalf scoring in the web interface at the moment if you still have rosetta enabled, although the rosetta method won't be giving any results based on your log file.

The rosetta method does already work but the problem is it requires more than the standard set of files in your dataset folder. On the metabolomics side you don't need anything extra, but it also relies on parsing the knownclusterblast text files generated by antiSMASH and from your log it isn't finding those at the moment. It currently expects to find them in a "knownclusterblast" folder inside the "antismash" folder of your dataset, one .txt file for each .gbk.

As an example from one of the datasets we tested it with, there are some .gbks with paths like this:

nplinker_shared/dataset_name/antismash/ABC11/ABC11.Scaffold_3.region001.gbk
nplinker_shared/dataset_name/antismash/ABC11/ABC11.Scaffold_5.region001.gbk

Then the corresponding text files are named:

nplinker_shared/dataset_name/antismash/ABC11/knownclusterblast/ABC11.Scaffold_3_c1.txt
nplinker_shared/dataset_name/antismash/ABC11/knownclusterblast/ABC11.Scaffold_5_c1.txt

So it's these text files which must be parsed for the rosetta scoring method to give any results. The series of messages at the end of your last log file show it trying to find these files and failing, e.g.:

Failed to find knownclusterblast file: "/data/antismash/knownclusterblast/LNCT-4A_c00004_ctg4_pi.._c10.txt"

If you don't have these files there's nothing else you can do. If you do have them you could try creating the "knownclusterblast" folder inside the existing "antismash" folder and copying them all in there and then try again. I suspect there might be some more tweaks required to ensure the filenames are matched up to the .gbk files correctly but I don't know for sure.
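To see which text files would be expected for a given set of .gbk files, something like the following Python sketch may help. The naming rule is an assumption inferred only from the single ABC11 example above (a "knownclusterblast" folder next to the .gbk, and ".region001.gbk" becoming "_c1.txt"); it is not taken from NPLinker's source and may well need the tweaks mentioned. The function name `expected_kcb_path` is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed filename pattern: <prefix>.region<NNN>.gbk
REGION_RE = re.compile(r"^(?P<prefix>.+)\.region(?P<num>\d+)\.gbk$")


def expected_kcb_path(gbk_path):
    """Return the assumed knownclusterblast .txt path for a .gbk path.

    Returns None if the filename does not follow the assumed pattern.
    """
    gbk = PurePosixPath(gbk_path)
    match = REGION_RE.match(gbk.name)
    if match is None:
        return None
    # Assumption: region001 maps to _c1, region010 to _c10, and so on.
    name = f"{match['prefix']}_c{int(match['num'])}.txt"
    return gbk.parent / "knownclusterblast" / name


example = "nplinker_shared/dataset_name/antismash/ABC11/ABC11.Scaffold_3.region001.gbk"
print(expected_kcb_path(example))
```

With real files on disk, the same idea works with `pathlib.Path` and `.exists()` to list which expected text files are actually missing.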

jeep3 commented 3 years ago

Amazing, you are right, I can see checkboxes for both rosetta and metcalf scoring in the web interface at the moment. But now the rosetta method does not work. Hence, I need to prepare these files "ABC11.Scaffold_3_c1.txt". How to convert ABC11.Scaffold_3.region001.gbk to ABC11.Scaffold_3_c1.txt? Thanks, Andrew, I can't wait to try it.

andrewramsay commented 3 years ago

This is where my limited knowledge of tools like antiSMASH becomes a problem - I was under the impression that it could generate these files along with the others it typically produces if you configure it correctly. But I've never had to do this myself and so I don't know the details of how you would go about doing so.

As far as I know you can't take an existing .gbk file and generate the .txt from it, because the information the .txt contains isn't part of the .gbk. I think you have to start with the output from a single antiSMASH run containing all the required files.

Looking at the antiSMASH webservice page it seems to have a "Knownclusterblast" option you can enable which is probably what's required, although I don't know the equivalent if you're running it locally instead of using the webservice.

If you have anyone in your lab who is familiar with this stuff I would check with them, otherwise you could try running it again with that option enabled and see if it produces the files once it's done.

jeep3 commented 3 years ago

I just checked some antismash output files, and there is indeed a knownclusterblast folder in it. Thank you very much for this discussion, which gave me a better understanding of NPLinker.

CunliangGeng commented 2 years ago

I assume this issue has been solved. Please reopen it if not.
