NPLinker / nplinker

A python framework for data mining microbial natural products by integrating genomics and metabolomics data
https://nplinker.github.io/nplinker
Apache License 2.0

Rosetta Score issue #39

Closed jeep3 closed 2 years ago

jeep3 commented 3 years ago

Hi Andrew, I want to use the Rosetta score to detect some links. Metcalf scoring is working (for example, Metcalf score values of 2.3 to 3.2), but I found that Rosetta scoring is still not working after switching scoring methods. Do you think my configuration of the antismash folder is correct? (A screenshot can be found below.) Since the output files I generated on the server with antiSMASH 5 do not include the strain name at the front of the .gbk file names, I added the strain name to the front of all .gbk files so that NPLinker can identify and parse them. In addition, I also added the strain name to the front of all *.txt files in the knownclusterblast folder. Or do I need to do some other preparatory work? Thanks a lot!

Best wishes!

andrewramsay commented 3 years ago

It sounds like it may be a problem with the way that the knownclusterblast filenames are derived from the .gbk filenames. I haven't looked at that part of the code for a long time and it may need to be updated to account for some of the more recent changes that have been made.

I'll do some testing with it and add some more debug output too so it's easier to check what's going on from the log data.
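For reference, a rough sketch of how that .gbk-to-knownclusterblast filename derivation might look (assuming antiSMASH 5's usual <record>.regionNNN.gbk region files and knownclusterblast/<record>_cN.txt outputs; the function and regex below are only a sketch, not the actual NPLinker code):

```python
import os
import re

def kcb_path_for_gbk(gbk_path):
    """Sketch: map a region .gbk file to the knownclusterblast .txt it should
    correspond to. Assumes names like <record>.region001.gbk map to
    knownclusterblast/<record>_c1.txt; real datasets may differ, which is
    exactly the kind of mismatch being debugged in this issue."""
    folder, name = os.path.split(gbk_path)
    m = re.match(r"(?P<record>.+)\.region(?P<num>\d+)\.gbk$", name)
    if m is None:
        return None  # unexpected naming; caller should log and skip
    txt_name = "{}_c{}.txt".format(m.group("record"), int(m.group("num")))
    return os.path.join(folder, "knownclusterblast", txt_name)
```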

andrewramsay commented 3 years ago

In addition, I also added the strain name to the front of all *.txt files in the knownclusterblast folder. Or do I need to do some other preparatory work?

Do you have a screenshot / list of the original .txt filenames you had before doing this? Ideally we'd want NPLinker to parse everything without requiring manual renaming, so it'd help to know what it might have to deal with as there seem to be different naming conventions in different datasets.

andrewramsay commented 3 years ago

I've updated the Docker image so you can pull the latest version now.

Before you run it, can you:

I've adjusted the code so that it will try to load the knownclusterblast data first instead of the metabolomics data. If it doesn't find the knownclusterblast files it should abort the process immediately so it should be faster to check if it's working or not.

If it seems to work straight away that'd be great, but if not just upload the latest log.txt and we can go from there.

andrewramsay commented 3 years ago

It looks like it's successfully found the knownclusterblast .txt files this time so that part worked!

The new problem is happening later on when it is gathering all the parsed data together. It is crashing because it's trying to divide by the length of a list which should have at least 1 item in it but is apparently empty. I haven't seen this happen before and I don't know if it's a badly formatted file or something else that's preventing it from parsing correctly.

At the moment there's no way to tell which file caused the error, but I've pushed an update to the Docker image which should print a log message when it detects this situation. Can you try pulling the new image and sending me another log from running it again? That should show us which file(s) it is failing to parse and then we can try to figure out what is wrong with them.

andrewramsay commented 3 years ago

OK, unfortunately it looks like there are quite a lot of these problematic files, I was expecting only a couple. If you look at the log, every line like this indicates an instance of the problem:

13:25:23 [DEBUG] kcb.py:147, KCBParser failed to extract any MiBIG genes from file /data/antismash/JXZS-118-5B/knownclusterblast/JXZS-118-5B_c00001_ctg1_pi.._c3.txt, BGC ID BGC0001446

Could you send 2 or 3 of the files that are mentioned in all of these messages? So for the message above it would be [dataset folder]/antismash/JXZS-118-5B/knownclusterblast/JXZS-118-5B_c00001_ctg1_pi.._c3.txt and so on. It doesn't matter which particular files, just pick a few different ones at random.

jeep3 commented 3 years ago

[screenshot]

In addition, Andrew, I would like to ask a question. This summary file exists in every antiSMASH folder. Do I need to remove it? In fact, the .gbk files in the antismash folder and the .txt files in the knownclusterblast folder from the antiSMASH 5 output do not contain the strain name prefix, which I added in batch later. Is this operation necessary? It seems that if the *.gbk files are not prefixed with the strain name, NPLinker cannot parse them.

justinjjvanderhooft commented 3 years ago

Could you add the "knownclusterblastoutput.txt" for some strains?

jeep3 commented 3 years ago

Thanks, Justin. I will first correct the file names and see if it works.

andrewramsay commented 3 years ago

Sure, sorry, I didn't see the message in time. I think I know what the problem is; the file names are probably wrong. JXZS-118-5B_JXZS-118-5B_c00001_ctg1_pi.._c1.txt JXZS-118-5B_c00012_ctg12_p.._c2.txt [screenshot]

I don't think the filenames are the problem here. Or at least they aren't the whole problem. If some of them are wrong after you renamed them, the best thing to do is probably remove the renamed files and copy in the originals again so everything is consistent. I think they should be handled correctly this time.

But there's a line in the log which records how many .gbks were matched up with a .txt successfully:

13:25:47 [INFO] rosetta.py:166, GenBank files matched with knownclusterblast results: 1878/1902

That shows almost all of them were apparently OK. There are 24 files that it failed to match up, and 24 instances of this type of message in the log:

13:25:43 [WARNING] rosetta.py:151, KCBParser failed to find file "/data/antismash/HNCD-7A/knownclusterblast/HNCD-7A_c00001_ctg1_pi.._c1.txt"

which makes sense.

Both of those JXZS-118-5B... files you attached are formatted correctly; they don't contain any hits, but that's OK and they parse as expected. I don't see either of them mentioned in the log as having problems either.

What I was looking for is some of the files that are listed with the "failed to extract any MiBIG genes" messages. There might be multiple messages linked to the same file but there still seem to be a few hundred of them. Some of the files I'd be interested in seeing are:

/data/antismash/GDZJ-105-2A/knownclusterblast/GDZJ-105-2A_c00007_ctg7_pi.._c3.txt
/data/antismash/HBHA-134-8A/knownclusterblast/HBHA-134-8A_c00004_ctg4_pi.._c3.txt
/data/antismash/HNCD-7A/knownclusterblast/HNCD-7A_c00008_ctg8_pi.._c3.txt
/data/antismash/SichGA-2A/knownclusterblast/SichGA-2A_c00004_ctg4_pi.._c4.txt

If you can pick out 1 or 2 of those that'd be very helpful because there must be something different about the way they are formatted that is breaking the parser we have.


In addition, Andrew, I would like to ask a question. This summary file exists in every antiSMASH folder. Do I need to remove it? In fact, the .gbk files in the antismash folder and the .txt files in the knownclusterblast folder from the antiSMASH 5 output do not contain the strain name prefix, which I added in batch later. Is this operation necessary? It seems that if the *.gbk files are not prefixed with the strain name, NPLinker cannot parse them.

I think you can just leave those files in place. You can remove them if you prefer but it shouldn't break anything to have them there.

andrewramsay commented 3 years ago

Great, I can see the problem now. For some of the "Details" sections in these .txt files, there are no entries in the section that the Rosetta code usually parses. This is causing the error at the end of the log because the length of the list of table entries is zero. So for example in GDZJ-105-2A_c00007_ctg7_pi.._c3.txt, the first "Details" section is:

1. BGC0001445
Source: leporin B
Type: NRP + Polyketide:Iterative type I
Number of proteins with BLAST hits to this cluster: 10
Cumulative BLAST score: 15649

Table of genes, locations, strands and annotations of subject cluster:

Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
evm.TU.ctg7_pilon_pilon_pilon.237   AFLA_066840 99  7695    100.45824847250509  0.0
evm.TU.ctg7_pilon_pilon_pilon.238   AFLA_066850 95  171 94.79166666666666   4.9e-43
evm.TU.ctg7_pilon_pilon_pilon.239   AFLA_066860 100 1134    80.1418439716312    0.0
evm.TU.ctg7_pilon_pilon_pilon.240   AFLA_066880 85  746 110.72261072261071  2.8e-215
evm.TU.ctg7_pilon_pilon_pilon.241   AFLA_066890 99  1008    100.0   4.7e-294
evm.TU.ctg7_pilon_pilon_pilon.242   AFLA_066900 98  1626    102.09617755856965  0.0
evm.TU.ctg7_pilon_pilon_pilon.243   AFLA_066910 99  705 100.0   3.5e-203
evm.TU.ctg7_pilon_pilon_pilon.244   AFLA_066920 100 740 100.0   9.8e-214
evm.TU.ctg7_pilon_pilon_pilon.245   AFLA_066930 100 1049    93.88888888888889   1.9e-306
evm.TU.ctg7_pilon_pilon_pilon.246   AFLA_066940 100 775 100.0   3e-224

The code expects to find entries in the "Table of genes, locations, strands and annotations of subject cluster:" table, and in this case it's empty. The next BGC in the same file is fine though, it has a single entry there:

2. BGC0001254
Source: ACT-Toxin II
Type: Polyketide
Number of proteins with BLAST hits to this cluster: 1
Cumulative BLAST score: 2103

Table of genes, locations, strands and annotations of subject cluster:
BAJ14522.1  BAJ14522.1  83  7456    +   polyketide_synthase

Table of Blast hits (query gene, subject gene, %identity, blast score, %coverage, e-value):
evm.TU.ctg7_pilon_pilon_pilon.237   BAJ14522.1  47  2103    63.365580448065174  0.0

I'm not sure what to do about this since I didn't write this part of the code and I'm not very familiar with the intricacies of antiSMASH and knownclusterblast (e.g. I don't know what it means that the "Table of Blast hits" is populated when the other one is empty).

@sdrogers, should I just modify the code to skip over cases like this or is there an alternative way the parsing could be done?

jeep3 commented 3 years ago

Great, I'm impressed you found the root of the problem so quickly.

justinjjvanderhooft commented 3 years ago

@grimur do you have a suggestion? @andrewramsay I would for now skip over the "empty" entries to make it work for those that are correct, but I think Grímur wrote this code, and I don't exactly know what piece of information gets parsed from here....

andrewramsay commented 3 years ago

@andrewramsay I would for now skip over the "empty" entries to make it work for those that are correct, but I think Grímur wrote this code, and I don't exactly know what piece of information gets parsed from here....

OK, I've updated the Docker image to do that - there will still be a warning message in the log each time it happens but it should remove the affected objects so that the divide-by-zero exception doesn't occur.
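For illustration, the skip could look something like the minimal sketch below (the entry and attribute names are assumptions for illustration, not the real parser objects):

```python
def drop_empty_mibig_entries(entries, filename, logger):
    """Sketch of the guard described above: if a 'Details' hit has an empty
    'Table of genes, locations, strands and annotations of subject cluster'
    section, skip it (with a warning) instead of later dividing by the
    length of an empty gene list."""
    kept = []
    for entry in entries:
        # entry.genes / entry.bgc_id are assumed attributes of a parsed hit
        if not entry.genes:
            logger.warning("KCBParser: no MiBIG genes for %s (BGC %s), skipping",
                           filename, entry.bgc_id)
            continue
        kept.append(entry)
    return kept
```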

grimur commented 3 years ago

Hi -

The known cluster blast parser is Simon’s, but as far as I can see, this is due to antiSMASH returning a malformed output file: normally, this list should contain all the genes in the MIBiG entry being referenced, but for some reason, it is occasionally empty.

We can see that antiSMASH does at some point have the genes in the MIBiG entry, because it reports BLAST hits for some of them, but I don’t know why it doesn’t output that list.

Are you using the antiSMASH results from antiSMASH DB, or running it yourself? If the latter, do you get any error messages? It might be worth finding an assembly that has a BGC with an empty MIBiG hit and see if antiSMASH alerts us about this (I can see this happen in my data as well, but when I was running antiSMASH, I tended to do so in batches and not capture the logs…)

I’m presuming the error occurs when we’re computing the % of the BGC which matches the MIBiG BGC. As far as I can tell, we have two options: return a sensible error message when this list is empty (something like ‘can’t determine coverage’) or parse the list of genes from MIBiG on our own. The latter would require MIBiG to be baked into NPLinker - which it isn’t, right?

In any case, if this is from antiSMASH DB - or if we can verify that we aren’t doing anything seriously wrong - I think this calls for a bug report to the antiSMASH people...

… that being said, I think we might want to consider changing the KCB parser to use the json results, as discussed here: https://github.com/antismash/antismash/issues/291

Best regards,

-grímur

justinjjvanderhooft commented 3 years ago

@jeep3 I think you ran antiSMASH on all the genomes yourself, right? Could you rerun some of the ones giving problems and check if any error pops up? @grimur @sdrogers yes, the text format seems to be the legacy format, so switching over to the JSON format seems to be the logical option to prevent more things falling over....

andrewramsay commented 3 years ago

OK, I guess what I'll need to do is add support for parsing the JSON files first while keeping the text parser around as a fallback for older datasets.

I just picked one of the JSON files at random from a dataset I have here and it covers some of the files which have missing output, so that's helpful for checking whether I can find the missing info in the JSON correctly. From a quick look through, things seem promising... I think I can see how to extract everything that is normally parsed from the text version, including the missing lines from the table!

I'll take a better look at it during next week and let you know when I've got NPLinker updated.

justinjjvanderhooft commented 3 years ago

Awesome @andrewramsay - I think that would be the best solution indeed. Those JSON files are included in the standard output for all samples, right?

andrewramsay commented 3 years ago

Just a quick update: still working on this and making progress, but there are more cases to handle than I thought at first so it might be another day or two before I have a new Docker version you can try out.

grimur commented 3 years ago

Ok, thanks for the info. If you take a look at the known cluster blast output files, is there still the empty-table problem? The files are the .txt files in the knownclusterblast folder, and the empty table that's causing trouble is "Table of genes, locations, …" for any of the hits in the "Details" section.

On 9 Feb 2021, jeep3 wrote:

Good job, many thanks, Andrew. I re-ran 5 strains using antiSMASH 5.1.2 and there seem to be no errors. There is only one warning, but it looks OK. I think one possible reason (for the blank values) is that in the past I only used FASTA input to antiSMASH; I should use FASTA+GFF input instead. I'll try it as soon as possible and then give you feedback. Thanks again. antismashlog.txt: https://github.com/sdrogers/nplinker/files/5951878/antismashlog.txt

andrewramsay commented 3 years ago

I think I've finally got this working enough that you can try it, there's a new Docker image available now.

It will check for a .json file in each antiSMASH folder as the primary source of knownclusterblast info, then fall back to the text files if the .json file isn't available.

I've tested it with a couple of datasets including the Crusemann one (which has a mix of JSON and text) and it seems OK, but as usual just include the log output if/when you run into any problems.
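Roughly, the loading order is the one sketched below (parse_json and parse_txt are placeholders for the real parsers; only the JSON-first, text-fallback ordering is the point here):

```python
import glob
import os

def load_kcb_results(antismash_dir, parse_json, parse_txt, logger):
    """Sketch of the loading order described above: prefer the antiSMASH
    .json results file in a strain's folder, and fall back to the legacy
    knownclusterblast/*.txt files when no JSON is present."""
    json_files = glob.glob(os.path.join(antismash_dir, "*.json"))
    if json_files:
        logger.debug("Using JSON results file: %s", json_files[0])
        return parse_json(json_files[0])

    txt_files = glob.glob(os.path.join(antismash_dir, "knownclusterblast", "*.txt"))
    logger.debug("No JSON found, falling back to %d text files", len(txt_files))
    return [parse_txt(f) for f in txt_files]
```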

jeep3 commented 3 years ago

Nice, I'm preparing a larger dataset to try, and then I'll give you feedback. Thanks, Andrew.

jeep3 commented 3 years ago

Hi Andrew, a new error occurred when NPLinker was loading data. Can you help find out what went wrong? Thanks.

jeep3 commented 3 years ago

For example, I don't know what to adjust based on 'TypeError: list indices must be integers or slices, not str'.

andrewramsay commented 3 years ago

Looks like a bug, not anything you did wrong. Probably something to do with the new JSON parsing code I added. I'll see if I can get it to produce the same error here.

jeep3 commented 3 years ago

Ok, if you need any information, please feel free to tell me.

andrewramsay commented 3 years ago

I think I've found the problem, there was a mismatch in the structure of the data returned by the JSON parser compared with the text parser. There's a new Docker image available which should fix it.

jeep3 commented 3 years ago

Excuse me, I don't know what happened; the text got bigger.

andrewramsay commented 3 years ago

I think this is one of the downsides of relying on a browser-based interface, there can be a lot more overhead than in a native application. From the screenshot that looks like a much larger dataset (in terms of the number of BGCs/spectra) than any of the ones I've had access to so far which probably explains some of the slowness. I thought Crusemann was on the larger end and it only has a fraction of that amount.

What sort of hardware are you currently running it on? If it's not particularly fast and doesn't have a lot of RAM it might struggle to load everything.

There are a couple of things you can try:

andrewramsay commented 3 years ago

One other thing I just remembered: when we were emailing about a different problem back in December, I think you had asked about adding a way to filter out some parts of the dataset while it was being loaded. I added an issue for this (#33) but had forgotten about it until now.

Would having that feature be useful for this large dataset? You could write a list of strain IDs into a file and NPLinker would discard all objects that didn't contain or weren't linked to one of those strains? (I'm not sure what would make the most sense here)

andrewramsay commented 3 years ago

Let me split this reply in two:

Docker/server stuff

The URL thing can be a bit confusing, so I should clarify:

You can see in the log that it is blocking connections that don't match the configured URL:

2021-02-17 12:02:02,728 Refusing websocket connection from Origin 'http://localhost:5006'; use --allow-websocket-origin=localhost:5006 or set BOKEH_ALLOW_WS_ORIGIN=localhost:5006 to permit this; currently we allow origins {'thornton.bioinformatics.nl:5006'}

You can have multiple allowed URLs though, so I will tweak the Docker image so that localhost:5006/npapp will always work in addition to any server URL that's been configured.

Performance problems

From the log I can see it's having a lot of problems even on the server, e.g.

11:33:01 [INFO] methods.py:420, MetcalfScoring.setup (bgcs=18333, gcfs=422, spectra=22837, molfams=10745, strains=234
11:39:51 [INFO] methods.py:427, MetcalfScoring.setup completed

So that's taking over 6 minutes just to run the initial pass of Metcalf scoring, which is normally very fast - it takes seconds for the Crusemann data!

Maybe I need to reduce the number of samples.

Did you see my question about allowing filtering of the dataset, so you could effectively select a subset of the whole thing and only load that particular part into NPLinker? This would give you control over how many objects you have loaded. If that's something you'd like to try, let me know.

justinjjvanderhooft commented 3 years ago

Thanks @andrewramsay!

Regarding performance issue --> does it always calculate all Metcalf scores? Trying a threshold of 10 may work to further trim the results down a bit? I think @jeep3 will be interested in knowing how to pass on a selection of the samples as that means that we do not need to repeat the original setup with all the files and we could make different selections.

Regarding server and local browser --> @jeep3 to tunnel the local host from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl

andrewramsay commented 3 years ago

Regarding performance issue --> does it always calculate all Metcalf scores? Trying a threshold of 10 may work to further trim the results down a bit?

Yes - it's using the code Florian wrote a couple of years ago, which builds co-occurrence matrices and uses them to precalculate the scores for all combinations of objects. His code was much, much faster than the original implementation, which is why I'm surprised it's taking so long here. The score data it generates could actually be cached quite easily; I hadn't done that before because it was always fast enough that it didn't really matter. I'll add that in soon.
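For what it's worth, the caching could be as simple as the sketch below (the cache file name and the compute_scores callable are placeholders, not the actual NPLinker API):

```python
import os
import numpy as np

def load_or_compute_metcalf(cache_path, compute_scores):
    """Sketch of the caching idea: reuse the precalculated Metcalf score
    matrix from a previous run if it exists, otherwise run the slow setup
    once and save the result (cache_path should end in .npy)."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    scores = compute_scores()   # the step that took ~6 minutes on the big dataset
    np.save(cache_path, scores)
    return scores
```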

I think @jeep3 will be interested in knowing how to pass on a selection of the samples as that means that we do not need to repeat the original setup with all the files and we could make different selections.

OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week.

I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast.

justinjjvanderhooft commented 3 years ago

Justin, I didn't catch your meaning: "tunnel the local host from the server to your laptop, the command will look like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl". You mean on the server shell I should type ssh -L 5006:thornton.bioinformatics.nl? Because at this point, NPLinker occupies the command window on my laptop.

@jeep3 if you use a cmd shell and perform the ssh command, you can then open the localhost page on your own laptop (whilst keeping the connection open).

justinjjvanderhooft commented 3 years ago

Yes - it's using the code Florian wrote a couple of years ago, which builds co-occurrence matrices and uses them to precalculate the scores for all combinations of objects. His code was much, much faster than the original implementation, which is why I'm surprised it's taking so long here. The score data it generates could actually be cached quite easily; I hadn't done that before because it was always fast enough that it didn't really matter. I'll add that in soon.

Okay @andrewramsay that makes sense then - this is quite a large dataset (pushing boundaries....) - then raising the threshold to a Metcalf score of 10 will not reduce the calculation time, but it may speed up the rendering in the browser?

justinjjvanderhooft commented 3 years ago

OK, what would be the most sensible way of doing this? A whitelist of strains? Separate whitelists for genomics and metabolomics objects? If I know what to add I can try to get that done this week.

I might also make a start on a native version of the webapp, although that'd be a longer term thing. Some of the problems (like the Metcalf scoring) are not being caused by running things in the browser, but I still think trying to stuff tens of thousands of objects into those HTML-based tables is going to remain very slow even if most other things are fast.

@andrewramsay yes, it will probably remain slow, but the key thing is that it will at least work (at reasonable speed) once loaded.

And I think a selection could be best done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolic data files (and has strain labels?!), and now we make a subset of that to load into the web app (and calculate the scores). @jeep3 what do you think?

jeep3 commented 3 years ago

And I think a selection could be best done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolic data files (and has strain labels?!), and now we make a subset of that to load into the web app (and calculate the scores)

Same there.

justinjjvanderhooft commented 3 years ago

Sorry, I'm a little confused. This command is run in the cmd shell of my laptop, right? When NPLinker was loaded, I opened the website (http://thornton.bioinformatics.nl:5006/npapp) in the browser on my laptop. As you say, at the moment the cmd shell is occupied by NPLinker. At this point, should I open a new cmd shell to input ssh -L 5006:thornton.bioinformatics.nl?

Indeed, you need a new cmd shell, but the tunnelling command should be something like this: ssh -L 5006:thornton:5006 myers.bioinformatics.nl

andrewramsay commented 3 years ago

Okay @andrewramsay that makes sense then - this is quite a large dataset (pushing boundaries....) - then raising the threshold to a Metcalf score of 10 will not reduce the calculation time, but it may speed up the rendering in the browser?

Exactly, anything that reduces the number of objects should make the rendering process a bit faster.

And I think a selection could be best done on either the strain labels (pointing to both genomics and metabolomics items) or a list of the genomics strains. Since the objects are linked, do we also need to add the metabolomics items? I am assuming that all the data is in the relevant place, including the file that links the genomic and metabolic data files (and has strain labels?!), and now we make a subset of that to load into the web app (and calculate the scores)

Assuming the strains are all correctly identified in strain_mappings.csv I think either way should work and there wouldn't be any need to add entries for the metabolomics items manually. As long as there is a set of user-supplied strains it can filter out both metabolomics and genomics objects which don't contain at least one of the strains in the set.

I'll push an update to the Docker image once I've finished this (and added the Metcalf scoring caching too).

andrewramsay commented 3 years ago

A quick question about the strain filtering feature...

Currently NPLinker has a built-in filtering step which takes the combined set of strain IDs parsed from the genomics & metabolomics data and trims it down to include only those which are common to both (this doesn't remove any BGC/GCF/Spectrum/MolFam objects, just reduces the number of strains).

How should this work when you're also supplying a list of strains you explicitly want included? There are a few potential ways to handle it and I'm not sure what would make most sense:

  1. Do the common strains step as usual, then do the "include only these strains" step?
  2. Do the "include only these strains" step, then remove the non-common strains?
  3. Do the common strains step as usual, unless there is a set of "include these strains only" supplied by the user. If so, then only do that part and skip the common strains step?

Any suggestions?

jeep3 commented 3 years ago

Cool, I prefer option 3.

justinjjvanderhooft commented 3 years ago

Good question @andrewramsay - I will touch base with @grimur tomorrow so I will shortly discuss it with him. To me, if we store the results of the original "common strains" step, 1 makes most sense as you can then supply the "include only these strains" step and easily make another selection based on the same initial matching. If those initial matching results are not stored, then option 3 makes sense to me.

grimur commented 3 years ago

After discussing with @justinjjvanderhooft I agree that option 1 would make most sense, as long as it's a display issue, and not an issue of keeping the strains in memory. The problem with option 3, as I see it, is that you really don't want to do e.g. strain correlation scoring unless you actually have genomic and metabolomic info for all the strains in your analysis. I think we should always restrict analysis to shared strains, or at least make it so that it takes conscious effort to introduce asymmetry.

justinjjvanderhooft commented 3 years ago

Thanks @grimur for summarizing that - to add to this, if NPLinker caches the input data and IOKR scores, it makes sense to do that once (assuming that the size of the dataset mainly poses issues on the HTML rendering of the visualization and not on the backend calculations). Furthermore, BiG-SCAPE results will also differ for different amounts of input data, so doing that only once on the complete matchable set makes most sense to us.

andrewramsay commented 3 years ago

After discussing with @justinjjvanderhooft I agree that option 1 would make most sense, as long as it's a display issue, and not an issue of keeping the strains in memory.

@grimur sorry to keep asking questions, but I'm not sure I understand what you're getting at here? My intention was to take this set of user-supplied strains, then go through the lists of BGC/Spectrum objects and remove any that didn't include any of those strains (and then in turn delete any GCFs/MolFams that ended up empty). The issue with rendering the UI that Justin mentions is not (directly) due to the number of strains involved, it's the numbers of the other object types. There should be no problem in keeping all the strains in memory since they're not listed in one of the tables like the BGCs/Spectra/GCFs/MolFams. But if I keep all the strains, then the number of other objects isn't going to change and that's the bigger problem...

Thanks @grimur for summarizing that - to add to this, if NPLinker caches the input data and IOKR scores, it makes sense to do that once (assuming that the size of the dataset mainly poses issues on the HTML rendering of the visualization and not on the backend calculations). Furthermore, BiG-SCAPE results will also differ for different amounts of input data, so doing that only once on the complete matchable set makes most sense to us.

I'd also really like to clarify some points touched on here in case I've been making some very wrong assumptions. This is a summary of the steps NPLinker would normally take when loading the webapp:

  1. (if needed) retrieve remote genomics + metabolomics data when using a platform dataset
  2. (if needed) run BiG-SCAPE on the antiSMASH data (only done once)
  3. parse strain_mappings.csv to create a list of valid strain IDs/labels
  4. parse all the data files and build full lists of BGC/GCF/Spectrum/MolFam objects (this includes another filtering step I'd forgotten about which removes GCFs that only contain MiBIG BGCs)
  5. filter the original list of strains down to the common set between genomics+metabolomics. This only removes strain objects, not any of the other types, although it does modify the GCFs and Spectra by updating their lists of strains
  6. do the initial pass of Metcalf scoring with the configured threshold. This will be run on the full set of objects that exists at this point (and the results will be cached starting from the next version, this was the step taking 6 minutes on your big dataset)
  7. finally there's some further preprocessing done to establish the links between the various types of object in the format that the code behind the tables expects, but no object modification/deletion is involved

My concern is that I was planning to put this extra filtering step between 5 and 6 in that list. It would involve removing objects from memory, so they would be excluded from the initial Metcalf pass and any other user-initiated scoring done via the browser interface.

But now I'm thinking this might not be very helpful if it messes up things like the strain correlation scoring Grimur mentioned by pruning the number of objects available.

Am I missing something? I don't think this will be a lot of work to implement once I'm clear on what needs to be done, I'm just keen to avoid a situation where someone ends up wasting a lot of time because they didn't realise they weren't working with the set of objects they thought they were!

justinjjvanderhooft commented 3 years ago

Thanks @andrewramsay for the clear overview - I think we can put the additional filtering step between 5 and 6 as the GCFs and MFs do not change anymore - just the number of their members will fluctuate (and indeed the actual Metcalf score, and the other scores as a result of that - something to be aware of....) if you restart the analysis on the same data but with a different subset (all the spectral and BGC objects are still stored) - @grimur could you confirm that this would work for the calculation of your new scores as well? This subsetting is trickier than I thought it would be....

grimur commented 3 years ago

@andrewramsay not at all, ask away - I'm not even sure myself if what I'm saying makes sense. And thanks for the clarification.

Is there anything to gain by keeping the strains in memory, then? The reason @justinjjvanderhooft and I had for keeping them around was some vague notion that it would be easier to redo the analysis for a different subset, but that's perhaps a premature optimisation and simpler to just change the subset list and reload the whole thing.

Putting the filtering step between steps 5 and 6 makes perfect sense, just as long as any computation of scores is done after any filtering steps that might remove strains from the population.

I'm wondering, though - would swapping steps 4 and 5 make sense - to only load the strain IDs in the final set of IDs, instead of loading all the strains, creating all the objects and deleting them afterwards if the strain is excluded? So the order of the steps would be 1, 2, 3, 5, user filtering, 4, 6, 7?

andrewramsay commented 3 years ago

Thanks for the responses, that helps make things clearer for me! I've been really busy with other things this week but will try to get an updated Docker image ready as soon as I can find some time.

I'm wondering, though - would swapping steps 4 and 5 make sense - to only load the strain IDs in the final set of IDs, instead of loading all the strains, creating all the objects and deleting them afterwards if the strain is excluded? So the order of the steps would be 1, 2, 3, 5, user filtering, 4, 6, 7?

I can see how this could work, but I think the problem as things stand is that there's no way for NPLinker to do step 5 before parsing the various data files in step 4, because it relies on constructing the objects to discover which strains occur where. The list of strain IDs it reads from strain_mappings doesn't have any context; it's just "here's a strain ID and possibly various aliases for it". Until it goes through the files and finds where those IDs occur, it doesn't know which are found in genomics/metabolomics/both types of objects.

It should already skip over creating objects which have strain IDs that don't appear in strain_mappings.csv so that should help to limit the number of objects created+deleted at least a little.

The implementation of step 5 is just a few lines here in case there's an obvious alternative way to do it. I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains? Don't see why that wouldn't work, but it'd take a bit more time to rewrite the parsing code since it currently constructs the objects as it goes through the files line by line.
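As a sketch of that two-pass idea (the scan/build callables are placeholders for the existing parsing code, not functions that exist in NPLinker today):

```python
def two_pass_load(genomics_files, metabolomics_files, scan_strains, build_objects):
    """Sketch: first scan the raw files only for strain IDs, intersect the
    genomics and metabolomics sets (the step 5 filtering), then do the
    expensive object construction restricted to the surviving strains."""
    genomic_strains = scan_strains(genomics_files)        # cheap first pass
    metabolomic_strains = scan_strains(metabolomics_files)

    common = genomic_strains & metabolomic_strains        # step 5 before step 4
    # a user-supplied whitelist could also be applied to 'common' here

    bgcs_and_gcfs = build_objects(genomics_files, common)       # expensive pass
    spectra_and_molfams = build_objects(metabolomics_files, common)
    return bgcs_and_gcfs, spectra_and_molfams
```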

justinjjvanderhooft commented 3 years ago

--> I suppose it could parse the data files, extract strain IDs along with an indicator of the source, do the step 5 filtering based on that, and only then go back and create all the BGC/GCF/Spectrum/MolFam objects using the reduced set of strains?

@andrewramsay if that happens on the basis of the same initial BGC/GCF/Spectrum/MolFam assignments for different subsets a user selects, then it would at least provide "consistent" results and scores. Some GCFs may become (very) small.... and another thought occurred to me: we could let the user recalculate the scores based on the subset of strains. @grimur what are your thoughts? @andrewramsay in the end, it would be nice to get your view on how to make this process most robust to future influxes of datasets and possible novel scores to calculate.

andrewramsay commented 3 years ago

I've updated the Docker image with my first attempt at implementing this change. The key points are:

  1. The app now looks for an optional file called include_strains.csv in the dataset folder. This should contain a single column of strain labels as found in strain_mappings.csv (it doesn't matter which label you use for a particular strain, as long as it's in the mappings somewhere). Once the common strains filtering (step 5) is finished, the app will:
    • take the list of common strains and filter it down to only those given in this file
    • delete all BGC & Spectrum objects that don't include any of the remaining strains
    • delete all GCF & MolFam objects that now don't contain any BGCs or Spectra respectively
    • this all happens before any scoring takes place
  2. While I was testing the changes I realised it was easy to end up with a set of strains in include_strains.csv that doesn't overlap with the set of common genomic+metabolomic strains NPLinker collects. If that happens you get zero strains and other objects left and it has to abort. To make it easier to avoid this, the app now automatically writes a list of those common strains into a file called common_strains.csv, one strain label per line. My idea was that you'd run NPLinker once to get this list, then copy/paste it into include_strains.csv, delete the strains you didn't want to retain, then run NPLinker again so it could apply that extra filtering.
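In rough outline, the new step behaves like the sketch below (the object attributes and helper names are assumptions for illustration, not the actual NPLinker classes):

```python
import csv
import os

def apply_include_strains(dataset_dir, common_strains, bgcs, spectra, gcfs, molfams):
    """Sketch of the include_strains.csv filtering described above."""
    path = os.path.join(dataset_dir, "include_strains.csv")
    if not os.path.exists(path):
        return bgcs, spectra, gcfs, molfams     # the file is optional

    with open(path) as f:
        wanted = {row[0].strip() for row in csv.reader(f) if row}

    keep = common_strains & wanted
    if not keep:
        raise SystemExit("include_strains.csv has no overlap with the common strains")

    # drop BGC/Spectrum objects that contain none of the selected strains...
    bgcs = [b for b in bgcs if b.strains & keep]
    spectra = [s for s in spectra if s.strains & keep]

    # ...then drop GCFs/MolFams that no longer contain any BGCs/Spectra
    kept_bgcs, kept_spectra = set(bgcs), set(spectra)
    gcfs = [g for g in gcfs if kept_bgcs & set(g.bgcs)]
    molfams = [m for m in molfams if kept_spectra & set(m.spectra)]
    return bgcs, spectra, gcfs, molfams
```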

justinjjvanderhooft commented 3 years ago

Thanks @andrewramsay! I think this makes sense and can be easily explained to people. Same GCFs/MFs, but different scorings dependent on the subset.

@jeep3 could you test it at some point? You could make a selection from the common_strains.csv that covers the different clades and make a subset of 100 strains and see if that loads okay for example.