MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server
13 stars 22 forks source link

Validator Splash mismatch #384

Closed meowcat closed 12 months ago

meowcat commented 1 year ago

In a bunch of spectra we are getting a validation error about mismatching SPLASH. The For example:

08:11:46.105 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-001j-9000100300-840de07e19f6ae3d5b0e defined in record file, but splash10-001j-9000100300-153a9c6153253ef2a9a5 calculated from peaks.
08:11:46.106 ERROR massbank.cli.Validator - ACCESSION: MSBNK-EAWAG-ED046704
08:11:46.106 ERROR massbank.cli.Validator - ^
08:11:46.106 ERROR massbank.cli.RefreshDatabase - Error reading/validating record "/MassBank-data/recdata/MSBNK-EAWAG-ED046704.txt".

I tried different approaches related to rounding to find the possible cause, but could only reproduce the record splash, not the one calculated by the validator. Note that the two "summary" blocks are intact, just the actual hash doesn't match.

The online SPLASH calculator https://splash.fiehnlab.ucdavis.edu/ however gives the validator splash. I still think there's someting related to rounding or truncation going on here.

https://gist.github.com/meowcat/4f1838c7dc45d70cd9c340dab108549c

meier-rene commented 1 year ago

I can confirm your issue. It seems that the java and the R implementation produce different result. R -> splash10-001j-9000100300-840de07e19f6ae3d5b0e java -> splash10-001j-9000100300-153a9c6153253ef2a9a5 I made a quick test and used the peaklist from the record file without any rounding or type conversions and gave that to the java SPLASH code.

How should we continue?

package massbank;

import java.util.ArrayList;
import org.junit.jupiter.api.Test;
import edu.ucdavis.fiehnlab.spectra.hash.core.Spectrum;
import edu.ucdavis.fiehnlab.spectra.hash.core.Splash;
import edu.ucdavis.fiehnlab.spectra.hash.core.SplashFactory;
import edu.ucdavis.fiehnlab.spectra.hash.core.types.Ion;
import edu.ucdavis.fiehnlab.spectra.hash.core.types.SpectraType;
import edu.ucdavis.fiehnlab.spectra.hash.core.types.SpectrumImpl;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class SPLASHTest {

    String SPLASH1 = "splash10-001j-9000100300-153a9c6153253ef2a9a5";
    double[][] peakList1 = { 
            { 112.087, 2679171 },
            { 120.0808, 2070870 },
            { 164.1068, 32504938 },
            { 215.1177, 6590179 },
            { 223.1189, 6060016 },
            { 240.1454, 3325411 },
            { 243.1127, 13114276 },
            { 309.1804, 3503857 },
            { 322.1443, 2464506 },
            { 434.2072, 34359048 },
            { 481.2761, 8306463 },
            { 755.3852, 4325881 },
            { 772.4161, 20219846 },
            { 790.4244, 109196096 },
            { 1003.561, 14308928 },
            { 1031.555, 194006896 },
            { 1049.566, 173144816 },
        };

    @Test
    public void testSPLASH1() {
        ArrayList<Ion> ions = new ArrayList<Ion>();
        for (double[] peak : peakList1){
            ions.add(new Ion(peak[0], peak[1]));
        }
        Splash splashFactory = SplashFactory.create();
        Spectrum spectrum = new SpectrumImpl(ions, SpectraType.MS);

        assertEquals(SPLASH1, splashFactory.splashIt(spectrum));
    }

}
schymane commented 1 year ago

Hmm, when we validated the SPLASH originally (2016), all implementations returned the same SPLASH. So which one has changed, the java or the R? I think - unless any evidence points otherwise - the FiehnLab one should be regarded as correct, as the R implementation was a downstream product iirc (@sneumann @egonw would you agree?).

schymane commented 1 year ago

I should have also tagged @berlinguyinca in that as well - Gert please see conversation trace above!

meowcat commented 1 year ago

Hmm, when we validated the SPLASH originally (2016), all implementations returned the same SPLASH

I assume this is a corner case that wasn't thought about? Interestingly this same batch of records seems to have more instances of this issue. I do not recall seeing it before.

schymane commented 1 year ago

No this was exactly the kind of corner case we tried to design out of the SPLASH (ie so that this kind of thing didn't happen). We evaluated all the spectra we could lay our hands on at the time (which was a lot) and we had to redesign the entirety because of a 32-bit/64-bit issue. Is this a new batch of records?

The code for the R implementation must be lying around somewhere ...

schymane commented 1 year ago

https://github.com/berlinguyinca/spectra-hash/tree/master/splashR

Last changed 7-8 years ago (omg I feel old :-D )

https://github.com/berlinguyinca/spectra-hash/blob/master/splashR/R/getSplash.R

schymane commented 1 year ago

How are you generating the SPLASHes, within RMassBank? And then using splashR? Maybe we need to switch to the java implementation? (if we can't find the root cause in splashR)?

To answer my own question, yes, it's splashR build into RMB: https://github.com/MassBank/RMassBank/blob/main/R/getSplash.R

sneumann commented 1 year ago

Hi, indeed, the R version was probably hacked together by me. Doing a JNI Java binding like rCDK does should not be rocket science, but I dodged that back then. Having a single actual implementation was also the way to go for the InChI trust. Yours, Steffen P.S.: Oh, and I think RMB copied that small R code snippet to not depend on another (non-BioC and non-CRAN) package.

meowcat commented 1 year ago

Maybe we need to switch to the java implementation? (if we can't find the root cause in splashR)?

ooooof can we avoid that?

schymane commented 1 year ago

Yes, we have v0.0.3 in RMB, and the only difference between v0.0.3 and v0.0.4 was the addition of a citation file by Egon. All implementations are ticked as validated on the SPLASH side, but obviously something has deviated ...

@meowcat I guess we can avoid if we can find the reason for the deviation in the R code ... (!) but given that that code hasn't changed ... what has changed instead? An underlying package?

meowcat commented 1 year ago

I don't think anything has changed, I think we are hitting a corner case not covered by the validation.

egonw commented 1 year ago

What can have happened is that an underlying dependency (maybe even core R) changed something.

schymane commented 1 year ago

But several at once is very strange ... new instrument? (as in, larger intensity values or something that was beyond what we expected in 2016 even with upscaling at the time?)

meowcat commented 1 year ago

No, interestingly the user reports that a second batch acquired at a similar time on the same instrument does not suffer from the issue. Obviously I also haven't dug super deep yet.

berlinguyinca commented 1 year ago

Hi,

we tested for all these infos in the c, java, python versions. I honestly don't recall who did the R version since its been quite a while. But yeah special attention was giving to not have rounding issues and @Sajjan Mehta @.***> did quite some testing to ensure that this works right.

Maybe something with R itself changed and now it behaves different?

G.

On Tue, Jul 4, 2023, 8:35 AM meowcat @.***> wrote:

No, interestingly the user reports that a second batch acquired at a similar time on the same instrument does not suffer from the issue. Obviously I also haven't dug super deep yet.

— Reply to this email directly, view it on GitHub https://github.com/MassBank/MassBank-web/issues/384#issuecomment-1620456475, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAD73HIENL4JZ6OI4N3KPDXOQZ2LANCNFSM6AAAAAAZ5VA7OA . You are receiving this because you were mentioned.Message ID: @.***>

egonw commented 1 year ago

I honestly don't recall who did the R version since its been quite a while.

https://github.com/MassBank/RMassBank/blob/main/DESCRIPTION#L19-L20

sneumann commented 1 year ago

Hi, I am the practical type, I want to have an as-simple-as-possible reproducible test case (spectrum, splashR invocation, expected splash as calculated by java) to be able to try this on all R versions I can get my hands on, including https://hub.docker.com/r/rocker/r-ver/tags?page=1&ordering=-last_updated Once I find different behaviour, we can hunt where it comes from. Yours, Steffen

ssmehta commented 1 year ago

Hi all,

Before releasing SPLASH, I remember we tested all of the implementations (including the R implementation) against a large set of hundreds of thousands of spectra for consistency. A subset was added to the repo: https://github.com/berlinguyinca/spectra-hash/tree/master/base-dataset/spectra/notsplashed

Unfortunately, SPLASH is highly sensitive to precision and rounding, I believe up to 6 decimal places for mass.

I did a quick test of the API (which uses the Java implementation) with the splashR implemention included in the official repo using the peaks.txt file from the gist.

import pandas as pd
import requests

df = pd.read_csv('peaks.csv')
data = {
    'ions': [{'mass': row.mz, 'intensity': row.intensity} for i, row in df.iterrows()],
    'type': 'MS'
}

r = requests.post('https://splash.fiehnlab.ucdavis.edu/splash/it', json=data)
print(r.text)
# splash10-001j-9000100300-53a6cb4ceba8ad387ba1
library(splashR)
peaks <- read.csv("peaks.csv")

getSplash(peaks[,c("mz", "intensity")])
# splash10-001j-9000100300-53a6cb4ceba8ad387ba1"

So the two implementations seem to match okay, although it yields the "wrong" SPLASH.

When I use the peak data directly copied from MSBNK-EAWAG-ED046704.txt:

# peaks-MSBNK-EAWAG-ED046704.csv
mz,intensity
112.087,2679171
120.0808,2070870
164.1068,32504938
215.1177,6590179
223.1189,6060016
240.1454,3325411
243.1127,13114276
309.1804,3503857
322.1443,2464506
434.2072,34359048
481.2761,8306463
755.3852,4325881
772.4161,20219846
790.4244,109196096
1003.561,14308928
1031.555,194006896
1049.566,173144816

splash10-001j-9000100300-153a9c6153253ef2a9a5 is returned by the API and splashR.

I also tested all the rounding variations, and all yielded the correct SPLASH:

library(splashR)

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
peaks$mz <- round(peaks$mz, 4)
peaks$intensity <- round(peaks$intensity, 1)
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
peaks$mz <- round(peaks$mz, 4)
peaks$intensity <- round(peaks$intensity, 0)
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
peaks$mz <- round(peaks$mz, 4)
peaks$intensity <- floor(peaks$intensity)
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
peaks$mz <- round(peaks$mz, 4)
peaks$intensity <- ceiling(peaks$intensity)
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

peaks <- read.csv("peaks-MSBNK-EAWAG-ED046704.csv")
peaks$mz <- round(peaks$mz, 4)
peaks$intensity <- trunc(peaks$intensity, 1) # I don't actually know what this does
message(getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

Can someone else test this and confirm it?

To me it seems like splashR behaves correctly - are there any differences between the implementations in splashR and RMassBank?

schymane commented 1 year ago

Hi @ssmehta (great to hear from you again, sorry I didn't tag you in earlier, I thought about you immediately!)

When I did a traceback of the code it seems the RMassBank version was 0.0.3 of splashR, and the only difference between 0.0.3 and 0.0.4 was @egonw adding a citation file to the package.

This is the RMassBank snippet: https://github.com/MassBank/RMassBank/blob/main/R/getSplash.R

and the commit message to 0.0.4 https://github.com/berlinguyinca/spectra-hash/commit/c446ccf2b46739ef9da7f32e9b3ea20610f4c3cd

I haven't done an actual diff to confirm there was no change between https://github.com/MassBank/RMassBank/blob/main/R/getSplash.R and https://github.com/berlinguyinca/spectra-hash/blob/master/splashR/R/getSplash.R (aside from the comments at the top)

@meowcat are you able to reproduce the issue?

schymane commented 1 year ago

Oh yes, now I see the issue I think (after reading through for about the 5th time or more...), reflecting on @ssmehta 's comment about the precision and 6 d.p. in the mass:

So peaks.csv has the mass given with many decimal places: image

...and the MassBank peak list produced at the end has this trimmed to our 4 decimal place reporting format. image

(screenshots from the gist)

So ... is this then a question of where / when we calculate the SPLASH in the RMassBank workflow @meowcat?

meier-rene commented 1 year ago

Hi @ssmehta, welcome to the discussion. @schymane: I would love it, if it would be that simple. Unfortunately it is not. The precession issue was my first idea. I drafted that little piece of code:

BiocManager::install("berlinguyinca/spectra-hash", subdir = "splashR")

library(splashR)
spectrum <- cbind(mz=c(112.087, 120.0808, 164.1068,  215.1177,  223.1189,  240.1454,  243.1127,  309.1804,  322.1443, 434.2072,  481.2761,  755.3852,  772.4161,  790.4244, 1003.5611, 1031.5554, 1049.5658),
                  intensity=c(2679171, 2070870, 32504938, 6590179, 6060016, 3325411, 13114276, 3503857, 2464506, 34359048, 8306462, 4325881, 20219846, 109196096, 14308928, 194006896, 173144816))
getSplash(spectrum)

and it gives [1] "splash10-001j-9000100300-840de07e19f6ae3d5b0e" which is different from the java splash10-001j-9000100300-153a9c6153253ef2a9a5. I will make an analysis of different R versions today. docker makes it easy...

schymane commented 1 year ago

...I also think if if were that simple, we would have encountered this much earlier. @ssmetha did a great job testing when we designed the SPLASH, and this cannot be the first time we've had inputs with more d.p. than the output, this should be accounted for in the SPLASH generation. Keep us posted!

ssmehta commented 12 months ago

Hi @meier-rene and @schymane

I think the issue is that in MSBNK-EAWAG-ED046704.txt posted in the original gist, for m/z > 1000 the rounding changes to 3 decimal places (or 7 digits of precision?). Specifically,

  790.4244 109196096 562
  1003.561 14308928 73
  1031.555 194006896 999
  1049.566 173144816 891

So the SPLASH will differ when everything is rounded to 4 decimals.

I'm not familiar with the MassBank validation workflow or at what point this issue is arising, but since m/z values are already rounded in the txt record, it would be best to pass the values exactly as they are presented to the splasher. Any floating point artifacts introduced by parsing the strings should be taken care of by the splasher, so rounding shouldn't be needed.

meier-rene commented 12 months ago

Hi all, sometimes I'm just a little bit blind... @ssmehta actually found the root cause. In the peaklist of the record file the precession of the m/z > 1000 peaks changed from 4 decimal places to 3 decimal places. If you want to preprocess your data frame to get the same peaklist as in the record file you would need to do something like this:

peaks <- read.csv("peaks.csv")
#peaks$mz <- round(peaks$mz, 4)
peaks$mz <- signif(peaks$mz, digits = 7)
peaks$intensity <- round(peaks$intensity, 0)
message(RMassBank:::getSplash(peaks[,c("mz", "intensity")]))
# splash10-001j-9000100300-153a9c6153253ef2a9a5

I haven't found any code with signif in the RMassBank code, thats why I don't know how RMassBank actually formats the peaklist.

Bottom-line: All splashR code works correctly!

meowcat commented 12 months ago

So the issue is the generation of the record, not the calculation of the SPLASH! Nice catch. Yes, this is probably an R antic, which by default prints 7 significant digits:

> print(1234.5678)
[1] 1234.568