MassBank / MassBank-data

Official repository of open data MassBank records

Validator error in Splash #248

Open ksjewell opened 8 months ago

ksjewell commented 8 months ago

Hi René,

I am getting the following Validator error:

10:33:21.420 ERROR massbank.cli.Validator - ACCESSION: MSBNK-BAFG-CSL23102611413
10:33:21.420 ERROR massbank.cli.Validator - ^
10:33:21.420 ERROR massbank.cli.Validator - Error in 'BAFG/MSBNK-BAFG-CSL23102611413.txt'.
10:33:21.473 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-0gx3-9000000000-fdf8d511e2f88d17c82e defined in record file, but splash10-0w3u-9000000000-fdf8d511e2f88d17c82e calculated from peaks.

I checked the file, and the SPLASH it actually contains is: splash10-0006-9300000000-5cd70311703e2423a1c5

I ran the code separately and indeed this is the splash I get when I run:

Browse[1]> spec
        mz intensity
1  44.9980       0.2
2  80.0261       0.1
3  93.0321       0.4
4 108.0227       0.3
Browse[1]> splashR::getSplash(spec)
[1] "splash10-0006-9300000000-5cd70311703e2423a1c5"

So I don't understand where it is getting the SPLASH splash10-0gx3-9000000000-fdf8d511e2f88d17c82e from, and I also don't understand why it is computing splash10-0w3u-9000000000-fdf8d511e2f88d17c82e, which differs from the one I get.

meier-rene commented 8 months ago

Honestly, I don't know. Could you please drop the MSBNK-BAFG-CSL23102611413.txt file here for me?

schymane commented 8 months ago

Strange that it's in the first block, I also don't recall seeing this case before...

ksjewell commented 8 months ago

MSBNK-BAFG-CSL23102611413.txt

meier-rene commented 8 months ago

Thank you. I checked your file. It contains splash10-0006-9300000000-5cd70311703e2423a1c5. The Validator reports that it finds splash10-0006-9300000000-5cd70311703e2423a1c5 but expects splash10-052f-9300000000-5cd70311703e2423a1c5.

I expect the output in your first comment came from a validator run over multiple files. The software runs multithreaded, and the output sometimes gets interleaved, so I expect the output line you found belongs to a different record. Also, in the output the explanation comes first and the filename follows; see the single-file validation below.

Let's focus instead on the output from validating a single file. You are right: there is a mismatch between the SPLASH calculated by RMassBank and the one from the Validator.

Validator version: 2.2.5-SNAPSHOT
14:12:50.497 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-0006-9300000000-5cd70311703e2423a1c5 defined in record file, but splash10-052f-9300000000-5cd70311703e2423a1c5 calculated from peaks.
14:12:50.499 ERROR massbank.cli.Validator - ACCESSION: MSBNK-BAFG-CSL23102611413
14:12:50.499 ERROR massbank.cli.Validator - ^
14:12:50.499 ERROR massbank.cli.Validator - Error in 'MSBNK-BAFG-CSL23102611413.txt'.
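To see at a glance which SPLASH block disagrees, the mismatch line can be split programmatically. The following is a minimal sketch, assuming the message wording shown in the log above; the regular expression and the function name `differing_blocks` are illustrative and not part of the Validator.

```python
import re

# Pattern modelled on the Validator message quoted above (an assumption,
# not taken from the Validator source code).
SPLASH = r"splash10-[0-9a-z]{4}-[0-9]{10}-[0-9a-f]{20}"
MISMATCH = re.compile(
    rf"(?P<defined>{SPLASH}) defined in record file, "
    rf"but (?P<calculated>{SPLASH}) calculated from peaks"
)

def differing_blocks(line):
    """Return 1-based indices of the blocks that differ, or None if the line does not match."""
    m = MISMATCH.search(line)
    if m is None:
        return None
    defined = m.group("defined").split("-")
    calculated = m.group("calculated").split("-")
    return [i + 1 for i, (a, b) in enumerate(zip(defined, calculated)) if a != b]

line = (
    "14:12:50.497 ERROR massbank.cli.Validator - SPLASH from record file does not match "
    "SPLASH calculated from peaklist. splash10-0006-9300000000-5cd70311703e2423a1c5 "
    "defined in record file, but splash10-052f-9300000000-5cd70311703e2423a1c5 "
    "calculated from peaks."
)
print(differing_blocks(line))  # [2]
```

Counting the `splash10` prefix as block 1, only block 2 differs here; the histogram block 3 and the hash block 4 are identical.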

I need to dig a little bit deeper.

ksjewell commented 8 months ago

Alright, seems you will solve it soon. Just as a heads-up, I used splashR to compute the Splash.

schymane commented 8 months ago

Interesting, https://splash.fiehnlab.ucdavis.edu/ gives [image]

...and it only worked on those numbers, returned a format error on the middle column only.

meier-rene commented 8 months ago

We recently had a similar issue https://github.com/MassBank/MassBank-web/issues/384 and it was related to zeros somehow. What happens in your R Object if you remove the 0 in the first row?

schymane commented 8 months ago

I thought of that issue too, but this is affecting the first block this time, not the third one, which is really strange. Is it related to the middle column somehow (all entries are below 1)?

[image]

Tagging in @berlinguyinca and @ssmehta again ;-)

meier-rene commented 8 months ago

We need to solve that issue on the R side.

curl -d '{ "ions": [ {"mass": 44.998, "intensity": 0.2 }, {"mass": 80.0261, "intensity": 0.1 }, {"mass": 93.0321, "intensity": 0.4 }, {"mass": 108.0227, "intensity": 0.3 } ], "type": "MS"}' -H "Content-Type: application/json"  https://splash.fiehnlab.ucdavis.edu/splash/it 
splash10-052f-9300000000-5cd70311703e2423a1c5

The REST endpoint agrees with the Java implementation, and 44.9980 gives the same result. I will read the old issue again very carefully.
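For checking several peak lists, the curl call above can be scripted. This is a rough sketch using only the Python standard library; the endpoint URL and the JSON layout are copied from the curl command, while the function names are hypothetical and the response is assumed to be the plain SPLASH string, as in the curl output.

```python
import json
import urllib.request

# Endpoint taken from the curl call above.
SPLASH_URL = "https://splash.fiehnlab.ucdavis.edu/splash/it"

def build_payload(peaks, spectrum_type="MS"):
    """Build the JSON request body from (m/z, intensity) pairs."""
    ions = [{"mass": mz, "intensity": intensity} for mz, intensity in peaks]
    return json.dumps({"ions": ions, "type": spectrum_type})

def request_splash(peaks):
    """POST the peak list and return the response text (requires network access)."""
    req = urllib.request.Request(
        SPLASH_URL,
        data=build_payload(peaks).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8").strip()

peaks = [(44.998, 0.2), (80.0261, 0.1), (93.0321, 0.4), (108.0227, 0.3)]
print(build_payload(peaks))
```

Calling `request_splash(peaks)` would then send the same request as the curl command; it is not invoked here so that the sketch runs offline.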

ksjewell commented 8 months ago

I can't find a way in R to skip the first 0 in 44.9980 but leave the others unchanged. If I round everything to 3 decimal places, I also get the incorrect splash

schymane commented 8 months ago

Please don't round to 3 dp! That will for sure change the splash (but also the final hash block too, right?). The first block is a summary block, it makes no sense why it would change so dramatically ... it should not be sensitive to a 0.

schymane commented 8 months ago

In the second and third blocks, intensities are summed over fixed (but different) bin sizes and wrapped over ten bins. The wrapped bin (zero-based) index for a given ion is computed as floor (m/z ÷ BinSize) modulo 10. This wrapping strategy accommodates all possible spectral mass ranges while maintaining fixed-length summary blocks.

From the article ... the second block (wrapped bin) is the one that's changing: 052f vs 0006
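The wrapped-bin rule quoted above can be sketched in a few lines. The bin size of 100 used in the example is an assumption chosen for illustration only (the quote says the two blocks use different fixed bin sizes without giving them), and the function names are hypothetical.

```python
import math

def wrapped_bin_index(mz, bin_size, n_bins=10):
    """Zero-based wrapped bin index: floor(m/z / bin_size) modulo n_bins."""
    return math.floor(mz / bin_size) % n_bins

def wrapped_histogram(peaks, bin_size, n_bins=10):
    """Sum intensities per wrapped bin (no normalisation or encoding applied)."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        bins[wrapped_bin_index(mz, bin_size, n_bins)] += intensity
    return bins

# Peak list from the first comment; bin_size=100 is an illustrative assumption.
peaks = [(44.998, 0.2), (80.0261, 0.1), (93.0321, 0.4), (108.0227, 0.3)]
indices = [wrapped_bin_index(mz, 100) for mz, _ in peaks]
histogram = wrapped_histogram(peaks, 100)
print(indices)  # [0, 0, 0, 1]
```

With this bin size, three of the four ions fall into bin 0 and one into bin 1. Because all intensities here are below 1, any later integer rounding of the summed values is where small intensities could plausibly cause trouble.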

meowcat commented 8 months ago

Looking at the failing file, I note that your absolute intensities are all <1. Is this how Sciex reports them? Does that have anything to do with the issue?

ksjewell commented 8 months ago

This is how Sciex converts them to mzXML. I believe in the native Sciex format, the numbers are higher.

meowcat commented 8 months ago

Yep, that's it https://gist.github.com/meowcat/e88b6031ef52cc036576669c1330605f

meier-rene commented 8 months ago

@meowcat great finding. Does this mean the issue should go to the R implementation at https://github.com/berlinguyinca/spectra-hash ? Besides that, is there any chance that we get higher intensities out of the Sciex export for now? I expect you use ProteoWizard for the conversion?

ksjewell commented 8 months ago

I can just change the intensities temporarily to create the splash, no?

meier-rene commented 8 months ago

You don't need to worry about the SPLASH issue, because I can easily fix that in the txt files. If you think your files are fine and only some SPLASHes are broken, please reopen your PR.

I expect that there is a fix required to the SPLASH library to solve that issue on the RMassBank side.

meowcat commented 8 months ago

@ksjewell Since you import the records in MsBackendMassbank and then export them again (right?), you could in fact recalculate the splash there, yes. Something like

spectraData(sp)$splash <- purrr::map_chr(peaksData(sp), function(pks) {
  # scale the intensities (column 2) so they are no longer all below 1
  pks[, 2] <- pks[, 2] * 1000
  RMassBank:::getSplash(pks)
})

"I expect that there is a fix required to the SPLASH library to solve that issue on the RMassBank side."

yep; though best would be to get the fix in the original SPLASH lib and port it identically, so we don't have two different implementations of the fix. I hope multiplying by 1k will not break a few other SPLASHes because of rounding issues

meier-rene commented 8 months ago

"yep; though best would be to get the fix in the original SPLASH lib and port it identically, so we don't have two different implementations of the fix."

I agree; that's why I opened an issue at the splash package repo.

ksjewell commented 7 months ago

I think I am making progress, but there is still one single Validator error left (this is after multiplying the intensity by 1000). Since it is just one file, I will change the i to an l and be done with it :). But you know, in case it helps:

20:09:06.617 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-014i-9000000000-508039bd516ba9b5a8ab defined in record file, but splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks.

Here is the file:


ACCESSION: MSBNK-BAFG-CSL231109456
RECORD_TITLE: Benzyl-dimethyl-decylammonium; LC-ESI-QTOF; MS2; 150 V
DATE: 2023.11.09
AUTHORS: Kevin S. Jewell; Björn Ehlig; Arne Wick
LICENSE: dl-de/by-2-0
COPYRIGHT: Copyright 2023 Federal Institute of Hydrology, Koblenz, Germany
COMMENT: CONFIDENCE Reference Standard (Level 1)
COMMENT: Chromatography method: dx.doi.org/10.1016/j.chroma.2015.11.014
COMMENT: Acquisition method: 10.1002/rcm.8541
CH$NAME: Benzyl-dimethyl-decylammonium
CH$COMPOUND_CLASS: Antimicrobial; Pharmaceutical
CH$FORMULA: [C19H34N]+
CH$EXACT_MASS: 276.2686
CH$SMILES: CCCCCCCCCC[N+](C)(C)Cc1ccccc1
CH$IUPAC: InChI=1S/C19H34N/c1-4-5-6-7-8-9-10-14-17-20(2,3)18-19-15-12-11-13-16-19/h11-13,15-16H,4-10,14,17-18H2,1-3H3/q+1
CH$LINK: CAS 48185-25-7
CH$LINK: INCHIKEY UARILQSOMYIQCM-UHFFFAOYSA-N
AC$INSTRUMENT: TripleTOF 5600 SCIEX
AC$INSTRUMENT_TYPE: LC-ESI-QTOF
AC$MASS_SPECTROMETRY: MS_TYPE MS2
AC$MASS_SPECTROMETRY: ION_MODE POSITIVE
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 150
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
AC$MASS_SPECTROMETRY: IONIZATION ESI
AC$CHROMATOGRAPHY: COLUMN_NAME Zorbax Eclipse Plus C18 2.1 mm x 150 mm, 3.5 um, Agilent
AC$CHROMATOGRAPHY: COLUMN_TEMPERATURE 40 °C
AC$CHROMATOGRAPHY: FLOW_GRADIENT 0 min min 98% A, 1 min 98% A, 2 min 80% A, 16.5 min 2% A, 22 min 2% A, 22.1 min 98% A, 27 min 98% A
AC$CHROMATOGRAPHY: FLOW_RATE 0.3 mL/min
AC$CHROMATOGRAPHY: RETENTION_TIME 10.366 min
AC$CHROMATOGRAPHY: SOLVENT A: Water 0.1% Formic acid, B: Acetonitrile 0.1% Formic acid
MS$FOCUSED_ION: PRECURSOR_M/Z 276.2686
MS$FOCUSED_ION: PRECURSOR_TYPE [M]+
MS$DATA_PROCESSING: COMMENT Export with Spectra 1.9.12 MsBackendMassbank 1.7.4
MS$DATA_PROCESSING: WHOLE RMassBank 2.3.1
PK$SPLASH: splash10-014i-9000000000-508039bd516ba9b5a8ab
PK$NUM_PEAK: 4
PK$PEAK: m/z int. rel.int.
  42.0443 4.6 142
  58.0706 7.6 235
  65.0436 32.2 999
  91.0554 11.5 356
//

schymane commented 7 months ago

Annoyingly, the SPLASH website won't take it (which may be a clue in itself, this happened before too). I've tried several variants. [image]

The 999-scaled values give the i variant: [image]
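For reference, the 999-scaled column (rel.int.) of the record above can be reproduced from the int. column. This is a sketch under the assumption that relative intensities are scaled to a base peak of 999 and truncated to integers; that assumption matches the four values in this record but is not confirmed by the thread, and the function name is hypothetical.

```python
def relative_intensities(intensities, scale=999):
    """Scale to the base peak and truncate (assumed rule, inferred from the record above)."""
    base_peak = max(intensities)
    return [int(value / base_peak * scale) for value in intensities]

# "int." column of MSBNK-BAFG-CSL231109456
rel = relative_intensities([4.6, 7.6, 32.2, 11.5])
print(rel)  # [142, 235, 999, 356]
```

The integer values are what the website accepts, which is why they could be tested there while the decimal intensities could not.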

ksjewell commented 7 months ago

So does that mean the i variant is correct in this case and the Validator is incorrect?

schymane commented 7 months ago

Not sure, need @meier-rene 's opinion on this... it's strange that it doesn't work at all with the decimals...

sneumann commented 7 months ago

Hi, I can confirm that the online calculator https://splash.fiehnlab.ucdavis.edu/ is unhappy about decimals for intensities. Decimals in m/z are fine there. IIRC the online calculator uses the Scala implementation. Yours, Steffen

berlinguyinca commented 7 months ago

good afternoon,

I can confirm that the website requires the intensity to be provided as an integer. But frankly, for the love of it, I can't remember why we decided to go with integers over doubles/floats for intensity values on the website. The actual code that generates the splash accepts doubles just fine, based on the input string:

public static Spectrum convertStringToSpectrum(String spectra, SpectraType type, String origin) {
    String[] pairs = spectra.split(" ");
    List ionList = new ArrayList(200);
    String[] var5 = pairs;
    int var6 = pairs.length;

    for(int var7 = 0; var7 < var6; ++var7) {
        String pair = var5[var7];
        String[] p = pair.split(":");
        Double m = Double.parseDouble(p[0]);
        Double intensity = Double.parseDouble(p[1]);
        ionList.add(new Ion(m, intensity));
    }

    SpectrumImpl impl = new SpectrumImpl(ionList, type);
    impl.setOrigin(origin);
    return impl;
}

so there is no reason why the website complains about it, except someone writing a wrong regular expression to validate the input.

g.
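As an illustration of input validation that would accept decimals in both columns, here is a small sketch that parses the same space-separated "mz:intensity" string format as the Java method above. The regular expression and the function name are hypothetical; this is not the website's actual validation code.

```python
import re

# Accepts non-negative integers or decimals on both sides of the colon (illustrative only).
PAIR = re.compile(r"^(\d+(?:\.\d+)?):(\d+(?:\.\d+)?)$")

def parse_spectrum(text):
    """Parse 'mz:intensity mz:intensity ...' into a list of (float, float) tuples."""
    ions = []
    for pair in text.split():
        m = PAIR.match(pair)
        if m is None:
            raise ValueError(f"malformed ion pair: {pair!r}")
        ions.append((float(m.group(1)), float(m.group(2))))
    return ions

ions = parse_spectrum("42.0443:4.6 58.0706:7.6 65.0436:32.2 91.0554:11.5")
print(ions)
```

A pattern that only allows `\d+` for the intensity would reject this input, which is consistent with the behaviour reported for the website.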


--

Lead Developer - Fiehnlab, UC Davis

gert wohlgemuth

work: http://fiehnlab.ucdavis.edu/staff/wohlgemuth

linkedin:

https://www.linkedin.com/in/berlinguyinca

sneumann commented 7 months ago

Ok, digging a bit further ... so far we used the online splash calculator that takes the peaklist as kinda CSV, and which complains about non-integer intensities due to the input validation. Using the REST call we get for the spectrum in https://github.com/MassBank/MassBank-data/issues/248#issuecomment-1804649225:

curl -X POST -H 'Content-Type: application/json' -d '{"ions":[{"mass": 42.0443, "intensity": 4.6},{"mass": 58.0706, "intensity": 7.6},{"mass": 65.0436, "intensity": 32.2},{"mass": 91.0554, "intensity": 11.5}], "type": "MS"}' https://splash.fiehnlab.ucdavis.edu/splash/it  ; echo
splash10-014l-9000000000-508039bd516ba9b5a8ab

which is the same value the MassBank validator reports (splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks). So I added a unit test to splashR checking this output: https://github.com/berlinguyinca/spectra-hash/pull/51/files and it gets the correct result.

I also checked that both splashR and the splash code we copy&pasted into RMassBank give identical results:

> spectrum <- cbind(mz=c(42.0443, 58.0706, 65.0436, 91.0554), intensity=c(4.6, 7.6, 32.2, 11.5))
> splashR:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> RMassBank:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
... other attached packages:
[1] RMassBank_3.11.1.1 Rcpp_1.0.10        splashR_0.0.3      digest_0.6.31     

So I get the feeling RMassBank passes something weird to getSplash(). I would need @ksjewell 's help to run the RMassBank with some more diagnostic output to capture what values are used for this record in https://github.com/MassBank/RMassBank/blob/3b61006a1a4bac9c94e780ad82834a1dae9ce417/R/createMassBank.R#L1556 Simplest would be to add the following line to save the peaks for the offending record:

if (mbdata[["PK$SPLASH"]]=="splash10-014i-9000000000-508039bd516ba9b5a8ab") save(peaks, file="peaks-splash10-014i-9000000000-508039bd516ba9b5a8ab.Rdata")

That'd be highly appreciated, please ping me if you need help. Yours, Steffen

ssmehta commented 7 months ago

Hi all,

Regarding the initial problem in this issue relating to inconsistent histograms, I believe this was due to a missing binning correction factor in splashR. I submitted a PR which should fix this: https://github.com/berlinguyinca/spectra-hash/pull/52

For the second spectrum, I agree with @sneumann that it doesn't seem to be an issue with SPLASH. I tried some variations of intensities and could only produce 014l as the prefilter histogram, with and without the histogram fix I submitted. Hopefully with some more information we can track down that discrepancy.

Best, Sajjan