Open ksjewell opened 1 year ago
Honestly, I don't know. Could you please drop the MSBNK-BAFG-CSL23102611413.txt file here for me?
Strange that it's in the first block, I also don't recall seeing this case before...
Thank you. I checked your file. It contains splash10-0006-9300000000-5cd70311703e2423a1c5. The Validator reports that it finds splash10-0006-9300000000-5cd70311703e2423a1c5 but expects splash10-052f-9300000000-5cd70311703e2423a1c5.
I expect the output shown in your first comment came from a Validator run over multiple files. The software runs multithreaded, and the output sometimes gets interleaved, so I expect the output line you found belongs to a different record. Note also that the explanation comes first in the output and the filename afterwards; see the single-file validation below.
Let's focus instead on the output of the validation of a single file. You are right: there is a mismatch between the SPLASH calculated by RMassBank and the one calculated by the Validator.
Validator version: 2.2.5-SNAPSHOT
14:12:50.497 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-0006-9300000000-5cd70311703e2423a1c5 defined in record file, but splash10-052f-9300000000-5cd70311703e2423a1c5 calculated from peaks.
14:12:50.499 ERROR massbank.cli.Validator - ACCESSION: MSBNK-BAFG-CSL23102611413
14:12:50.499 ERROR massbank.cli.Validator - ^
14:12:50.499 ERROR massbank.cli.Validator - Error in 'MSBNK-BAFG-CSL23102611413.txt'.
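To see which part of the hash disagrees, the two SPLASH strings from the log line can be compared block by block. A minimal Python sketch; the helper name `differing_blocks` is made up for illustration and is not part of the Validator:

```python
def differing_blocks(splash_a: str, splash_b: str) -> list:
    """Return the zero-based indices of the dash-separated blocks that differ."""
    blocks_a = splash_a.split("-")
    blocks_b = splash_b.split("-")
    if len(blocks_a) != len(blocks_b):
        raise ValueError("SPLASH strings have a different number of blocks")
    return [i for i, (a, b) in enumerate(zip(blocks_a, blocks_b)) if a != b]


# The two values reported by the Validator above.
from_record = "splash10-0006-9300000000-5cd70311703e2423a1c5"
from_peaks = "splash10-052f-9300000000-5cd70311703e2423a1c5"

print(differing_blocks(from_record, from_peaks))  # [1]
```

Index 0 is the `splash10` prefix, so the only disagreement is in the block directly after it; the two remaining blocks are identical.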
I need to dig a little bit deeper.
Alright, it seems you will solve it soon. Just as a heads-up, I used splashR to compute the SPLASH.
Interesting, https://splash.fiehnlab.ucdavis.edu/ gives
...and it only worked on those numbers, returned a format error on the middle column only.
We recently had a similar issue https://github.com/MassBank/MassBank-web/issues/384 and it was related to zeros somehow. What happens in your R Object if you remove the 0 in the first row?
I thought of that issue too, but this is affecting the first block this time, not the third one, which is really strange. Is it related to the middle column somehow (all entries are below 1)?
Tagging in @berlinguyinca and @ssmehta again ;-)
We need to solve that issue on the R side.
curl -d '{ "ions": [ {"mass": 44.998, "intensity": 0.2 }, {"mass": 80.0261, "intensity": 0.1 }, {"mass": 93.0321, "intensity": 0.4 }, {"mass": 108.0227, "intensity": 0.3 } ], "type": "MS"}' -H "Content-Type: application/json" https://splash.fiehnlab.ucdavis.edu/splash/it
splash10-052f-9300000000-5cd70311703e2423a1c5
The REST endpoint agrees with the Java implementation, and 44.9980 gives the same result. I will read the old issue again very carefully.
I can't find a way in R to skip the first 0 in 44.9980 but leave the others unchanged. If I round everything to 3 decimal places, I also get the incorrect SPLASH.
Please don't round to 3 dp! That will for sure change the splash (but also the final hash block too, right?). The first block is a summary block, it makes no sense why it would change so dramatically ... it should not be sensitive to a 0.
In the second and third blocks, intensities are summed over fixed (but different) bin sizes and wrapped over ten bins. The wrapped bin (zero-based) index for a given ion is computed as floor (m/z ÷ BinSize) modulo 10. This wrapping strategy accommodates all possible spectral mass ranges while maintaining fixed-length summary blocks.
From the article ... the second block (wrapped bin) is the one that's changing: 052f vs 0006
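The wrapped-bin rule quoted above can be written out in a few lines. This is a Python sketch of the index formula and the summing step only; the bin size of 100 is an assumption chosen for illustration, and the normalisation and character encoding done by the real SPLASH implementations are deliberately left out.

```python
import math


def wrapped_bin_index(mz: float, bin_size: float, n_bins: int = 10) -> int:
    """Zero-based wrapped bin index: floor(m/z / bin_size) modulo n_bins."""
    return math.floor(mz / bin_size) % n_bins


def wrapped_histogram(peaks, bin_size, n_bins=10):
    """Sum intensities per wrapped bin; peaks is a list of (m/z, intensity)."""
    bins = [0.0] * n_bins
    for mz, intensity in peaks:
        bins[wrapped_bin_index(mz, bin_size, n_bins)] += intensity
    return bins


# Peak list from the curl example earlier in this thread.
peaks = [(44.998, 0.2), (80.0261, 0.1), (93.0321, 0.4), (108.0227, 0.3)]

# bin_size=100 is an illustrative assumption, not a value stated in the thread.
print([wrapped_bin_index(mz, 100) for mz, _ in peaks])  # [0, 0, 0, 1]
print(wrapped_bin_index(1234.5, 100))  # 2, because bin 12 wraps around to 2
```

With that bin size the first three ions share bin 0 and the fourth lands in bin 1. Multiplying every intensity by the same factor scales all bin sums by that factor, so the relative shape of the summed histogram stays the same.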
Looking at the failing file, I note that your absolute intensities are all <1. Is this how Sciex reports them? Does that have anything to do with the issue?
This is how Sciex converts them to mzXML. I believe in the native Sciex format, the numbers are higher.
@meowcat great finding. Does this mean the issue should go to the R implementation at https://github.com/berlinguyinca/spectra-hash? Besides that, is there any chance that we can get higher intensities out of the Sciex export for now? I expect you use ProteoWizard for the conversion?
I can just change the intensities temporarily to create the splash, no?
You don't need to bother about the SPLASH issue, because I can easily fix that in the txt files. If you think your files are fine and only some SPLASHes are broken, please reopen your PR.
I expect that there is a fix required to the SPLASH library to solve that issue on the RMassBank side.
@ksjewell Since you import the records in MsBackendMassbank and then export them again (right?), you could in fact recalculate the splash there, yes. Something like
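library(purrr)  # provides map_chr()
spectraData(sp)$splash <- map_chr(peaksData(sp), function(pks) {
  # scale the intensities (column 2) so they are no longer all below 1
  pks[, 2] <- pks[, 2] * 1000
  RMassBank:::getSplash(pks)
})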
yep; though best would be to get the fix in the original SPLASH lib and port it identically, so we don't have two different implementations of the fix. I hope multiplying by 1k will not break a few other SPLASHes because of rounding issues
I agree, that's why I opened an issue at the splash package repo.
I think I am making progress, but there is still one single Validator error left (this is after multiplying intensity by 1000). Since it is just one file I will change the i to an l and be done with it :). But you know, in case it helps:
20:09:06.617 ERROR massbank.cli.Validator - SPLASH from record file does not match SPLASH calculated from peaklist. splash10-014i-9000000000-508039bd516ba9b5a8ab defined in record file, but splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks.
Here is the file:
ACCESSION: MSBNK-BAFG-CSL231109456
RECORD_TITLE: Benzyl-dimethyl-decylammonium; LC-ESI-QTOF; MS2; 150 V
DATE: 2023.11.09
AUTHORS: Kevin S. Jewell; Björn Ehlig; Arne Wick
LICENSE: dl-de/by-2-0
COPYRIGHT: Copyright 2023 Federal Institute of Hydrology, Koblenz, Germany
COMMENT: CONFIDENCE Reference Standard (Level 1)
COMMENT: Chromatography method: dx.doi.org/10.1016/j.chroma.2015.11.014
COMMENT: Acquisition method: 10.1002/rcm.8541
CH$NAME: Benzyl-dimethyl-decylammonium
CH$COMPOUND_CLASS: Antimicrobial; Pharmaceutical
CH$FORMULA: [C19H34N]+
CH$EXACT_MASS: 276.2686
CH$SMILES: CCCCCCCCCC[N+](C)(C)Cc1ccccc1
CH$IUPAC: InChI=1S/C19H34N/c1-4-5-6-7-8-9-10-14-17-20(2,3)18-19-15-12-11-13-16-19/h11-13,15-16H,4-10,14,17-18H2,1-3H3/q+1
CH$LINK: CAS 48185-25-7
CH$LINK: INCHIKEY UARILQSOMYIQCM-UHFFFAOYSA-N
AC$INSTRUMENT: TripleTOF 5600 SCIEX
AC$INSTRUMENT_TYPE: LC-ESI-QTOF
AC$MASS_SPECTROMETRY: MS_TYPE MS2
AC$MASS_SPECTROMETRY: ION_MODE POSITIVE
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 150
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
AC$MASS_SPECTROMETRY: IONIZATION ESI
AC$CHROMATOGRAPHY: COLUMN_NAME Zorbax Eclipse Plus C18 2.1 mm x 150 mm, 3.5 um, Agilent
AC$CHROMATOGRAPHY: COLUMN_TEMPERATURE 40 °C
AC$CHROMATOGRAPHY: FLOW_GRADIENT 0 min min 98% A, 1 min 98% A, 2 min 80% A, 16.5 min 2% A, 22 min 2% A, 22.1 min 98% A, 27 min 98% A
AC$CHROMATOGRAPHY: FLOW_RATE 0.3 mL/min
AC$CHROMATOGRAPHY: RETENTION_TIME 10.366 min
AC$CHROMATOGRAPHY: SOLVENT A: Water 0.1% Formic acid, B: Acetonitrile 0.1% Formic acid
MS$FOCUSED_ION: PRECURSOR_M/Z 276.2686
MS$FOCUSED_ION: PRECURSOR_TYPE [M]+
MS$DATA_PROCESSING: COMMENT Export with Spectra 1.9.12 MsBackendMassbank 1.7.4
MS$DATA_PROCESSING: WHOLE RMassBank 2.3.1
PK$SPLASH: splash10-014i-9000000000-508039bd516ba9b5a8ab
PK$NUM_PEAK: 4
PK$PEAK: m/z int. rel.int.
42.0443 4.6 142
58.0706 7.6 235
65.0436 32.2 999
91.0554 11.5 356
//
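The rel.int. column of the record above is consistent with scaling each intensity to a base peak of 999 and truncating the result. A small Python sketch of that reading; the truncation is inferred from the numbers in this record, not taken from the RMassBank source:

```python
def relative_intensities(intensities, scale=999):
    """Scale intensities so the base peak equals `scale`, truncating to int."""
    base_peak = max(intensities)
    return [int(value / base_peak * scale) for value in intensities]


# Absolute intensities from the PK$PEAK block above.
absolute = [4.6, 7.6, 32.2, 11.5]

print(relative_intensities(absolute))  # [142, 235, 999, 356]
```

Rounding instead of truncating would give 143, 236 and 357 for the three smaller peaks, which does not match the record.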
Annoyingly, the SPLASH website won't take it (which may be a clue in itself, this happened before too). I've tried several variants.
The 999-scaled values give the i variant:
So does that mean the i variant is correct in this case and the Validator is incorrect?
Not sure, need @meier-rene's opinion on this... it's strange that it doesn't work at all with the decimals...
Hi, I can confirm that the online calculator https://splash.fiehnlab.ucdavis.edu/ is unhappy about decimals for intensities. Decimals in m/z are fine there. IIRC the online calculator uses the Scala implementation. Yours, Steffen
good afternoon,
I can confirm that the website requires the intensity to be provided as an integer. But frankly, I can't for the life of me remember why we decided to go with integers over doubles/floats for intensity values on the website. The actual code that generates the SPLASH accepts doubles just fine, based on the input string:
public static Spectrum convertStringToSpectrum(String spectra,
        SpectraType type, String origin) {
    // The input is a space-separated list of "mz:intensity" pairs.
    String[] pairs = spectra.split(" ");
    List<Ion> ionList = new ArrayList<>(pairs.length);
    for (String pair : pairs) {
        String[] p = pair.split(":");
        // Both values are parsed as doubles, so decimals are accepted.
        Double m = Double.parseDouble(p[0]);
        Double intensity = Double.parseDouble(p[1]);
        ionList.add(new Ion(m, intensity));
    }
    SpectrumImpl impl = new SpectrumImpl(ionList, type);
    impl.setOrigin(origin);
    return impl;
}
so there is no reason why the website complains about it, except someone writing a wrong regular expression to validate the input.
g.
Lead Developer - Fiehnlab, UC Davis
gert wohlgemuth
work: http://fiehnlab.ucdavis.edu/staff/wohlgemuth
Ok, digging a bit further ... so far we used the online splash calculator that takes the peaklist as kinda CSV, and which complains about non-integer intensities due to the input validation. Using the REST call we get for the spectrum in https://github.com/MassBank/MassBank-data/issues/248#issuecomment-1804649225:
curl -X POST -H 'Content-Type: application/json' -d '{"ions":[{"mass": 42.0443, "intensity": 4.6},{"mass": 58.0706, "intensity": 7.6},{"mass": 65.0436, "intensity": 32.2},{"mass": 91.0554, "intensity": 11.5}], "type": "MS"}' https://splash.fiehnlab.ucdavis.edu/splash/it ; echo
splash10-014l-9000000000-508039bd516ba9b5a8ab
which is the same value as the MassBank Validator reports ("splash10-014l-9000000000-508039bd516ba9b5a8ab calculated from peaks"). So I added a unit test to splashR checking this output: https://github.com/berlinguyinca/spectra-hash/pull/51/files and it gets the correct result.
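The same REST request can be scripted. A hedged Python sketch using only the standard library: the endpoint URL and JSON layout are copied from the curl call above, while the function names `build_splash_payload` and `request_splash` are made up here. Nothing is sent over the network until `request_splash` is called.

```python
import json
import urllib.request

# Endpoint taken from the curl call in this thread.
SPLASH_URL = "https://splash.fiehnlab.ucdavis.edu/splash/it"


def build_splash_payload(peaks, spectrum_type="MS"):
    """Build the JSON body; peaks is a list of (m/z, intensity) pairs."""
    ions = [{"mass": mz, "intensity": intensity} for mz, intensity in peaks]
    return json.dumps({"ions": ions, "type": spectrum_type})


def request_splash(peaks, url=SPLASH_URL):
    """POST the peak list and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=build_splash_payload(peaks).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()


# Peak list of MSBNK-BAFG-CSL231109456, with decimal intensities.
peaks = [(42.0443, 4.6), (58.0706, 7.6), (65.0436, 32.2), (91.0554, 11.5)]
print(build_splash_payload(peaks))
```

Calling `request_splash(peaks)` should return the same string as the curl command, provided the service is reachable. Because the body is plain JSON, decimal intensities pass through without the integer check that the web form applies.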
I also checked that both splashR and the splash code we copy&pasted into RMassBank give identical results:
> spectrum <- cbind(mz=c(42.0443, 58.0706, 65.0436, 91.0554), intensity=c(4.6, 7.6, 32.2, 11.5))
> splashR:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> RMassBank:::getSplash(spectrum)
[1] "splash10-014l-9000000000-508039bd516ba9b5a8ab"
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
... other attached packages:
[1] RMassBank_3.11.1.1 Rcpp_1.0.10 splashR_0.0.3 digest_0.6.31
So I get the feeling RMassBank passes something weird to getSplash().
I would need @ksjewell's help to run RMassBank with some more diagnostic output to capture what values are used for this record in https://github.com/MassBank/RMassBank/blob/3b61006a1a4bac9c94e780ad82834a1dae9ce417/R/createMassBank.R#L1556
Simplest would be to add the following line to save the peaks for the offending record:
if (mbdata[["PK$SPLASH"]]=="splash10-014i-9000000000-508039bd516ba9b5a8ab") save(peaks, file="peaks-splash10-014i-9000000000-508039bd516ba9b5a8ab.Rdata")
That'd be highly appreciated, please ping me if you need help. Yours, Steffen
Hi all,
Regarding the initial problem in this issue relating to inconsistent histograms, I believe this was due to a missing binning correction factor in splashR. I submitted a PR which should fix this: https://github.com/berlinguyinca/spectra-hash/pull/52
For the second spectrum, I agree with @sneumann that it doesn't seem to be an issue with SPLASH. I tried some variations of intensities and could only produce 014l as the prefilter histogram, with and without the histogram fix I submitted. Hopefully with some more information we can track down that discrepancy.
Best, Sajjan
Hi René,
I am getting the following Validator error:
I checked the file and the actual SPLASH in the file is: splash10-0006-9300000000-5cd70311703e2423a1c5
I ran the code separately and indeed this is the splash I get when I run:
So not only do I not understand where it is getting the SPLASH splash10-0gx3-9000000000-fdf8d511e2f88d17c82e from, I also do not understand why it is computing splash10-0w3u-9000000000-fdf8d511e2f88d17c82e, a different one than I get.