MassBank / MassBank-data

Official repository of open data MassBank records

Different data in NIST vs RIKEN formats #234

Closed abefrandsen closed 10 months ago

abefrandsen commented 11 months ago

Hi! I noticed that the most recent data release (2023.06) contains different data depending on which file I download. In particular, the MassBank_NIST.msp file gives collision energy but the MassBank_RIKEN.msp does not. On the other hand, the MassBank_RIKEN.msp file gives retention time whereas the MassBank_NIST.msp file does not. I find this situation confusing, as it's easy to miss this fact and requires more work to join these two files together to get all the data associated with each record. I suggest either:

  1. ensuring both file formats contain every available data field, or, if that's somehow impossible,
  2. making this distinction between the data formats very clear in the documentation

Any suggestions on how I can ensure I get the full data besides just downloading each file format and joining them together?
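For reference, the "download both and join" workaround can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: records are separated by blank lines, header lines look like `Key: value`, each peak sits on its own line as `m/z intensity`, and both files carry a shared identifier field. The function names `parse_msp` and `merge_by_key` and the choice of identifier field are hypothetical, not part of any MassBank tool.

```python
import re
from itertools import chain
from typing import Dict, Iterable, Iterator


def parse_msp(text: str) -> Iterator[dict]:
    """Yield one dict per MSP record (assumes blank-line-separated records)."""
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n")):
        record = {"peaks": []}
        for line in (ln.strip() for ln in block.splitlines()):
            if not line:
                continue
            key, sep, value = line.partition(":")
            if sep and key and not key[0].isdigit():
                # Header line; keys are upper-cased so both flavours compare equal.
                record[key.strip().upper()] = value.strip()
                continue
            parts = line.split()
            if len(parts) >= 2:
                try:
                    # Assumed peak layout: "m/z intensity", one peak per line.
                    record["peaks"].append((float(parts[0]), float(parts[1])))
                except ValueError:
                    continue  # skip lines that are neither header nor peak
        if len(record) > 1 or record["peaks"]:
            yield record


def merge_by_key(first: Iterable[dict], second: Iterable[dict], key: str) -> Dict[str, dict]:
    """Union the fields of records sharing the same identifier (first value wins)."""
    merged: Dict[str, dict] = {}
    for record in chain(first, second):
        record_id = record.get(key.upper())
        if record_id is None:
            continue  # cannot join a record that lacks the identifier
        target = merged.setdefault(record_id, {})
        for field, value in record.items():
            target.setdefault(field, value)
    return merged
```

Which header field actually holds the shared accession in each file has to be checked against the downloaded files. The `key` argument is deliberately left to the caller.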

tsufz commented 11 months ago

@abefrandsen, thanks for your suggestions. However, the two formats follow the specifications of NIST.msp, for general use in compatible software (e.g. vendors' software, MZmine), and of RIKEN, for MS-DIAL. Hence, the files comply with these formats and differ in content. The msp format is old and thus the range of metadata it can hold is a bit restricted.

Btw., the retention time needs to be used with caution. The mass spectral data was acquired under diverse conditions, and hence the RT is more or less arbitrary information (unless the same gradient is used on a similar column).

I don't know whether you require the data in msp format. But there are different ways to use our data:

  1. You can download the text files and parse them for your purposes.
  2. You can download the json or sql and parse the data from those files.
  3. You can customize our parsers and command line tools for your purposes.

Perhaps you could contribute your code to the community project?

Yours, Tobias

schymane commented 11 months ago

Hi @abefrandsen - please note that the data you want may also be available in the files we send for integration with PubChem. These are available on Zenodo and updated with every release. This DOI will always redirect to the latest version: https://doi.org/10.5281/zenodo.5139996. The current version is here: https://zenodo.org/record/8010009

There are additional columns there which contain the context (chromatography etc) which you can see when you download the files (or look at the preview).

image

This is the file that contains the information you need:

image

sneumann commented 11 months ago

Hi, I agree that this is both inconvenient and confusing. The underlying issue is that we try to stick to the data standard, and in e.g. the NIST specification https://www.nist.gov/system/files/documents/srd/NIST1aVer22Man.pdf (section NIST "Text Format of Individual Spectra") there is no notion of RETENTIONTIME as later introduced into the RIKEN flavour by the MS-Dial team. If you point us to a NIST document that specifies the retention time, we can include it.

Similarly, we were not entirely sure about the collision energy in the RIKEN flavour. Looking at http://prime.psc.riken.jp/compms/msdial/download/msp/MSMS-Neg-RikenOxPLs.msp
I see that COLLISIONENERGY: is (now?) used (where NIST has Collision_energy:...). So future exports in RIKEN flavour might include COLLISIONENERGY: .
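The naming difference mentioned here (`COLLISIONENERGY:` versus `Collision_energy:`) also matters when comparing fields across the two flavours. One possible way to handle it, shown purely as an illustrative sketch with a hypothetical helper name, is to reduce each field name to lower-case letters and digits before comparing:

```python
import re


def canonical_field(name: str) -> str:
    """Reduce a header name to lower-case alphanumerics for comparison."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# Both spellings quoted in this thread collapse to the same canonical key.
same = canonical_field("Collision_energy") == canonical_field("COLLISIONENERGY")
```

This only aligns spelling variants. It does not know whether two differently named fields mean the same thing.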

Also funny how Tobias and Emma replied while I put my reply together :-) Yours, Steffen

schymane commented 11 months ago

I didn't know the details of the two different msps to comment on that ... but there is usually a place for "comment" in the MSP format so this would be another option to add the information if there's no other way. I agree with @abefrandsen that it would seem to me to make sense to include the information in both MSP files if possible (as long as it doesn't break the respective imports) ...

abefrandsen commented 11 months ago

I appreciate the very prompt responses from everyone! Thanks for explaining the underlying issues -- it's somewhat of a shame (in my opinion) that these different data standards aren't flexible enough to support arbitrary data fields (though perhaps the comments field can be used as a catch-all as suggested by @schymane ), but I understand the reason for the difference in the two msp files.

Specifically to @tsufz regarding the json file format -- from what I can tell, the records in the json are also missing key data fields, most conspicuously the peak m/zs and intensities. The only fields I see in the json are as follows:

['@context', '@type', '@id', 'http://purl.org/dc/terms/conformsTo',
       'description', 'identifier', 'keywords', 'license', 'name', 'url',
       'datePublished', 'headline', 'measurementTechnique', 'citation',
       'comment', 'alternateName', 'inChI', 'smiles', 'molecularFormula',
       'monoisotopicMolecularWeight', 'inChIKey']

So it seems the json file also needs to be joined with something else in order to get everything (unless I'm missing something). As far as reading the sql dump, it looked like mariadb is needed to do that, correct? I was having trouble properly installing that on my system :(

@schymane, thanks for pointing me toward the Zenodo release files. However, the csv file you highlighted only contains the top 5 peaks for each MS/MS spectrum.

schymane commented 11 months ago

> @schymane, thanks for pointing me toward the Zenodo release files. However, the csv file you highlighted only contains the top 5 peaks for each MS/MS spectrum.

Yes, this is because PubChem only wants to display the top 5 (and including all peaks makes for a much larger file). The code we use to create that summary file is hyperlinked at the top of the Zenodo record. If you take that script and adjust it to retain all peaks instead of trimming to the top 5, you would get a table with the entire output.

https://gitlab.lcsb.uni.lu/eci/pubchem/-/blob/master/massbank_eu/MassBankEU_Export.R#L187
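To illustrate what "trimming to Top 5" versus "retaining all peaks" means, here is a small Python sketch. It is not a translation of the linked R script. The function name and the `(m/z, intensity)` tuple layout are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); layout assumed for this sketch


def top_n_peaks(peaks: List[Peak], n: Optional[int] = None) -> List[Peak]:
    """Keep the n most intense peaks, or all peaks when n is None.

    The result is returned sorted by m/z.
    """
    if n is None:
        return sorted(peaks)
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    return sorted(strongest)
```

Calling `top_n_peaks(peaks, 5)` mimics a top-5 summary, while `top_n_peaks(peaks)` keeps the full spectrum.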

abefrandsen commented 11 months ago

To be clear, I appreciate all the responses and help. But I want to reiterate: it would be amazing to have a single, language-agnostic, standard text file (like json or csv) that contains ALL available data for every spectrum, and doesn't require the user to invoke a particular R script or some other piece of code to assemble (I'm most definitely not an R user haha). I will make do in the meantime though :)

sneumann commented 11 months ago

One more comment about the JSON: that has the Metadata about the records in the Bioschemas format, and is indeed missing the actual spectral data. This is for findability on data search engines, and not for the MS/MS data itself. But yes, I see your point there. Yours, Steffen

meowcat commented 10 months ago

@abefrandsen: The MassBank format itself, to some degree, aspired to be this format! It is formalized to a degree that other formats are not, though it is still obviously not completely machine-readable. If you just need all the data from the records, you can concatenate all the text files in the repo.

Downside is that it is obviously not supported by that many software packages. Creating a new standard frequently leads to this situation.
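Concatenating the record files can be done without any MassBank tooling. The sketch below assumes the records are the `*.txt` files found somewhere under a local checkout of the data repo. The function name and the file-matching rule are assumptions, and any non-record `.txt` files in the checkout would be swept up as well.

```python
from pathlib import Path


def concatenate_records(data_dir: str, output_file: str) -> int:
    """Write every *.txt file under data_dir into output_file; return the file count."""
    source = Path(data_dir)
    target = Path(output_file)
    count = 0
    with target.open("w", encoding="utf-8") as out:
        for path in sorted(source.rglob("*.txt")):
            if path.resolve() == target.resolve():
                continue  # do not copy the output file into itself
            text = path.read_text(encoding="utf-8", errors="replace")
            # Make sure each record ends with a newline before the next one starts.
            out.write(text if text.endswith("\n") else text + "\n")
            count += 1
    return count
```

The result is one large text file in MassBank record format, so it still needs a parser for that format on the consuming side.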

meier-rene commented 10 months ago

Hi @abefrandsen, after all the comments on why we will probably not change the existing MSP files, I would like to point you to a way to get all the data in one language-agnostic file. There is an exporter in the code which can export all information to a json file. We have this only for internal purposes and it comes without any support. Here are the steps:

  1. Get the https://github.com/MassBank/MassBank-web and https://github.com/MassBank/MassBank-data repos.
  2. Go to MassBank-web/MassBank-Project and compile with mvn package.
  3. Export everything to json with ./MassBank-lib/target/MassBank-lib/MassBank-lib/bin/RecordExporter -f json -o MassBank.json <your MassBank data folder>

I will try to attach the exported json this one time, if GitHub allows a file of that size.

meier-rene commented 10 months ago

The exported json is too big for issue comments. Because I have the file, I will provide it this one time as an attachment to the release: https://github.com/MassBank/MassBank-data/releases/download/2023.06/MassBank.zip

If it suits your needs, you will need to create that file yourself for future releases. Please note that we have so-called deprecated records in our data repo. They were deleted for a reason, but stay in the repo as tombstones to prevent reuse of the accession. I would suggest removing them manually before you export the data.
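Finding the deprecated records by hand can be tedious, so here is a rough Python sketch for listing them. It assumes that a deprecated record is a `*.txt` file containing a line that starts with `DEPRECATED:`. That marker and both function names are assumptions to verify against the record format documentation and a few real files before relying on the result.

```python
from pathlib import Path
from typing import List, Tuple


def is_deprecated(record_text: str) -> bool:
    """True if any line starts with the assumed 'DEPRECATED:' marker."""
    return any(line.startswith("DEPRECATED:") for line in record_text.splitlines())


def split_records(data_dir: str) -> Tuple[List[Path], List[Path]]:
    """Return (active, deprecated) record paths found under data_dir."""
    active: List[Path] = []
    deprecated: List[Path] = []
    for path in sorted(Path(data_dir).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        (deprecated if is_deprecated(text) else active).append(path)
    return active, deprecated
```

The sketch only reports paths and deletes nothing, so the deprecated list can be reviewed before anything is moved out of the export folder.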