Closed abefrandsen closed 10 months ago
@abefrandsen, thanks for your suggestions. However, the two formats follow the NIST msp specification for general use in compatible software (e.g. vendors' software, MZmine) and the RIKEN specification for MS-DIAL. Hence, the files are compliant with these formats and have different content. The msp format is old, and thus its metadata domain is a bit restricted.
Btw., the retention time needs to be used with caution. The mass spectral data were acquired under diverse conditions, and hence the RT is more or less arbitrary information (unless the same gradient on a similar column is used).
I dunno if you require the data in msp format. But there are different ways to use our data:
You may contribute your code to the community project?
Yours, Tobias
Hi @abefrandsen - please note that the data you want may also be available in the files we send for integration with PubChem, these are available on Zenodo and updated with every release. This DOI will always redirect to the latest version: https://doi.org/10.5281/zenodo.5139996 Current version is here: https://zenodo.org/record/8010009
There are additional columns there which contain the context (chromatography etc) which you can see when you download the files (or look at the preview).
This is the file that contains the information you need:
Hi, I agree that this is both inconvenient and confusing. The underlying issue is that we try to stick to the data standard, and in e.g. the NIST specification https://www.nist.gov/system/files/documents/srd/NIST1aVer22Man.pdf (section NIST "Text Format of Individual Spectra") there is no notion of RETENTIONTIME as later introduced into the RIKEN flavour by the MS-DIAL team. If you point us to a NIST document that specifies the retention time, we can include it.
Similarly, we were not entirely sure about the collision energy in the RIKEN flavour. Looking at
http://prime.psc.riken.jp/compms/msdial/download/msp/MSMS-Neg-RikenOxPLs.msp
I see that COLLISIONENERGY: is (now?) used (where NIST has Collision_energy: ...). So future exports in RIKEN flavour might include COLLISIONENERGY:.
Also funny how Tobias and Emma replied while I put my reply together :-) Yours, Steffen
I didn't know the details of the two different msps to comment on that ... but there is usually a place for "comment" in the MSP format so this would be another option to add the information if there's no other way. I agree with @abefrandsen that it would seem to me to make sense to include the information in both MSP files if possible (as long as it doesn't break the respective imports) ...
I appreciate the very prompt responses from everyone! Thanks for explaining the underlying issues -- it's somewhat of a shame (in my opinion) that these different data standards aren't flexible enough to support arbitrary data fields (though perhaps the comments field can be used as a catch-all as suggested by @schymane), but I understand the reason for the difference in the two msp files.
Specifically to @tsufz regarding the json file format -- from what I can tell, the records in the json are also missing key data fields, most conspicuously the peak m/zs and intensities. The only fields I see in the json are as follows:
['@context', '@type', '@id', 'http://purl.org/dc/terms/conformsTo',
'description', 'identifier', 'keywords', 'license', 'name', 'url',
'datePublished', 'headline', 'measurementTechnique', 'citation',
'comment', 'alternateName', 'inChI', 'smiles', 'molecularFormula',
'monoisotopicMolecularWeight', 'inChIKey']
So it seems the json file also needs to be joined with something else in order to get everything (unless I'm missing something). As for reading the sql dump, it looked like mariadb is needed to do that, correct? I was having trouble properly installing that on my system :(
@schymane, thanks for pointing me toward the Zenodo release files. However, the csv file you highlighted only contains the top 5 peaks for each MS/MS spectrum.
Yes, this is because PubChem only wants to display the top 5 (and including all peaks makes for a much larger file). The code we use to create that summary file is hyperlinked at the top of the Zenodo record. If you take that script and adjust it to retain all peaks instead of trimming to the top 5, you get a table with the entire output.
https://gitlab.lcsb.uni.lu/eci/pubchem/-/blob/master/massbank_eu/MassBankEU_Export.R#L187
To be clear, I appreciate all the responses and help. But I want to reiterate: it would be amazing to have a single, language-agnostic, standard text file (like json or csv) that contains ALL available data for every spectrum, and doesn't require the user to invoke a particular R script or some other piece of code to assemble (I'm most definitely not an R user haha). I will make do in the meantime though :)
One more comment about the JSON: it holds the metadata about the records in the Bioschemas format, and is indeed missing the actual spectral data. It is there for findability on data search engines, not for the MS/MS data itself. But yes, I see your point there. Yours, Steffen
@abefrandsen: The MassBank format itself, to some degree, aspired to be this format! It is formalized to a degree that other formats are not, though it is still obviously not completely machine-readable. If you just need all the data from the records, you can concatenate all the text files in the repo.
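For readers who go the route of reading the record text files directly, here is a minimal Python sketch of pulling fields and peaks out of one record. It assumes the usual layout of a MassBank record: `TAG: value` lines, a `PK$PEAK:` header followed by peak rows indented by two spaces, and `//` as the end-of-record marker. The function name and the example record are made up for illustration, and indented lines of other multi-line sections are simply skipped.

```python
def parse_massbank_record(text):
    """Parse one MassBank record into tag/value fields and a peak list."""
    fields = {}   # tag -> list of values (tags such as AC$CHROMATOGRAPHY repeat)
    peaks = []    # (m/z, intensity) tuples
    in_peaks = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("PK$PEAK:"):
            in_peaks = True  # the indented rows that follow are peaks
            continue
        if in_peaks and line.startswith("  "):
            parts = line.split()
            peaks.append((float(parts[0]), float(parts[1])))
            continue
        in_peaks = False
        if ": " in line:
            tag, value = line.split(": ", 1)
            fields.setdefault(tag, []).append(value)
    return {"fields": fields, "peaks": peaks}


# Invented record, for illustration only (not a real MassBank accession).
EXAMPLE_RECORD = """\
ACCESSION: MSBNK-Example-XX000001
RECORD_TITLE: Example compound; LC-ESI-QTOF; MS2
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 20 eV
AC$CHROMATOGRAPHY: RETENTION_TIME 5.3 min
PK$NUM_PEAK: 2
PK$PEAK: m/z int. rel.int.
  100.0 500.0 500
  150.5 999.0 999
//
"""

record = parse_massbank_record(EXAMPLE_RECORD)
```

Values are kept as lists because several tags can occur more than once in a record; both the collision energy and the retention time end up in the same dict this way.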
Downside is that it is obviously not supported by that many software packages. Creating a new standard frequently leads to this situation.
Hi @abefrandsen, after all the comments on why we will probably not change the existing MSP files, I would like to point you to a way to get all the data in one language-agnostic file. There is an exporter in the code which can export all information to a json file. We have this just for internal purposes and it comes without any support. Here are the steps:
- get the https://github.com/MassBank/MassBank-web and https://github.com/MassBank/MassBank-data repos
- go to MassBank-web/MassBank-Project and compile with mvn package
- now export everything to json with ./MassBank-lib/target/MassBank-lib/MassBank-lib/bin/RecordExporter -f json -o MassBank.json <your MassBank data folder>
I will try to attach the exported json this one time, if GitHub allows that size.
The exported json is too big for issue comments. Because I have the file, I will provide it one time as an attachment to the release: https://github.com/MassBank/MassBank-data/releases/download/2023.06/MassBank.zip
If it suits your needs, you will need to create that file for future releases on your own. Please note that we have so-called deprecated records in our data repo. They were deleted for a reason, but stay in the repo as a tombstone to prevent reuse of the accession. I would suggest removing them manually before you export the data.
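One possible way to find the deprecated records before exporting is sketched below in Python. It rests on two assumptions that should be checked against the data repo: that a deprecated record carries a line starting with `DEPRECATED:`, and that records are stored as `.txt` files under the data folder. The function names are hypothetical and not part of the MassBank code.

```python
from pathlib import Path


def is_deprecated(record_text):
    """Return True if the record text carries a DEPRECATED: line (assumed marker)."""
    return any(line.startswith("DEPRECATED:") for line in record_text.splitlines())


def active_record_paths(data_dir):
    """Yield paths of record files under data_dir that are not deprecated."""
    for path in sorted(Path(data_dir).rglob("*.txt")):  # assumes .txt record files
        if not is_deprecated(path.read_text(encoding="utf-8")):
            yield path
```

The sketch only lists the records to keep. Copying them into a clean folder and pointing the exporter at that folder is left to the reader.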
Hi! I noticed that the most recent data release (2023.06) contains different data depending on which file I download. In particular, the MassBank_NIST.msp file gives collision energy but the MassBank_RIKEN.msp does not. On the other hand, the MassBank_RIKEN.msp file gives retention time whereas the MassBank_NIST.msp file does not. I find this situation confusing, as it's easy to miss this fact and requires more work to join these two files together to get all the data associated with each record. I suggest either:
Any suggestions on how I can ensure I get the full data besides just downloading each file format and joining them together?
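As a stopgap, the join of the two msp files could look roughly like the Python sketch below. It assumes records are separated by blank lines, header lines have the form `Key: value`, and each peak sits on its own line as whitespace-separated m/z and intensity. Which field identifies the same record in both files is not something this sketch knows, so `key_of` is a caller-supplied function, and the `ID` field in the sample data is invented for illustration.

```python
import re


def parse_msp(text):
    """Split msp text into records with upper-cased header fields and peaks."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        fields, peaks = {}, []
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if ":" in line and not line[0].isdigit():
                key, value = line.split(":", 1)
                fields[key.strip().upper()] = value.strip()
            else:
                mz, intensity = line.split()[:2]  # assumes one peak per line
                peaks.append((float(mz), float(intensity)))
        records.append({"fields": fields, "peaks": peaks})
    return records


def merge_by_key(nist_records, riken_records, key_of):
    """Add RIKEN-only fields to each NIST record that shares the same key."""
    riken_by_key = {key_of(r): r for r in riken_records}
    merged = []
    for rec in nist_records:
        other = riken_by_key.get(key_of(rec))
        fields = dict(other["fields"]) if other else {}
        fields.update(rec["fields"])  # NIST values win on overlapping keys
        merged.append({"fields": fields, "peaks": rec["peaks"]})
    return merged


# Invented sample data; the ID field is a placeholder for whatever key is shared.
nist = parse_msp("Name: A\nID: 1\nCollision_energy: 20\nNum Peaks: 1\n100.0 999\n")
riken = parse_msp("NAME: A\nID: 1\nRETENTIONTIME: 5.3\nNum Peaks: 1\n100.0 999\n")
merged = merge_by_key(nist, riken, lambda r: r["fields"]["ID"])
```

Upper-casing the keys is a deliberate simplification so that `Name` and `NAME` collapse into one field. Fields spelled differently in the two flavours, such as `Collision_energy` and `COLLISIONENERGY`, still remain separate keys.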