MassBank / MassBank-data

Official repository of open data MassBank records
68 stars 55 forks source link

Provide MassBank as SQLite database #181

Open jorainer opened 2 years ago

jorainer commented 2 years ago

For some users a MySQL database might be a little to demanding (see also https://github.com/sneumann/xcms/issues/534). Would be nice to provide the MassBank data in addition as an SQLite database. This would be ~ straight forward:

Convert the MySQL dump to SQLite SQL calls using mysql2sqlite

./mysql2sqlite MassBank.sql | sqlite3 MassBank-2021.03.db

The only problem is that the views are not correctly converted (actually, seems mysql2sqlite ignores them completely). To fix that one could simply insert the views later using:

sqlite3 MassBank-2021.03.db < views.sql 

Where the views.sql file is simply a file containing the create view... calls from the 01-init-massbank.sql script.

With this SQLite database it would be super-easy to use the MassBank data in R:

library(MsBackendMassbank)
library(RSQLite)
con <- dbConnect(SQLite(), "MassBank-2021.03.db")
sps <- Spectra(con, source = MsBackendMassbankSql())
sps
MSn data (Spectra) with 86576 spectra in a MsBackendMassbankSql backend:
        msLevel precursorMz  polarity
      <integer>   <numeric> <integer>
1             2      179.07         1
2             2      179.07         1
...         ...         ...       ...
86575        NA          NA         0
86576        NA          NA         0
 ... 42 more variables/columns.
 Use  'spectraVariables' to list all of them.

so, one would have full access to MassBank (cc @tsufz ).

YANGJJ93research commented 2 years ago

Dear @jorainer, I was wondering if where is the massbank sql data from? Is it the compilation of all spectra as provided on the massbank web page? I am asking this for the sake of confidence in my target identification.

tsufz commented 2 years ago

Hey @YangjjMSresearch, you can find the massbank.sql at our GitHub site. It is available beginning with database version 2020.11. We usually announce new releases by Twitter or the MassBank Europe website or you can watch the MassBank data repository.

Best wishes, Tobias

YANGJJ93research commented 2 years ago

@tsufz Noted with many thanks! I noticed that there are around 80000 spectra founded from the massbank.sql database. From the MoNA website, I see there are 196,159 spectra inside. Is this massbank.sql repository different from the MoNA repository?

Best regards, Junjie

tsufz commented 2 years ago

@YangjjMSresearch. I reviewed MassBank of America, They actually provide 73 k MassBank Europe records and in total 175 k GC -and LC-records. MoNA and MassBank hold different datasets. MassBank Europe records are only a part of it among GNPS, HMDB, ReSPECT and others. Thus, the contents of MassBank Europe and MoNA are different. I am also not sure about the update frequency of MoNA.

The structure of the sql files are also different. We provide the dump of our internal database. MoNA provides a dump of their database. @jorainer may explain, it if the MoNA sql files are also usuable.

If you want to use MassBank Europe records only, use our databases, please. They guarantee reproducibility, as we provide versioned releases. In addition, you may use MoNA sql files containing different other libraries as appropriate.

Best wishes, Tobias

jorainer commented 2 years ago

I've never looked into MoNa sql files - I was even not aware that they provide their data as SQL. Maybe I have a look into that someday

meowcat commented 2 years ago

Hi all, I still think it would be great to serve the SQLite that can be used with RforMassSpectrometry directly, otherwise the users have to do the conversion themselves...

jorainer commented 2 years ago

Agree @meowcat ! note that I have one pre-build database here: https://github.com/jorainer/SpectraTutorials/releases/tag/2021.03 . I've also included a super-simple short function to MsBackendMassbank that extracts the databases tables from the MySQL MassBank and stores it into an SQLite database: https://github.com/rformassspectrometry/MsBackendMassbank/blob/master/inst/scripts/massbank-to-sqlite.R

My other plan is (if I finally find the time) to create such MassBank SQLite versions (maybe as CompDb databases, because they use the same database layout) for each new MassBank and distribute them via Bioconductor's AnnotationHub. That would make it super-simple for users to search for and install any MassBank release.

meowcat commented 1 year ago

Hi, I made a small converter that takes MassBank dumps and converts to SQLite as well as mzVault formats. It's a docker and has no external requirements.

https://github.com/meowcat/MassBank-convert

Note that the mzVault converter collapses compound information by InChIKey. A further function which isn't working well collapses by 1D inchikey, but is deactivated in the config.

Can we add that to some CI that generates "best-effort community-contributed conversions"?

sneumann commented 1 year ago

+1 on "best-effort community-contributed conversions", the OpenMS Team has an mzML converter mentioned in https://github.com/MassBank/MassBank-data/issues/31. Yours, Steffen