MassBank / MassBank-data

Official repository of open data MassBank records

sqlite as export format for MassBank-data #32

Closed (sneumann closed this issue 3 years ago)

sneumann commented 5 years ago

Hi, @Tomnl has updated his code to convert MassBank records to a sqlite database:

I have tidied up the MSP to SQLite Python code and included it as a separate Python package available through pip; see the docs at https://msp2db.readthedocs.io/en/latest/ and the code at https://github.com/computational-metabolomics/msp2db

The code can be used as a CLI or an API to create an SQLite database from MSP files. By default, it can work with the MSP format found either in the MassBank GitHub repository or in MoNA. You just need to assign either "massbank" or "mona" to the schema parameter.

...

I have updated the documentation https://bioconductor.org/packages/devel/bioc/vignettes/msPurity/inst/doc/msPurity-spectral-matching-vignette.html for msPurity in Bioconductor (development branch). It includes a reference to the msp2db documentation and describes the databases in more detail.

I have created SQLite databases locally from MassBank and MoNA. I am in the process of getting a suitably sized, updated SQLite file for the msPurityData Bioconductor data package.

Please let me know if you have any questions, and I will keep you informed of any other developments.

It would be great to distribute snapshots of MassBank-data in such a format. Yours, Steffen

Tomnl commented 5 years ago

Hi @sneumann and massbank-data contributors,

Happy to help where I can for this.

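If you are interested in the database schema for the library, see here: https://bioconductor.org/packages/devel/bioc/vignettes/msPurity/inst/doc/msPurity-spectral-matching-vignette.html#4_library_database_schema (if desired, I can update it to add new columns, change names, etc.)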

berlinguyinca commented 5 years ago

thanks!

sneumann commented 5 years ago

And here is the start of a one-liner to convert, so far without -v volumes and stuff:

docker run --rm -it ubuntu:18.04 sh -c 'apt update ; apt install -y python-pip git ; git clone https://github.com/MassBank/MassBank-data; pip install msp2db; msp2db -msp_pth MassBank-data -name MassBank -source massbank -o /tmp'

Tomnl commented 5 years ago

Just updated the one-liner to ensure the correct MSP regular expressions are used with msp2db (-schema massbank):

docker run --rm -it ubuntu:18.04 sh -c 'apt update ; apt install -y python-pip git ; git clone https://github.com/MassBank/MassBank-data; pip install msp2db; msp2db -msp_pth MassBank-data -name MassBank -source massbank -schema massbank -o /tmp'

Tomnl commented 5 years ago

Hi all,

I have added the SQLite database of MassBank to the assets of the github release of msp2db. See file massbank_12122018.db

Also, I updated the command-line calls to be a bit cleaner:

docker run --rm -it ubuntu:18.04 sh -c 'apt update ; apt install -y python-pip git ; git clone https://github.com/MassBank/MassBank-data; pip install msp2db; msp2db --msp_pth MassBank-data --source massbank --schema massbank --out_pth /tmp/massbank.db'

The release also includes an SQLite database representation of the MoNA MSP files.

I can continue maintaining the databases on the msp2db github for now but happy to change if we find a better location to store the database files.
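For anyone who wants a quick look inside one of these database files, here is a minimal sketch using only Python's standard sqlite3 module. It deliberately assumes nothing about the table names that msp2db creates: it reads them from the file itself and counts the rows in each. The function name summarize_sqlite is made up for this illustration and is not part of msp2db.

```python
import sqlite3


def summarize_sqlite(db_path):
    """Return {table_name: row_count} for every table in an SQLite file.

    Hypothetical helper for illustration only; it is not part of msp2db.
    """
    con = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        # The names come from the catalogue itself, so double-quoting
        # them is sufficient for this read-only summary.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

Calling summarize_sqlite("/tmp/massbank.db") on the file written by the one-liner above should return a dictionary of table names and row counts. Note that sqlite3.connect creates an empty file if the path does not exist, so check the path first.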

tsufz commented 5 years ago

Hi, this is much appreciated! Thanks a lot. In the future, we plan to store and release derived DBs in different formats (NIST, SQLite, etc.) at MassBank-data, but at the moment it is great to keep them in your repository. Thanks!

@meier-rene, @sneumann and @schymane, we may add some external links for DB download in the Readme?

sneumann commented 4 years ago

Hi, I just checked https://github.com/computational-metabolomics/msp2db/releases/ where the SQLite converted MassBank data is included. We need to decide whether we ping msp2db about every release in https://github.com/MassBank/MassBank-data/releases so they can release updated snapshots. Yours, Steffen

sneumann commented 3 years ago

Hi @jorainer, in this issue are a few pointers to the SQLite database that is in msPurity. Would there be a chance that your developments on SQLite cover the use cases implemented in Birmingham? Or even recycle parts of that? Yours, Steffen

jorainer commented 3 years ago

That's a good point, @sneumann! I'll have a look at the msPurity database layout (maybe you could point me to the info, @Tomnl?). In general, the CompDb database layout is super-simple. I just have the tables compound, msms_spectrum and msms_spectrum_peak, with very few constraints, to accommodate data from all the various sources.

Tomnl commented 3 years ago

Hi both,

Are you planning on creating a standard MS/MS format for library spectra in SQL?

I think the databases for msPurity and msp2db follow a similar structure to CompDb, i.e. three main tables: a compound table, a table for the spectrum peaks (e.g. mz, intensity, etc.) and a table for the spectrum as a whole (e.g. precursor mz, fragmentation level, energy, etc.).
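To make that three-table structure concrete, here is a rough sketch in Python with sqlite3. All table names, column names and values are invented for the illustration; they are not the actual msPurity, msp2db or CompDb schema, which have more fields and different names.

```python
import sqlite3

# Hypothetical three-table layout: one row per compound, one row per
# spectrum (metadata), and one row per peak. All names are assumptions.
SCHEMA = """
CREATE TABLE compound (
    compound_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE spectrum (
    spectrum_id      INTEGER PRIMARY KEY,
    compound_id      INTEGER REFERENCES compound(compound_id),
    precursor_mz     REAL,
    ms_level         INTEGER,
    collision_energy TEXT
);
CREATE TABLE spectrum_peak (
    spectrum_id INTEGER REFERENCES spectrum(spectrum_id),
    mz          REAL,
    intensity   REAL
);
"""


def build_demo_db():
    """Create an in-memory database holding one made-up spectrum."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    # Toy values only; they do not describe a real compound.
    con.execute("INSERT INTO compound VALUES (1, 'toy compound')")
    con.execute("INSERT INTO spectrum VALUES (1, 1, 120.0, 2, '20 eV')")
    con.executemany(
        "INSERT INTO spectrum_peak VALUES (?, ?, ?)",
        [(1, 75.5, 100.0), (1, 50.0, 20.0)],
    )
    return con


def peaks_for_compound(con, name):
    """Return (mz, intensity) pairs for a compound, ordered by mz."""
    return con.execute(
        """
        SELECT p.mz, p.intensity
        FROM compound AS c
        JOIN spectrum AS s ON s.compound_id = c.compound_id
        JOIN spectrum_peak AS p ON p.spectrum_id = s.spectrum_id
        WHERE c.name = ?
        ORDER BY p.mz
        """,
        (name,),
    ).fetchall()
```

Keeping peaks in their own table means a spectrum with any number of peaks fits without changing the schema, and compound-level queries only need the two joins shown in peaks_for_compound.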

For spectral matching I originally made different schemas for the "library" and "query" databases. But they share essentially the same basic structure and can be used interchangeably in msPurity. They just have slightly different table names and additional fields for the query spectra. See the "library" database schema code and the more extensive "query" database schema (that includes XCMS mapping as well).

In hindsight I probably should have followed the schema already developed in mzdb... but perhaps that schema was too complex for what was needed

jorainer commented 3 years ago

Thanks, Thomas (@Tomnl), for your quick reply!

Actually, I'm not trying to define a standard format - I think that might be too complicated, and we will still need different layouts for different purposes. I think it's better if we still allow different database layouts, but then maybe have a shared interface to them. Here's where the Spectra package comes into play. That package provides basic MS spectra processing and handling functionality but, more importantly, allows the use of different backends to represent or provide the data. For the user it thus does not matter where the data comes from (see also here for a short tutorial illustrating that). I'm currently implementing, e.g., a backend for MassBank with which one could directly access spectra data from MassBank. Btw - maybe you would like to contribute some of the functionality from msPurity to Spectra? Or add some functionality you found useful in the processing of MS2 data?

The CompDb layout should just facilitate sharing of e.g. public spectra (and compound) databases via e.g. Bioconductor's AnnotationHub (where already genome and genetic annotations are shared).
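The "different layouts, shared interface" idea can be sketched roughly as follows. This is a Python illustration of the general pattern only and not the API of the Spectra package (which is an R/Bioconductor package); every class, method and table name here is an assumption made up for the example.

```python
import sqlite3
from typing import Dict, List, Protocol, Tuple

Peak = Tuple[float, float]  # (mz, intensity)


class SpectraBackend(Protocol):
    """Hypothetical minimal interface that every backend provides."""

    def spectrum_ids(self) -> List[int]: ...

    def peaks(self, spectrum_id: int) -> List[Peak]: ...


class MemoryBackend:
    """Keeps all peaks in a plain dictionary."""

    def __init__(self, data: Dict[int, List[Peak]]):
        self._data = data

    def spectrum_ids(self) -> List[int]:
        return sorted(self._data)

    def peaks(self, spectrum_id: int) -> List[Peak]:
        return sorted(self._data[spectrum_id])


class SqliteBackend:
    """Reads peaks from an SQLite table (table layout is an assumption)."""

    def __init__(self, con: sqlite3.Connection):
        self._con = con

    @classmethod
    def from_peaks(cls, data: Dict[int, List[Peak]]) -> "SqliteBackend":
        # Build a throwaway in-memory database for demonstration.
        con = sqlite3.connect(":memory:")
        con.execute(
            "CREATE TABLE spectrum_peak "
            "(spectrum_id INTEGER, mz REAL, intensity REAL)"
        )
        con.executemany(
            "INSERT INTO spectrum_peak VALUES (?, ?, ?)",
            [(sid, mz, inten) for sid, pks in data.items() for mz, inten in pks],
        )
        return cls(con)

    def spectrum_ids(self) -> List[int]:
        rows = self._con.execute(
            "SELECT DISTINCT spectrum_id FROM spectrum_peak ORDER BY spectrum_id"
        )
        return [row[0] for row in rows]

    def peaks(self, spectrum_id: int) -> List[Peak]:
        return self._con.execute(
            "SELECT mz, intensity FROM spectrum_peak "
            "WHERE spectrum_id = ? ORDER BY mz, intensity",
            (spectrum_id,),
        ).fetchall()


def base_peak(backend: SpectraBackend, spectrum_id: int) -> Peak:
    """Most intense peak; works unchanged with any backend."""
    return max(backend.peaks(spectrum_id), key=lambda peak: peak[1])
```

Code such as base_peak only talks to the interface, so it gives the same answer whether the peaks sit in memory or in a database file, which is the point of hiding the layout behind a backend.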

tsufz commented 3 years ago

I think, we solved that with the SQL export.