MassBank / MassBank-data

Official repository of open data MassBank records

Have GitHub and Zenodo releases synchronized #238

Open Adafede opened 10 months ago

Adafede commented 10 months ago

Hi,

Thank you for all the effort you have put into MassBank! I was trying to access its data and realized that https://github.com/MassBank/MassBank-data/releases and https://doi.org/10.5281/zenodo.3378723 are not in sync.

This can be easily done by following https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content.

This way, each GitHub release is archived on Zenodo and automatically gets its own DOI.

Hope this makes sense!

meier-rene commented 10 months ago

Thank you for bringing this to our attention. An automatic procedure should be in place, but apparently it's not working atm. I will look into this.

meier-rene commented 10 months ago

I just checked and didn't find any differences. Could you please explain your finding in a bit more detail? What I did:

Adafede commented 10 months ago

Wow, this is a fast reply!

I actually found the different json/sql/msp files available in releases/tag/2023.06 very convenient, and they do not seem to appear on Zenodo, but maybe I missed something?

P.S.: Is there any reason for providing an sql file but no sqlite file, which would be directly readable by MsBackendMassbank? (Or did I miss something again here?)

meier-rene commented 10 months ago

Yes, you are right. Zenodo only covers the txt files. That's a result of GitHub's automatic Zenodo release procedure. I don't know how to automatically attach the other release artifacts to the Zenodo release.
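To check which release artifacts are missing from a given Zenodo record, the file lists on both sides can be compared. The following is only a minimal sketch: the helper names are hypothetical, the Zenodo record id has to be supplied by the reader, and the assumption that the Zenodo record JSON lists its files under `files` with a `key` field should be verified against the Zenodo API documentation. The two fetchers need the jsonlite package and network access; the comparison itself is plain base R.

```r
# Sketch: which GitHub release assets are absent from a Zenodo record?
# All helper names are hypothetical, not part of any package.

# Asset file names attached to one GitHub release (public REST API).
github_asset_names <- function(repo, tag) {
    url <- sprintf("https://api.github.com/repos/%s/releases/tags/%s", repo, tag)
    release <- jsonlite::fromJSON(url)
    as.character(release$assets$name)
}

# File names stored in one Zenodo record.
# Assumption: the record JSON lists files under `files`, each with a `key`.
zenodo_file_names <- function(record_id) {
    url <- sprintf("https://zenodo.org/api/records/%s", record_id)
    record <- jsonlite::fromJSON(url)
    as.character(record$files$key)
}

# Pure comparison step: names present on GitHub but not on Zenodo.
missing_on_zenodo <- function(github_assets, zenodo_files) {
    setdiff(github_assets, zenodo_files)
}
```

With these in place, `missing_on_zenodo(github_asset_names("MassBank/MassBank-data", "2023.06"), zenodo_file_names(record_id))` would list the artifacts that only exist on GitHub, where `record_id` is the id of the Zenodo record for that release.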

For your second question I have no answer atm. The sql file is released for the MsBackendMassbank package, but we did not put too much effort into it. It's basically a dump of our internal data structure. Maybe this sql file needs to be processed into an sqlite file? I need to do some research. Maybe @jorainer didn't want to create additional workload on our side? I found this script: https://github.com/rformassspectrometry/MsBackendMassbank/blob/main/inst/scripts/massbank-to-sqlite.R. If that's the case, we can probably modify our scripts to create the sqlite artifact instead of the sql file.
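As a rough illustration of what such a conversion could involve, the sketch below copies every table from one DBI connection into another. It is not the linked script: `copy_all_tables` is a hypothetical helper, and it assumes the SQL dump has already been loaded into a database server that R can reach through DBI. It copies data only (no indexes or keys), and whether the result matches the layout MsBackendMassbank expects would still need to be checked.

```r
# Sketch: copy every table from one DBI connection into another.
# `copy_all_tables` is a hypothetical helper, not taken from the linked script.
copy_all_tables <- function(src, dest) {
    tables <- DBI::dbListTables(src)
    for (tbl in tables) {
        # Read the whole table into memory, then write it to the target.
        DBI::dbWriteTable(dest, tbl, DBI::dbReadTable(src, tbl), overwrite = TRUE)
    }
    invisible(tables)
}
```

Here `src` would be a connection to the server holding the imported dump (for example one opened with RMariaDB), and `dest` a connection opened with `DBI::dbConnect(RSQLite::SQLite(), path)`, where `path` is the sqlite file to create.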

Adafede commented 10 months ago

👍🏼 The different "ready-to-use" files would be a plus on Zenodo (I also don't know how to attach artifacts to Zenodo releases automatically... will search a bit and come back if I find something). I was also using the nice script by @jorainer, and probably many of us out there do the same... so generating the sqlite directly would indeed add some work on your side, but would avoid that work being replicated many times elsewhere.

jorainer commented 10 months ago

Note: my preferred way to access/use MassBank data in R is through AnnotationHub:

```r
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "MassBank")
#> AnnotationHub with 3 records
#> # snapshotDate(): 2023-06-23
#> # $dataprovider: MassBank
#> # $species: NA
#> # $rdataclass: CompDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH107048"]]'
#>
#>              title
#>   AH107048 | MassBank CompDb for release 2021.03
#>   AH107049 | MassBank CompDb for release 2022.06
#>   AH111334 | MassBank CompDb for release 2022.12.1
```

So, as of now, these 3 releases are available through AnnotationHub. To use one of them:

```r
mb <- ah[["AH107049"]]
mb
#> class: CompDb
#>  data source: MassBank
#>  version: 2022.06
#>  organism: NA
#>  compound count: 90190
#>  MS/MS spectra count: 90190
```

This CompDb can be used directly with Spectra (i.e. Spectra(mb) would get you all MS2 spectra). Besides being available through AnnotationHub, the resource (sqlite file) is also cached locally: it is downloaded on first use, and any subsequent use loads it from the local cache.
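Put together, loading one release and getting its spectra could look like the sketch below. The wrapper `massbank_spectra` is a hypothetical helper, not part of any package; it assumes the AnnotationHub, CompoundDb and Spectra packages are installed and that the hub can be reached (or the resource is already cached).

```r
# Hypothetical helper: fetch one MassBank CompDb from AnnotationHub and
# expose its MS/MS spectra as a Spectra object.
massbank_spectra <- function(ah_id = "AH107049") {
    # CompoundDb provides the CompDb class and its Spectra() method.
    stopifnot(requireNamespace("CompoundDb", quietly = TRUE))
    ah <- AnnotationHub::AnnotationHub()  # connects to the hub
    mb <- ah[[ah_id]]                     # downloaded once, then read from the cache
    Spectra::Spectra(mb)
}
```

Calling `sps <- massbank_spectra()` and then `length(sps)` would show how many spectra were loaded; passing another id from the query result (e.g. `"AH111334"`) selects a different release.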

There is, however, a manual step involved, since I need to convert the MassBank data structures into a CompDb SQLite (using this script) and then also upload and maintain these releases in Bioconductor's AnnotationHub... but I think that this should simplify usage of MassBank in R tremendously. The long-term goal is to also provide other annotation resources (as CompDb?) through AnnotationHub...