Any plans to normalize the database?

jorainer commented 3 years ago

I was just wondering if there are any plans to normalize the database, i.e. to reduce the redundancy in the COMPOUND table @sneumann @tsufz @meier-rene ? As far as I've seen there is one entry in the COMPOUND table for each entry in the RECORD table (i.e. the spectra).

meier-rene commented 3 years ago

You are right. Every RECORD creates an entry in COMPOUND, and this is a conceptual mistake, but fixing it has very low priority on my side. It has some historical origin that I don't know.

tsufz commented 3 years ago

Yes, I think this is a backlog issue from previous development. The compound name is not yet standardised, and thus each record may be unique in terms of "any name". We were already thinking about a preferred MassBank name which would then be shown in the records and on the website. However, as @meier-rene said, this does not have very high priority because it needs some (painful) curation. I have it somewhere down on my to-do list, but @schymane may also have an opinion. We may have an issue open...
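To make the redundancy discussed above concrete, here is a minimal sketch in Python with SQLite. All table and column names (`compound_flat`, `compound`, `record`, `inchikey`, `preferred_name`) are hypothetical and do not reflect the actual MassBank schema. The sketch also assumes that a structure identifier such as an InChIKey is available to group on, since the names themselves are not standardised; the key values are placeholders.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Hypothetical denormalized layout: one compound row per record.
CREATE TABLE compound_flat (
    record_id TEXT PRIMARY KEY,
    name      TEXT,
    inchikey  TEXT   -- placeholder values below, not real InChIKeys
);
INSERT INTO compound_flat VALUES
    ('REC0001', 'Caffeine',                'KEY-CAFFEINE'),
    ('REC0002', '1,3,7-Trimethylxanthine', 'KEY-CAFFEINE'),
    ('REC0003', 'Atrazine',                'KEY-ATRAZINE');

-- Normalized layout: one row per distinct structure.
CREATE TABLE compound (
    compound_id    INTEGER PRIMARY KEY,
    inchikey       TEXT UNIQUE,
    preferred_name TEXT
);
-- Each record points to its compound and keeps the name it reported.
CREATE TABLE record (
    record_id     TEXT PRIMARY KEY,
    compound_id   INTEGER REFERENCES compound(compound_id),
    reported_name TEXT
);

-- MIN(name) is an arbitrary stand-in for a curated preferred name.
INSERT INTO compound (inchikey, preferred_name)
SELECT inchikey, MIN(name) FROM compound_flat GROUP BY inchikey;

INSERT INTO record (record_id, compound_id, reported_name)
SELECT f.record_id, c.compound_id, f.name
FROM compound_flat AS f JOIN compound AS c USING (inchikey);
""")

n_flat = con.execute("SELECT COUNT(*) FROM compound_flat").fetchone()[0]
n_norm = con.execute("SELECT COUNT(*) FROM compound").fetchone()[0]
n_rec = con.execute("SELECT COUNT(*) FROM record").fetchone()[0]
print(f"{n_flat} flat compound rows -> {n_norm} compounds, {n_rec} records")
```

Choosing the preferred name is the part that needs curation; the sketch only shows that the structural split itself is mechanical once a grouping key exists.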

jorainer commented 3 years ago

Thanks for the feedback @tsufz, good to know that it is on your radar. I only came across this while writing a doc on how to create a CompoundDb resource from MassBank.

Just to quickly explain, the point of CompoundDb is that it lets you create self-contained SQLite annotation resources for e.g. metabolites. The database layout is deliberately very simple (just a compound and an msms_spectrum database table) and flexible enough to support a variety of input sources. So far we can create databases from HMDB, MoNa, ChEBI, ... and also MassBank. The CompDb databases work with or without redundancies in the compound table; with redundancies, the user simply gets redundant results.
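As a rough illustration of that two-table layout and of what "redundant results" means in practice, here is a small sketch using Python's sqlite3 module. CompoundDb itself is an R package; the column names (`compound_id`, `exactmass`, `precursor_mz`) and the helper `compounds_by_mass` are assumptions made for this example, not the actual CompDb schema or API, and the mass values are only illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Assumed minimal layout: one compound table, one spectrum table.
CREATE TABLE compound (
    compound_id TEXT PRIMARY KEY,
    name        TEXT,
    exactmass   REAL
);
CREATE TABLE msms_spectrum (
    spectrum_id  INTEGER PRIMARY KEY,
    compound_id  TEXT REFERENCES compound(compound_id),
    precursor_mz REAL
);
-- Redundant input: two compound rows describe the same molecule,
-- one per spectrum, as in a one-compound-per-record source.
INSERT INTO compound VALUES
    ('C1', 'Caffeine', 194.0804),
    ('C2', 'Caffeine', 194.0804);
INSERT INTO msms_spectrum VALUES
    (1, 'C1', 195.0877),
    (2, 'C2', 195.0877);
""")

def compounds_by_mass(con, mass, tol=0.005):
    """Return (compound_id, name) rows whose exact mass is within +/- tol."""
    return con.execute(
        "SELECT compound_id, name FROM compound "
        "WHERE exactmass BETWEEN ? AND ? ORDER BY compound_id",
        (mass - tol, mass + tol),
    ).fetchall()

hits = compounds_by_mass(con, 194.0804)
distinct = sorted({name for _, name in hits})
print(len(hits), "hits,", len(distinct), "distinct name(s)")
```

The query still works on the redundant table; the user just sees the same compound once per source record.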

Also, to explain why I thought it might be interesting to create a CompDb from/for MassBank: CompDb databases also provide a Spectra interface, so this would be an alternative way to access MassBank annotations from R:

- online access via the MsBackend for MassBank (e.g. using the new REST interface).
- offline/local access via a CompDb annotation resource.

The reason I like the second approach is that users have all the data locally, which makes analyses reproducible because the annotation resource has a version. MassBank would be ideal for this because you already define releases. We could then simply build one CompDb for each MassBank release and distribute them via Bioconductor's AnnotationHub. It is pretty straightforward to query this AnnotationHub from R and get the annotation for a specific version for the analysis. This already works exceptionally well for genomic annotations: I am building small self-contained annotation databases with gene, transcript, exon and protein annotations from Ensembl and provide them on AnnotationHub for each Ensembl release. These are then usually used in RNA-seq analyses.

Sorry for this very long explanation - I just wanted you to get the full picture :)

MassBank / MassBank-web

Any plans to normalize the database? #266