Open jorainer opened 3 years ago
You are right. Every RECORD creates an entry in COMPOUND, and this is a conceptual mistake, but fixing it has very low priority on my side. It has some historical origin that I don't know.
Yes, I think this is a backlog issue from previous development. The compound name is not yet standardised, so each record may be unique in terms of "any name". We were already thinking about a preferred MassBank name which is then shown in the records and on the website. However, as @meier-rene said, this does not have very high priority because it needs some (painful) curation. I have it somewhere down on my to-do list, but @schymane may also have an opinion. We may have an issue open...
Thanks for the feedback @tsufz, good to know that it is on your radar. I only came across this while writing a doc on how to create a CompoundDb resource from MassBank.
Just to quickly explain, the point of CompoundDb is that it allows creating self-contained SQLite annotation resources for e.g. metabolites. The database layout is on purpose very simple (just a compound and a msms_spectrum database table) and flexible enough to support a variety of input sources. So far we can create databases from HMDB, MoNa, ChEBI, ... and also MassBank. The CompDb databases work with or without redundancies in the compound table; with redundancies, the user is simply shown redundant results.
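To make the redundancy point concrete, here is a minimal sketch using Python's standard-library sqlite3 module. Only the two table names come from the description above; every column name and all example rows are assumptions made up for illustration and do not reflect the actual CompoundDb schema.

```python
import sqlite3

# In-memory database. Table names follow the text above;
# all columns are hypothetical, not the real CompoundDb schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE compound (
    compound_id TEXT PRIMARY KEY,
    name        TEXT,
    inchikey    TEXT
);
CREATE TABLE msms_spectrum (
    spectrum_id INTEGER PRIMARY KEY,
    compound_id TEXT REFERENCES compound (compound_id)
);
""")

# Made-up rows: the same structure stored twice, once per spectrum.
con.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [("C1", "Compound A", "KEY-A"), ("C2", "compound a", "KEY-A")],
)
con.executemany(
    "INSERT INTO msms_spectrum VALUES (?, ?)",
    [(1, "C1"), (2, "C2")],
)


def compounds_for_key(con, inchikey):
    """Return all compound rows that share one structure key."""
    return con.execute(
        "SELECT compound_id, name FROM compound "
        "WHERE inchikey = ? ORDER BY compound_id",
        (inchikey,),
    ).fetchall()


# The lookup still works, it just reports the compound once per duplicate.
hits = compounds_for_key(con, "KEY-A")
```

In this toy setup the query succeeds despite the duplication; the only visible effect is that one structure comes back as two compound rows.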
Also to explain why I thought it might be interesting to create a CompDb from/for MassBank: CompDb databases also provide a Spectra interface, so there would be two alternative approaches to access MassBank annotations from R:

1. An MsBackend for MassBank (e.g. using the new REST interface).
2. A CompDb annotation resource.

The reason I like the second approach is that it allows users to have all the data locally and enables reproducible analyses, because the annotation resource will have a version. MassBank would be ideal for this because you already define releases. We could then simply build one CompDb for each MassBank release and distribute them via Bioconductor's AnnotationHub. It is pretty straightforward to query AnnotationHub from R and get the annotation for a certain version for the analysis. This already works exceptionally well for genomic annotations: I am building small self-contained annotation databases with gene, transcript, exon and protein annotations from Ensembl and provide them on AnnotationHub for each Ensembl release. These are then usually used in RNA-seq analyses.
Sorry for this very long explanation - I just wanted you to get the full picture :)
I was just wondering if there are any plans to normalize the database, i.e. to reduce the redundancy in the COMPOUND table @sneumann @tsufz @meier-rene? As far as I've seen, there is one entry in the COMPOUND table for each entry in the RECORD table (i.e. the spectra).
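For illustration, the kind of normalization asked about here could look roughly like the sketch below, written in Python with the standard-library sqlite3 module. The table names COMPOUND and RECORD are taken from the question; all column names, the use of a structure key (INCHIKEY) to decide which compounds are identical, and the example rows are assumptions and not the real MassBank schema. Choosing a reliable key is exactly the curation step that would need care in practice.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical columns; only the table names come from the discussion.
con.executescript("""
CREATE TABLE COMPOUND (ID INTEGER PRIMARY KEY, NAME TEXT, INCHIKEY TEXT);
CREATE TABLE RECORD (ACCESSION TEXT PRIMARY KEY, COMPOUND_ID INTEGER);
""")

# Made-up data: one COMPOUND row per RECORD, two of them the same structure.
con.executemany(
    "INSERT INTO COMPOUND VALUES (?, ?, ?)",
    [(1, "Compound A", "KEY-A"), (2, "compound a", "KEY-A"), (3, "Compound B", "KEY-B")],
)
con.executemany(
    "INSERT INTO RECORD VALUES (?, ?)",
    [("REC1", 1), ("REC2", 2), ("REC3", 3)],
)


def normalize(con):
    """Collapse COMPOUND rows sharing a key and remap RECORD references."""
    with con:  # run both statements in one transaction
        # Point each record at the lowest compound ID with the same key.
        # COALESCE keeps the old reference if no key match is found.
        con.execute("""
            UPDATE RECORD SET COMPOUND_ID = COALESCE((
                SELECT MIN(c2.ID)
                FROM COMPOUND c1
                JOIN COMPOUND c2 ON c2.INCHIKEY = c1.INCHIKEY
                WHERE c1.ID = RECORD.COMPOUND_ID
            ), COMPOUND_ID)
        """)
        # Remove compound rows that no record refers to any more.
        con.execute(
            "DELETE FROM COMPOUND "
            "WHERE ID NOT IN (SELECT COMPOUND_ID FROM RECORD)"
        )


normalize(con)
n_compounds = con.execute("SELECT COUNT(*) FROM COMPOUND").fetchone()[0]
record_map = con.execute(
    "SELECT ACCESSION, COMPOUND_ID FROM RECORD ORDER BY ACCESSION"
).fetchall()
```

After the call, both records of the duplicated structure reference the same COMPOUND row and the now unused duplicate is gone. Which name to keep for the merged row is left open here.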