MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Update of MassBank Record format with enhanced compound class information #205

Open meier-rene opened 4 years ago

meier-rene commented 4 years ago

At the moment, CH$COMPOUND_CLASS is a mandatory but free-format field, and its content is not very useful for computer algorithms. We would like to include some more useful information there, namely the classes from the ChemOnt ontology, and I would like to discuss the best way to include this information. Currently this field is described as:

Category of Chemical Compound. Mandatory
Example: CH$COMPOUND_CLASS: Natural Product; Carotenoid; Terpenoid; Lipid
Either Natural Product or Non-Natural Product should precede the other class names.

The entries are separated by ;, which would also work for ChemOnt because none of the classes contains a ;. To separate the individual classes within a ChemOnt entry, the character / is not available because it is included in some ontology terms. A | would be possible, but I would rather avoid this character. A '\' is available, and I would prefer this one. That is why I propose changing the record format as follows:

2.2.2 CH$COMPOUND_CLASS

Category of Chemical Compound. Mandatory
Either Natural Product or Non-Natural Product should precede the other class names.
ChemOnt ontology terms are supported in the following format:
'ChemOnt[2.1]:Organic compounds[0000000]\Lipids and lipid-like molecules[0000012]\Prenol lipids[0000259]\Tetraterpenoids[0001554]\Carotenoids[0001277]\Carotenes[0001411]'

Example: CH$COMPOUND_CLASS: Natural Product; Carotenoid; Terpenoid; Lipid; ChemOnt[2.1]:Organic compounds[0000000]\Lipids and lipid-like molecules[0000012]\Prenol lipids[0000259]\Tetraterpenoids[0001554]\Carotenoids[0001277]\Carotenes[0001411]
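To illustrate how a reader could consume the proposed format, here is a minimal Python sketch. It is not part of MassBank-web; the function name `parse_compound_class`, the `ChemOntTerm` type, and both regular expressions are assumptions made for illustration, derived only from the example line above. It splits the field on `;`, treats entries starting with `ChemOnt[version]:` as a lineage, and splits that lineage on `\`.

```python
import re
from typing import List, NamedTuple, Tuple


class ChemOntTerm(NamedTuple):
    """One level of a ChemOnt lineage (hypothetical helper type)."""
    name: str
    term_id: str


# Assumed shape of an ontology entry: "ChemOnt[<version>]:<lineage>"
_PREFIX = re.compile(r"^ChemOnt\[(?P<version>[^\]]+)\]:(?P<path>.+)$")
# Assumed shape of one level: "<name>[<numeric id>]"
_TERM = re.compile(r"^(?P<name>.+)\[(?P<id>\d+)\]$")


def parse_compound_class(value: str) -> Tuple[List[str], List[Tuple[str, List[ChemOntTerm]]]]:
    """Split a CH$COMPOUND_CLASS value into free-text classes and ChemOnt lineages."""
    free_text: List[str] = []
    lineages: List[Tuple[str, List[ChemOntTerm]]] = []
    for entry in (part.strip() for part in value.split(";")):
        if not entry:
            continue
        prefix = _PREFIX.match(entry)
        if prefix is None:
            # Not a ChemOnt entry: keep it as a free-text class.
            free_text.append(entry)
            continue
        terms = []
        for level in prefix.group("path").split("\\"):
            term = _TERM.match(level.strip())
            if term is None:
                raise ValueError(f"malformed ChemOnt term: {level!r}")
            terms.append(ChemOntTerm(term.group("name"), term.group("id")))
        lineages.append((prefix.group("version"), terms))
    return free_text, lineages


# The example value from the proposal, as a raw string so "\" stays literal.
EXAMPLE = (
    r"Natural Product; Carotenoid; Terpenoid; Lipid; "
    r"ChemOnt[2.1]:Organic compounds[0000000]\Lipids and lipid-like molecules[0000012]"
    r"\Prenol lipids[0000259]\Tetraterpenoids[0001554]\Carotenoids[0001277]\Carotenes[0001411]"
)

free_text, lineages = parse_compound_class(EXAMPLE)
```

Because the free-text entries still come first and are still separated by `;`, a reader that only splits on `;` keeps working and simply sees the ChemOnt lineage as one more entry.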

The benefits of this change are:
- compatible change in record format, no database changes required
- machine-readable information available
- if we implement full-text search one day, these terms are immediately available to the user

Any concerns? Any change requests?

schymane commented 4 years ago

My biggest issue with this field is the (historical) requirement for "Natural" or "Non-Natural Product" as this is impossible to automate. You will see the vast majority of the records we contributed have this as NA - we had to do this to satisfy upload requirements at the time. Is it time to remove the requirement to state "Non-Natural Product"?

For instance, for us, caffeine is an environmental contaminant, but it is also a natural product. Ditto for nicotine ... and many others. We cannot systematically classify these terms ... but ChemOnt allows a systematic classification instead, as you proposed.

Re separator, maybe also consider | instead of the backslash.

tsufz commented 4 years ago

I would also prefer automatic classification. What about ClassyFire?

tsufz commented 4 years ago

What is the problem with |?