MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Ideas for Validator #158

Open meier-rene opened 5 years ago

meier-rene commented 5 years ago

I want to collect ideas for automatic validation in this issue:

  • check for duplicate entries in CH$NAME: implemented and applied, 242 records fixed
  • check for InChIKey-style pattern in CH$NAME: implemented and applied, 131 records fixed
  • perhaps flag super-short names that contain letters and numbers, as these are e.g. database codes like CID1233; a lot of ZINC and CHEBI codes sneak through
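The three name checks listed here could be sketched roughly as follows. This is only an illustration in Python, not the validator's actual code: the function name `check_names`, the case-insensitive duplicate comparison, and the `DB_CODE_RE` heuristic are all assumptions made for the example.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")

# Hypothetical heuristic for database codes such as "CID1233" or "CHEBI:27732":
# a short upper-case prefix, an optional separator, then digits only.
DB_CODE_RE = re.compile(r"^[A-Z]{2,8}[:_ -]?\d+$")


def check_names(names):
    """Return (name, problem) tuples for the CH$NAME values of one record."""
    problems = []
    seen = set()
    for name in names:
        key = name.strip().lower()  # assumption: duplicates ignore case
        if key in seen:
            problems.append((name, "duplicate"))
        seen.add(key)
        if INCHIKEY_RE.search(name):
            problems.append((name, "looks like an InChIKey"))
        if DB_CODE_RE.match(name.strip()):
            problems.append((name, "looks like a database code"))
    return problems
```

A pattern this simple will also catch legitimate short names such as "PCB 153", so the third check is probably better treated as a flag for manual review than as an error.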

Treutler commented 5 years ago

I stumbled over this spectrum and noticed that the fragment m/z values are far above the precursor m/z (423.5989 and 782.2591 vs 147.0441). If the validator could reveal fragments which are too heavy, then we would at least be aware of that.

schymane commented 5 years ago

That spectrum looks like it is only noise (those are the only two peaks). Note that spectra processed with RMassBank can sometimes contain heavier peaks if they have certain adducts (up to +N2O allowed), so this should be considered in any validation. There are, however, many spectra with bogus heavy peaks that are clearly just noise (where it is clear from mass defect etc, like in this case) and maybe the (sub)formula assignment routine in RMB could be integrated into the validator to help separate the possible goodies from the baddies?
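A minimal sketch of such a heavy-peak check, with the +N2O allowance built in, might look like this. The function name `heavy_peaks`, the extra `tol` margin, and the assumption of singly charged ions are illustrative choices, not part of the validator or of RMassBank.

```python
# Monoisotopic mass of N2O (2 x N + O), about 44.0011.
N2O_MASS = 2 * 14.003074 + 15.994915


def heavy_peaks(precursor_mz, peak_mzs, allowance=N2O_MASS, tol=0.01):
    """Return the peaks that lie above precursor m/z + allowance (+ tol).

    Assumes singly charged ions; `tol` is a hypothetical safety margin
    for rounding in the record.
    """
    limit = precursor_mz + allowance + tol
    return [mz for mz in peak_mzs if mz > limit]
```

For the spectrum discussed above, `heavy_peaks(147.0441, [423.5989, 782.2591])` returns both peaks, so the record would be flagged. Whether a flagged peak is noise or a genuine adduct would still need a formula check or manual inspection.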

What is going to be the procedure for spectra that the validator identifies as (likely) pure noise, like the example you just raised?

meier-rene commented 5 years ago

I have no idea what the proper procedure would be, especially in this case, because it's experimental data, not metadata. I wouldn't touch it. One could flag it or raise an issue with the original contributor, but sometimes this will be complicated.

Treutler commented 5 years ago

Check CH$NAME for

schymane commented 5 years ago

In light of issues found/raised by Herbert Oberacher recently, I see a couple of new ideas we should consider implementing in the validator:

schymane commented 5 years ago
Treutler commented 5 years ago

Please check whether all fragment-m/z in the PK$ANNOTATION section are present in the PK$PEAK section

schymane commented 5 years ago

Good idea. I suggest building in a slight tolerance to avoid decimal place issues. I wouldn't check the reverse, i.e. there may be fewer PK$ANNOTATION entries than PK$PEAK entries, but there should not be more PK$ANNOTATION entries than PK$PEAK entries (unless anyone puts out multiple annotations for a given PK$PEAK, I am not aware of this case ... RMassBank only puts out one formula per peak and tags it if more were possible ...)
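The one-directional check with a tolerance could be sketched as below. The function name `unmatched_annotations` and the default tolerance of 0.001 are assumptions for illustration; parsing the m/z values out of the PK$ANNOTATION and PK$PEAK blocks is left out.

```python
def unmatched_annotations(annotation_mzs, peak_mzs, tol=0.001):
    """Return annotation m/z values that have no peak within `tol`.

    Only annotations are checked against peaks, not the reverse, so
    unannotated peaks are accepted. Several annotations pointing at
    the same peak are accepted as well.
    """
    return [
        a for a in annotation_mzs
        if not any(abs(a - p) <= tol for p in peak_mzs)
    ]
```

An empty result means every annotated m/z was found in the peak list.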

meowcat commented 5 years ago

"unless anyone puts out multiple annotations for a given PK$PEAK, I am not aware of this case ..."

  • I am not 100% certain that we don't have cases like this from RMassBank - the s4power branch is certainly able to produce such records if you tell it to.
  • I personally don't think a record should be invalid if multiple annotations are present for a peak. Note that the annotation field is loosely defined in what it is allowed to contain, so this is certainly legal and possibly also welcome in some cases...

schymane commented 5 years ago

Agree with @meowcat - in principle no problem with having multiple annotations for one peak

schymane commented 5 years ago

We should add a validator check that screens whether there are spectra with identical SPLASHes but conflicting compound information. I have just reported several cases of this in MassBank-data - it would be great to screen whether any more cases exist so we can amend as required, and add this as a general check to avoid this happening in the future. It is very hard for us to catch this on the RMassBank side if people do not do the manual checks (but this is something we must consider how to validate on the data processing side too).
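One way to screen for this is to group records by SPLASH and report every SPLASH that maps to more than one compound. The sketch below assumes the records have already been parsed into `(accession, splash, inchikey)` tuples and uses the InChIKey as the compound identity; the function name `splash_conflicts` and the `skeleton_only` switch are hypothetical.

```python
from collections import defaultdict


def splash_conflicts(records, skeleton_only=False):
    """Map each conflicting SPLASH to {compound key: [accessions]}.

    records: iterable of (accession, splash, inchikey) tuples.
    skeleton_only: compare only the first 14 characters of the InChIKey
    (the connectivity block), so that stereoisomers sharing a spectrum
    are not reported. This is an assumption, not project policy.
    """
    by_splash = defaultdict(dict)
    for accession, splash, inchikey in records:
        key = inchikey[:14] if skeleton_only else inchikey
        by_splash[splash].setdefault(key, []).append(accession)
    # Keep only SPLASHes that are shared by more than one compound key.
    return {s: groups for s, groups in by_splash.items() if len(groups) > 1}
```

Each entry in the result lists the accessions per compound, which should make it easier to decide which record carries the wrong compound information.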

Treutler commented 5 years ago

Please check that the three structure identifiers InChIKey, SMILES, and InChI are consistent with one another, i.e. that they all describe the same structure.

schymane commented 5 years ago

...and once these (InChI, SMILES and InChIKey) are consistent with one another, we need to check that the related database identifiers match by InChIKey...and either update or remove incorrect ones.
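The consistency check on InChI, SMILES and InChIKey discussed above could be sketched as follows. To keep the example free of dependencies, the two converters are passed in by the caller. With RDKit installed, `Chem.InchiToInchiKey` and `Chem.MolToInchiKey(Chem.MolFromSmiles(...))` might serve as those converters, but that pairing, like the function name `structure_mismatches`, is an assumption rather than what the validator does.

```python
def structure_mismatches(inchikey, inchi, smiles, inchi_to_key, smiles_to_key):
    """Return (field, derived InChIKey) pairs that disagree with `inchikey`.

    inchi_to_key and smiles_to_key are caller-supplied functions that
    turn an InChI or a SMILES string into an InChIKey (hypothetical
    hooks; a cheminformatics toolkit would normally provide them).
    """
    problems = []
    key_from_inchi = inchi_to_key(inchi)
    if key_from_inchi != inchikey:
        problems.append(("InChI", key_from_inchi))
    key_from_smiles = smiles_to_key(smiles)
    if key_from_smiles != inchikey:
        problems.append(("SMILES", key_from_smiles))
    return problems
```

An empty list means the three identifiers agree. A SMILES converted via standard InChI can lose information such as some tautomer or charge details, so a mismatch reported for the SMILES field may need a manual look before the record is changed.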