MassBank / MassBank-data

Official repository of open data MassBank records
74 stars 59 forks source link

Update NaToxAq #49

Closed tsufz closed 5 years ago

meier-rene commented 5 years ago

Hi, the validator rejects '\~' characters in CH$NAME, e.g. '(1~{S},2~{R},9~{S},10~{S})-7,15-diazatetracyclo[7.7.1.0^{2,7}.0^{10,15}]heptadecane'. Is there a reason for having '(1~{S},2~{R},9~{S},10~{S})' vs '(1S,2R,9S,10S)'?

schymane commented 5 years ago

The 1S,2R,9S,10S is more correct, what you have seen (9~{S} etc) seems rather the result of incorrect/inappropriate handling of name text formatting conversions when going from one database to another somewhere down the line and it should, indeed, fail the validator as it’s an inherited computer error not an appropriate naming scheme. There are a lot of these around (e.g. conversion errors surrounding alpha, beta, ortho, para, R, S, D, L etc etc) and we may need to develop some rules to catch them in your validator (or just collect them as we find them …).

meier-rene commented 5 years ago

You are right. These characters are from conversions of formatted text to plain text. I have seen a lot of these things and other funny stuff and I have already fixed plenty of this. One of my favorites is the ® sign. For now the validator flags all characters which do not belong to the name of a chemical compound to my knowledge. Thats why I was asking if someone knows a rule were the '~' is needed. I would rather not implement automatic conversions here, because automatic conversions of leftovers from automatic conversions will probably create other problems.

In this case I will remove these conversion leftovers with some command line magic and we will see if other issues remain. @tsufz please take no action atm.

tsufz commented 5 years ago

Okay, I am not keen to re-process everything again. This is quite tedious with all the different energies...

schymane commented 5 years ago

It is a good idea to check the infolists created by RMassBank for these naming issues in advance, this is why we print out that csv for checking. Then you only need to correct it once, and it is fixed in all subsequent records. This was our whole reason for creating this system of infolist retrieval and archiving. It is especially critical now since CTS disabled the scoring of the names and many more random names with text issues come through … we can only do so much our side but the users do need to check the exported data at two critical steps (infolists and failpeaks) to ensure quality!

tsufz commented 5 years ago

@schymane, I fully agree. I was going to remove / edit those names. But @meier-rene asked me to take no action so far...

schymane commented 5 years ago

You need to do this name checking in the infolist during processing …i.e. when the infolist is created – before you even write out the records … once the records are created it is too late and you’d have to redo the entire mbWorkflow again for all those records … (which is why the choice of Rene to fix “his” side was really the most practical for this case). So please remember to check your infolist carefully when creating future records ;-)

meier-rene commented 5 years ago

@tsufz doing this string manipulation just took 10min. Nothing to worry about. This PR was merged slightly changed as PR #50. Thanks all!

schymane commented 5 years ago

I’m adding comments to #50. There are still issues. InChIKeys and duplicate names have snuck through… I suggest you (@meier-rene) put a validation check for an InChIKey-style pattern in the name validation checks, and also perhaps a flag for super-short names that contain letters and numbers (as these are e.g. database codes, like CID1233 or something). A lot of ZINC and CHEBI etc ones sneak through if people do not check their infolists (but I have not yet found this in this exact dataset).