MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Curating CH$NAME entries #81

Open schymane opened 7 years ago

schymane commented 7 years ago

Hi all,

We now have many records from many contributors that describe the same substance but carry different CH$NAME entries (and different combinations of CH$NAME entries). Since RMassBank starts from the Compound List, a different starting name means these CH$NAME fields are not even consistent between RMassBank records; they depend on when the compound data was retrieved, etc. How should we go about fixing this? For the MassBank display (the title etc.), the FIRST CH$NAME entry is critical. Ideally this should be the same for a unique compound across all contributors, but how do we choose which one is "right", and which CH$NAME entries to keep and which to discard? Which entry should be the FIRST CH$NAME entry?

Some random examples:
"Imidacloprid urea" and "1-[(6-chloropyridin-3-yl)methyl]imidazolidin-2-one"
"Imidacloprid-urea" and "1-[(6-Chloropyridin-3-yl)methyl]imidazolidin-2-one"
"Imidacloprid-urea" and "CHEMBL71188" and "1-[(6-chloropyridin-3-yl)methyl]imidazolidin-2-one"

"2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-methyl-2-propan-2-yl-1H-pyrimidin-4-one"
"2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-Methyl-2-propan-2-yl-1H-pyrimidin-4-one"
"Pyrimidinol" and "2-isopropyl-6-methyl-1H-pyrimidin-4-one"
"Pyrimidinol" and "2-Isopropyl-6-methyl-pyrimidin-4-ol" and "6-Methyl-2-propan-2-yl-1H-pyrimidin-4-one"

or the various possibilities (mixed and matched) for Lidocaine:
LID_235.1805_10.1
Lidocaine
2-(diethylamino)-N-(2,6-dimethylphenyl)acetamide
Lidocain
2-(Diethylamino)-N-(2,6-dimethylphenyl)acetamide
Lignocaine

One option would be to just choose the first of the first CH$NAME entries processed for a given compound, but this is somewhat random, only partially reproducible, and in the case of LID_235.1805_10.1 would result in a very strange CH$NAME entry as the primary name. Another choice would be the "Preferred name" from the CompTox Dashboard, which will be fine for the curated MassBank.EU entries we have done ... but will not hold (or be possible) for all of MassBank or (necessarily) for new records until they are registered, and these names are also not always perfect. It would also remove the preferred "primary name" of the contributing institute (i.e. the entry from the compound list), which is something that some people use a lot.

Any thoughts? @meowcat @tsufz @ChemConnector

@meowcat can you remember why we chose only 3 CH$NAME entries, was this for our sanity within RMassBank and not because of a numerical restriction (I saw no restriction in the Record Specs?). Will it be a problem if I "curate" our records to have potentially MORE than 3 names? Will RMassBank be able to deal with this when we re-parse records (and if not can we help it deal with it?). Or do I have to stick with 3 names?

The current search functionality (by name) seems to work for any entry in a CH$NAME field, so this should not be an issue. Will this remain so in the future?

Finally ... does anyone have a sensible idea how we could store and access these names (and related identifiers) and make this future proof - so we can ensure that new contributions are named consistently? Would we be able to access that within RMassBank to check not just the "infolist" entries locally but also to check those already ON MassBank? I.e. have a centralized "infolist"?

Thanks!

tsufz commented 7 years ago

Will discuss the topic with Martin, my opinion is too humble.

tsufz commented 7 years ago

However, the curation of a central list on MassBank should be easier in future. Once the new DB is available, we will have many opportunities to use it. I guess after refactoring the DB, we will also work on the API? This could be the access point for RMassBank (and other software).

tsufz commented 7 years ago

We suggest a central mapping table in which a preferred name is set automatically and linked to all existing names, with manual curation of mismatches, so that only the curated name is used. The original records stay untouched. Something like:

Preferred name; collection of names
AAA; aaa, AaA, aAa, aaA, AAa, Aaa

uchem-massbank commented 7 years ago

We would need another identifier (or two) to link to, because names are not always unique (especially abbreviations) – InChIKey? Plus a back-up identifier for those where InChIKeys aren’t defined? Keeping the original records untouched sounds like a neat way around it – you will map on the DB side so the records only appear under the “preferred name” in record listings? Who is going to coordinate collecting the synonyms? I have a whole lot of additional info from CompTox to add that has not yet made it into any (live) records … Should the preferred name also go into the collection of names?

AAA; aaa, AaA, aAa, aaA, AAa, AAA, Aaa

tsufz commented 7 years ago

Of course, we also need the structural identifier. The list should be collected automatically from the names available in all records. The preferred name could be retrieved from a reliable source, but needs final approval from curators. This is tedious but necessary work, in my opinion.
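A minimal sketch of how such a list could be collected automatically, keyed by the structural identifier. It assumes records are plain text with `CH$NAME: <name>` lines and a `CH$LINK: INCHIKEY <key>` line; the function names and the placeholder InChIKey are hypothetical and only for illustration. Choosing and approving the preferred name is left to curators and is not part of the sketch.

```python
from collections import defaultdict


def parse_record(text):
    """Return (inchikey or None, list of CH$NAME values) for one record text."""
    names, inchikey = [], None
    for line in text.splitlines():
        if line.startswith("CH$NAME:"):
            names.append(line[len("CH$NAME:"):].strip())
        elif line.startswith("CH$LINK: INCHIKEY"):
            inchikey = line.split()[-1]
    return inchikey, names


def build_name_table(records):
    """Map each InChIKey to all names seen across records, in first-seen order."""
    table = defaultdict(list)
    for text in records:
        inchikey, names = parse_record(text)
        if inchikey is None:
            continue  # no structural identifier: leave for manual curation
        for name in names:
            if name not in table[inchikey]:
                table[inchikey].append(name)
    return dict(table)


# Placeholder InChIKey (not a real one); names taken from the Lidocaine example.
records = [
    "CH$NAME: Lidocaine\nCH$NAME: Lignocaine\n"
    "CH$LINK: INCHIKEY AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "CH$NAME: Lidocain\nCH$NAME: Lidocaine\n"
    "CH$LINK: INCHIKEY AAAAAAAAAAAAAA-BBBBBBBBBB-N",
]
table = build_name_table(records)
```

The original record texts are only read, never modified, which matches the idea of keeping the records untouched.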

However, I am not a fan of using the neat workaround to avoid (auto-)curation at the level of the record files themselves. The goal is to finally get rid of workaround situations such as the mapping table.

We are back at the curation discussion in #25. I suggest starting strict curation of all records that are marked accordingly. All spectra except the Waters ones are marked with a CC-BY license, so it is possible to curate them. We already did this when injecting the SPLASH, and we should do the same with other things. It is annoying for users to get different names for one compound, mislinks, etc.

Curation will improve the reliability of MassBank; starting with the names would be a great step forward.

And finally, the preferred name is part of the collection of names.

ChemConnector commented 7 years ago

@schymane I think it's a good idea to look at the CompTox Chemistry Dashboard as a source of "Preferred Names". Certainly they will not always be perfectly matched for this purpose but in the vast majority of cases they will be appropriate. In arranging for mappings to the dashboard we could also coordinate around preferred name assignment. The associated synonyms are always available (for data that is public) so if a particular synonym was preferred over our assigned Preferred Name we could discuss. These designs are particularly subjective in nature after all. Looking forward to helping with this aspect of the project as required.

tsufz commented 7 years ago

@ChemConnector great, thanks for your help. This makes it easier to sort out lazy name tags and to improve things. @Treutler and @naperone, this should be considered in the development of the new DB structure in #9.

schymane commented 7 years ago

@ChemConnector @meowcat and I discussed this today independently. @ChemConnector and I have a partial curation of all contributions that came via MassBank.EU (direct upload by @tsufz or @schymane) – these have all been registered in the Dashboard and I have a preferred name and IUPAC name entry from the Dashboard that I can add if they do not exist already. @meowcat and I agreed the best way would be to just add an additional CH$NAME field if needed and not remove/change any existing ones unless absolutely necessary (safest approach, the 3-CH$NAME field entry is a restriction from RMassBank only). I will only CHANGE the FIRST CH$NAME entry for approx. 5 records across all libraries we processed where we found this was totally and obviously the wrong name. Some structural information, CAS numbers, masses and other identifiers will also change for other records – beyond these 5 records. We’ve done a pretty detailed look through and have checked errors with the original sources to clarify where necessary. I need/want to do these updates as part of our efforts to add in the DTXSIDs into all our MassBank.EU records so we can work towards integration in the Dashboard, which will be live hopefully mid August. We’re coordinating with an eye on automating this for the future, but we’re not there yet. It’s been a lot of manual work amongst the automated work and many hours have gone into this.

I do not really want to start doing a batch upgrade of making the names consistent between institutes, I feel this is beyond the current scope – likewise it would be quite impractical if any auto-curation started before I can get this partial curation/enhancement done – because I have everything cross-linked at the current state.

Beyond this partial update … it is not feasible to ask @ChemConnector to register ALL compounds in the entire MassBank including all the old records (approx. 6,000 compounds not yet registered) just to obtain preferred names. This is a huge time commitment for a very small gain in the relative scale of things. We discussed whether we can continue our curation bit by bit, database by database (we have a relatively efficient workflow going now) and can especially look at this for new contributions assuming the traffic stays reasonable. We can also look at flagging issues in old records and updating Names and DTXSIDs etc with the same workflow we’ve been using and upgrading where we get clear matches – but resolving all issues will be extremely time consuming and beyond our capacity. We should discuss auto-curation efforts (#25) in a way that we do not rely on registration in the Dashboard, but can benefit from their expertise.

To make things future-proof, we will need to find a good rule to source preferred names for future contributions to avoid the current issues. Something like a mapping table with identifiers that can be checked along with submission could work (especially if this does not require changes to the original records?!) – one could also consider a CH$PREFERRED_NAME (or similar) field that we could add for indexing, to avoid overwriting NAME entries in original records and rather just add to them (like e.g. the SPLASH – this was just an addition not an actual manipulation). I do not really want to see individual MassBank records filled with several to hundreds of synonyms, I don’t see us as a synonym collector and would rather leave that to external databases – it will crowd the records with the current display techniques and add to the distance to the peak list.

My thoughts (and discussion summary) anyway.

m-arita commented 7 years ago

Dear all,

Cheminfo researchers repeat the issue of standard names again and again. Adding a new name to CH$NAME is fine but I do not agree on overwriting even under the CC-BY license. It requires authors' permission since MassBank says it is user-contributed. For this reason, the JP side did not touch records and instead created a curated version independently (and the curated version is provided from MoNA in MSP format).

Let's not touch the CH$NAME entries but instead use the first 14 characters of the InChIKey. This ignores stereo information, but it suffices for MassBank. It is much easier and causes no conflict with the original authors.

Best wishes,

Masanori Arita

tsufz commented 5 years ago

With reference to #156, I would like to come back to the discussion on curation of metadata. The issue raised by @schymane is a very good example of why curation of metadata is required, especially harmonisation of the displayed name. I guess it is quite annoying for people to scroll through a list with redundant entries because of different name presentations.

Best Tobias

schymane commented 5 years ago

I am not sure whether curating CH$NAME entries is the way to go – rather a synonym database and a consistent presentation of names for the searches may be a better option – so that all synonyms used are present (and added as new ones come in) and single chemical entries are returned and displayed consistently. The best option will likely depend on the code base …

tsufz commented 5 years ago

Yep, see my comment above (from 25 Jul 2017!)

meier-rene commented 5 years ago

To solve #156 it is not necessarily needed to finally decide about this issue, but it wouldn't hurt. My feeling of the discussion is that we should not substitute existing names with curated ones, but I don't see any problems in adding new names or adding new fields to the MassBank scheme.

For #156 we need a unique key for grouping and I propose to use InChI or a subset of InChI, like the first field of the InChIKey, as already discussed above. This is reasonable now because we have added all InChIs wherever at least a SMILES was available. There is only a small fraction of records left without InChI/SMILES, and for them we need to fall back to grouping by name.
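A rough sketch of such a grouping key, assuming each record has already been reduced to a dict with hypothetical `id`, `inchikey` and `names` entries. The InChIKeys are placeholders, and lower-casing the fallback name is an assumption of this sketch, not something decided in the thread. Whether to group by the first block or the full key is left as a parameter.

```python
from collections import defaultdict


def grouping_key(inchikey, first_name, use_first_block=True):
    """Build a grouping key: InChIKey (or its first block), else the first name."""
    if inchikey:
        key = inchikey.split("-")[0] if use_first_block else inchikey
        return ("inchikey", key)
    # Fallback for records without InChI/SMILES: group by normalised first name.
    return ("name", first_name.strip().lower())


def group_records(records, use_first_block=True):
    """Return {grouping key: [record ids]} for a list of record dicts."""
    groups = defaultdict(list)
    for rec in records:
        key = grouping_key(rec.get("inchikey"), rec["names"][0], use_first_block)
        groups[key].append(rec["id"])
    return dict(groups)


# Placeholder data: R1 and R2 share the first block but differ in the second.
records = [
    {"id": "R1", "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBBB-N", "names": ["Compound X"]},
    {"id": "R2", "inchikey": "AAAAAAAAAAAAAA-CCCCCCCCCC-N", "names": ["(R)-Compound X"]},
    {"id": "R3", "inchikey": None, "names": ["Tentative TP 1"]},
]
by_first_block = group_records(records)
by_full_key = group_records(records, use_first_block=False)
```

With the first block, R1 and R2 end up in one group; with the full key they stay separate. R3 is grouped by name in both cases.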

The other question is how should the respective group of records be named. Easiest would be first in list. More reasonable maybe most occurring? And of course we can additionally provide a curated list with names for particular compounds, but I doubt that this can be complete. Maybe it could cover the most occurring compounds. Nevertheless, I wouldn't push this idea.
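To illustrate the "most occurring" option, here is a small sketch that counts each name once per record and lets ties go to the name seen first. The function name and the particular combination of names per record are made up for illustration (the names themselves come from the Lidocaine example above). Names are compared exactly, so spellings that differ only in capitalisation count separately.

```python
from collections import Counter


def most_common_name(name_lists):
    """Return the name occurring in the most records; ties go to the first seen."""
    counts = Counter(
        name
        for names in name_lists
        for name in dict.fromkeys(names)  # count a name once per record
    )
    # max() returns the first maximal item; Counter keeps insertion order.
    return max(counts, key=lambda name: counts[name])


group = [
    ["LID_235.1805_10.1", "Lidocaine"],
    ["Lidocaine", "2-(diethylamino)-N-(2,6-dimethylphenyl)acetamide"],
    ["Lidocain", "Lignocaine"],
]
chosen = most_common_name(group)
```

Here `chosen` is "Lidocaine", since it is the only name that appears in two records.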

My suggestion: What about adding a uniform synonym for every compound as last CH$NAME field. This can be done by an algorithm (the source of the synonym is still an open question) and does not break existing MassBank format.
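A sketch of what this could look like on the record text, assuming records are plain text with `CH$NAME: <name>` lines. The function name, the accession and the trimmed-down record are hypothetical, and where the uniform name comes from is left open, as above.

```python
def append_uniform_name(record_text, uniform_name):
    """Insert `uniform_name` as the last CH$NAME field of a record text."""
    lines = record_text.splitlines()
    name_rows = [i for i, line in enumerate(lines) if line.startswith("CH$NAME:")]
    if not name_rows:
        raise ValueError("record has no CH$NAME field")
    last = name_rows[-1]
    if lines[last][len("CH$NAME:"):].strip() == uniform_name:
        return record_text  # already the last name, nothing to do
    lines.insert(last + 1, f"CH$NAME: {uniform_name}")
    result = "\n".join(lines)
    # splitlines() drops a trailing newline, so restore it if it was there.
    return result + "\n" if record_text.endswith("\n") else result


# Hypothetical, heavily shortened record.
record = (
    "ACCESSION: XX000001\n"
    "CH$NAME: Lignocaine\n"
    "CH$NAME: Lidocain\n"
    "CH$FORMULA: C14H22N2O\n"
)
updated = append_uniform_name(record, "Lidocaine")
```

Existing names are left as they are; the sketch only skips the insert when the uniform name is already the last CH$NAME entry, so a name that occurs earlier in the record would be repeated.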

schymane commented 5 years ago

> To solve #156 it is not necessarily needed to finally decide about this issue, but it wouldn't hurt. My feeling of the discussion is that we should not substitute existing names with curated ones, but I don't see any problems in adding new names or adding new fields to the MassBank scheme.

I agree here, we should not replace any information (unless it is blatantly wrong of course) but just add extra fields if we need, as you suggest below. I would not like to see 100s of synonym entries added to records.

> For #156 we need a unique key for grouping and I propose to use InChI or a subset of InChI, like the first field of the InChIKey, as already discussed above. This is reasonable now because we have added all InChIs wherever at least a SMILES was available. There is only a small fraction of records left without InChI/SMILES, and for them we need to fall back to grouping by name.

Again I agree - I think InChIKey or InChIKey first block would be the appropriate way to go; default to grouping by name for those with missing entries. I am caught between grouping by first block or not; it has advantages and disadvantages.

> The other question is how should the respective group of records be named. Easiest would be first in list. More reasonable maybe most occurring? And of course we can additionally provide a curated list with names for particular compounds, but I doubt that this can be complete. Maybe it could cover the most occurring compounds. Nevertheless, I wouldn't push this idea.

Display name: First in list would be a random choice; the "most occurring" is likely a better option that I would prefer and should be easy enough to manage?

> My suggestion: What about adding a uniform synonym for every compound as last CH$NAME field. This can be done by an algorithm (the source of the synonym is still an open question) and does not break existing MassBank format.

OK in principle, but what if the name is repeated? (Maybe not a problem, just aesthetically displeasing to see unnecessary repetition.) An alternative is to introduce a second field, CH$DISPNAME or CH$PREFNAME (also not ideal). My thoughts at least...

sneumann commented 5 years ago

CTS had a way to guess the "best" name, using some scoring. Not sure if they still do.

What happens to InChIKey grouping for Emma's tentatives, where we are not entirely certain about the structure?

schymane commented 5 years ago

The CTS scoring has not worked that way for many years and they have not been able to reinstate it to work as we used to use it … we will have to do a synonym count our side I guess, which should be simple enough (?).

Re tentatives: if no InChIKey, group by (first?) name. They all have name entries, and they are all distinct for different cases …

Treutler commented 5 years ago

Regarding the display of groups of searched records, I vote for using the full InChI/InChIKey as the grouping criterion. Then we can safely display a compound name and a structure. If we group isomeric structures, both the compound name and the structure will be wrong for a subset of the respective group. For the selection of the displayed name, the most common name or the shortest name are possible ways to go, but this is not really critical in my eyes.

schymane commented 5 years ago

I agree with your comments to use the full InChI(Key) and would suggest to use the most common name as default; sometimes the shortest ones are also non-unique (same abbreviation for totally different names) or may also be some of the ugly identifier name synonyms that slipped through when CTS changed their name scoring system. It’s also possible we have the same synonyms associated with multiple stereoisomers as a factor of the mixed contributors …

sneumann commented 5 years ago

Hi, I would like to suggest DISPLAYNAME, as people might take 'preferred' personally, if their favorite is not the preferred one. Yours Steffen

schymane commented 5 years ago

I'd vote for CH$DISPLAY_NAME or similar as well!

tsufz commented 5 years ago

I agree with CH$DISPLAY_NAME or similar.