Add a nickname column to database.txt

leeping commented 10 years ago

This pull request adds an extra column to database.txt consisting of an informal molecule nickname. The specification of the nickname is (almost by definition) not strict. I hoped to achieve a compromise between the following desired but conflicting properties:

1) Unique (i.e. minimize the number of duplicates).

2) Short - the issue with IUPAC names is that some of them became too unwieldy to print out in ForceBalance.

3) Specific - we should get results for this molecule by Googling its nickname.

I use the nickname in ForceBalance simply as a display field, alongside the mobley_ID. The ForceBalance printout is a fixed-width table which is easy to read (in my opinion):

#==============================================================================================================================================#
#|                                                     Hydration free energies (kcal/mol)                                                     |#
#|                                     Target: FreeSolv  Type: Hydration_OpenMM  Objective = 1.63924e+03                                      |#
#|  ID                                                Nickname   Reference +- StdErr   Calculated +- StdErr   Calc-Ref     Weight   Residual  |#
#==============================================================================================================================================#
    mobley_1017962                            methyl_hexanoate      -2.490 +-  0.600       -3.778 +-  0.011     -1.288      0.000      0.000
    mobley_1019269                                  butan-1-ol      -4.720 +-  0.600       -3.272 +-  0.016      1.448      1.000      2.096
    mobley_1034539            2,3,3',4,4',5-hexachlorobiphenyl      -3.040 +-  0.100        1.271 +-  0.005      4.311      0.000      0.000
    mobley_1036761                             cyclohexanamine      -4.590 +-  0.600       -3.091 +-  0.029      1.499      0.000      0.000
    mobley_1046331                              phenyl_formate      -3.820 +-  0.600       -7.897 +-  0.018     -4.077      1.000     16.619
    mobley_1075836                           methyl_propanoate      -2.930 +-  0.600       -4.841 +-  0.009     -1.911      1.000      3.651
    mobley_1079207                         1,3-dichlorobenzene      -0.980 +-  0.600       -0.203 +-  0.002      0.777      0.000      0.000
    mobley_1107178                                  iodoethane      -0.740 +-  0.600       -2.617 +-  0.004     -1.877      1.000      3.522
    mobley_1139153                      2,2,4-trimethylpentane       2.890 +-  0.600        3.055 +-  0.003      0.165      1.000      0.027
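
For readers who want to reproduce a layout like the one above, here is a minimal sketch of fixed-width row formatting. It is not ForceBalance's actual printing code; the function name `format_row` and the column widths are assumptions chosen only to resemble the table.

```python
def format_row(mol_id, nickname, ref, ref_err, calc, calc_err, weight, residual):
    """Format one row of a fixed-width results table.

    Hypothetical helper: the column widths below are illustrative guesses,
    not taken from ForceBalance.
    """
    return (
        "%18s %43s %11.3f +- %6.3f %12.3f +- %6.3f %10.3f %10.3f %10.3f"
        % (mol_id, nickname, ref, ref_err, calc, calc_err,
           calc - ref,  # the Calc-Ref column is derived from the two values
           weight, residual)
    )


if __name__ == "__main__":
    print(format_row("mobley_1019269", "butan-1-ol",
                     -4.720, 0.600, -3.272, 0.016, 1.000, 2.096))
```

Right-aligning the nickname in a fixed-width field keeps the numeric columns lined up, which is also why very long IUPAC names are awkward in this kind of display.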

Thanks,

Lee-Ping

davidlmobley commented 10 years ago

Hi,

You only changed database.txt, not database.pickle? We will need both changed.

Thanks.


leeping commented 10 years ago

Okay, the nicknames are added to the pickle. It's only 10% bigger than the original pickle so I think the compression format was right.

davidlmobley commented 10 years ago

Hi, Lee-Ping,

I just merged this. But - is there any way your procedure for generating nicknames can be codified? (And if so can you provide the code?) The vision is that we make the whole of the database re-generatable from primary data (experimental data, literature references, canonical isomeric SMILES, notes, and POSSIBLY IUPAC name — the last only if there are still molecules which cannot be named by OEIUPAC). If there is additional data which has to be generated by hand, this may be a problem.

If you do not have a way to codify it yet - I wonder if there is an easy way to get short nicknames via the cross link I currently have to PubChem?

If there is no good way to generate nicknames, this actually may be an argument to exclude them from the database itself. Manual generation of nicknames is not an extensible procedure. :)

Thanks.


jchodera commented 10 years ago

How would you derive nicknames for new molecules added to the set? Is there a unique, programmatic way to compute them?

davidlmobley commented 10 years ago

I thought that was what I asked. :)


leeping commented 10 years ago

I generated the nicknames by hand but it was a few weeks ago. Basically I did a Google search on every molecule with a long IUPAC name (say, >40 letters). I then tried to pick a name that was brief, legible, unique and searchable. I used the trade name when available since that usually satisfies these criteria.
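
The first step described above, picking out molecules whose IUPAC names are too long, is the one part of the procedure that is easy to automate. A minimal sketch, assuming the names are available as a dict keyed by molecule ID (the function names and the dict layout are hypothetical; only the 40-character threshold comes from the comment):

```python
MAX_NAME_LENGTH = 40  # threshold from the comment above ("say, >40 letters")


def needs_nickname(iupac_name, max_length=MAX_NAME_LENGTH):
    """Return True if the IUPAC name is long enough to warrant a manual nickname."""
    return len(iupac_name) > max_length


def names_to_review(iupac_by_id, max_length=MAX_NAME_LENGTH):
    """Return the sorted IDs of molecules whose IUPAC names exceed the threshold.

    `iupac_by_id` is an assumed layout: {molecule ID: IUPAC name}.
    """
    return sorted(
        mol_id for mol_id, name in iupac_by_id.items()
        if needs_nickname(name, max_length)
    )
```

This only produces the list of candidates; choosing a brief, searchable name for each one remains a manual step.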

My personal motivation for going through the database by hand was to get an "appreciation" for the molecules in the database, since it's chemically quite interesting - we've got everything from sugars to drugs to refrigerants and pesticides. :)

It might be possible to generate these names automatically, but since "informal style" is a central purpose of these nicknames, I don't think we should require the automatically-generated name to be the only acceptable one. In terms of extensibility, we can simply make the IUPAC name the default nickname - but since the requirements are loose, anyone with a good reason can feel free to change it.
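
The fallback suggested here (IUPAC name by default, with an optional hand-picked replacement) could be sketched as follows. The function name, the dict layout, and the idea of keeping overrides in a separate mapping are all assumptions for illustration, not something that exists in the repository.

```python
def display_name(mol_id, iupac_by_id, overrides=None):
    """Return the nickname for a molecule.

    Falls back to the IUPAC name unless a hand-picked override exists.
    Both mappings are assumed to be keyed by molecule ID.
    """
    overrides = overrides or {}
    return overrides.get(mol_id, iupac_by_id[mol_id])
```

Keeping the overrides in their own mapping would mean a newly added molecule needs no manual work at all, and the hand-curated part stays small and easy to audit.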

Regarding including / excluding this column from the database, I'm fine with it either way. I personally find the nicknames to be quite useful as long as we don't set strict rules for ourselves, and it's up to us (collectively) whether this is helpful for others.

davidlmobley commented 10 years ago

I have two questions on this before I give a final opinion:

1) Do you have any way to ensure the unique name uniquely and completely specifies the molecule, i.e. there is a 1:1 correspondence name:molecule? For example, common names often are ambiguous, failing to specify stereochemistry or other important aspects (take “glucose” for example). Or in other words, by “unique”, do you mean a name which uniquely specifies the molecule, or just a name which is unique to the set?

1a) How much chemistry knowledge does it take to generate/understand the unique names? For example, “lindane” is a particular stereoisomer of a hexachlorocyclohexane; I’ve seen this botched before when someone wasn’t careful enough specifying WHICH stereoisomer they meant so they got the wrong hexachlorocyclohexane. [Note: These issues tend to be most pronounced for molecules with stereo centers.]

2) Have you checked PubChem at all for some of the cases where you manually generated names? Would it have worked to take a short name from PubChem?

Thanks.


jchodera commented 10 years ago

@davidlmobley : FreeSolv is your baby, but it is generally good practice to have some discussion before choosing to merge in pull requests like this, since it is more difficult to undo these kinds of things.

I would have liked a little more discussion here, because this now means that we have added one more item to the list of primary data that we are committed to generating by hand for every new compound that is ever added to this database. Were there a programmatic way to generate these, then this would be derived data, but there appears not to be a way to do this. We have also essentially committed @davidlmobley to manually and personally vouching for the accuracy of each and every common name that @leeping has entered, and I wonder whether he is really comfortable with that role. I am certainly not.

leeping commented 10 years ago

Hi there,

To clarify, I added this column to my ForceBalance data file simply for use as a display field. This PR was created from David's suggestion and we can still revert it. Given the strict requirements of what goes into the official database, I suggest we should just revert it.

Lee-Ping

leeping commented 10 years ago

To answer David's questions:

1) By "unique", I meant unique to the set, not unique in the space of all molecules. We could always check for duplicate nicknames in the former case, but I don't think we can guarantee the latter.

1a) I think a chemistry graduate should be able to look at the nickname and form a mental picture of the molecule, or look it up on Google to see whether it's a refrigerant or pesticide, whether it has aromatic rings, etc. To look up the exact molecule safely, they should use the PubChem ID and IUPAC name fields, which are part of the database and intended for that purpose.

2) Many of my manually generated names did come from searching the PubChem database.
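
The set-level uniqueness check mentioned in answer 1) could be sketched like this. It assumes the nicknames are available as a dict keyed by molecule ID; the function name is hypothetical, and the case-insensitive comparison is an added assumption rather than something agreed in this thread.

```python
from collections import defaultdict


def find_duplicate_nicknames(nickname_by_id):
    """Return {nickname: [molecule IDs]} for nicknames shared by several molecules.

    Comparison ignores case and surrounding whitespace (an assumption).
    """
    ids_by_name = defaultdict(list)
    for mol_id, nickname in nickname_by_id.items():
        ids_by_name[nickname.strip().lower()].append(mol_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }
```

An empty result means the nicknames are unique within the set. It says nothing about whether a nickname identifies a single molecule in general, which is the stronger property that cannot be guaranteed.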

jchodera commented 10 years ago

We can probably keep it if we figure out some way to convey that these names are only unofficial names and should never be used for producing molecules, and that their presence is optional. Perhaps "unofficial common name" or something is a better title?

I just think we need some discussion of the implications before we commit things.

leeping commented 10 years ago

Hi John,

I think the description of this field should be included in the README as well as the header to prevent people from misusing it.

I don't mind changing the title to something other than "nickname", but I can't guarantee that it's the common name either; I only looked up shorter names for the molecules that had long IUPAC names.

Perhaps "display name" or "unofficial name" would be okay.

Thanks,

Lee-Ping
