OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

Chemical Registration System #361

Open drc007 opened 8 years ago

drc007 commented 8 years ago

There have been a few mentions of the need for a chemical registration system and I thought it might be useful to collect needs/requests/desires in one thread and then I'll have a look at what might be possible.

The key question is what you want the system to do. Here are a few thoughts to get you going.

  1. Check molecule has not been registered before, error check structure
  2. System to assign a unique identifier and date time stamp.
  3. Support for multiple drawing packages, ChemDraw, Marvin, Elemental, JSME
  4. Generate SMILES, InChI, InChIKey, IUPAC name
  5. Display image of structure
  6. Chemist adds lab notebook number
  7. Automatic calculation of physicochemical properties
  8. How do you deal with multiple batches, salt forms?
  9. Ability to update with other identifiers, ChEMBL, ChemSpider, PubChem etc.
  10. Restrictions on who can edit existing structures
  11. Ability to share a link to a record

MedChemProf commented 8 years ago

@drc007 The list of capabilities for the registration system seems pretty complete and ambitious. Some professional commercial systems that I have used in the past were not that capable. Since I do not have the programming background to know how difficult it would be to implement all of the mentioned features, I thought I would suggest also thinking about a minimum set of requirements. Perhaps if there was a method implemented that only accomplished items 1 and 2, then that would solve some of the recently discovered duplication problems found in the Master Tracking Sheet. Might the minimum feature set in combination with the master spreadsheet be a way to go?
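
To make the "items 1 and 2 only" idea concrete, here is a minimal sketch in Python. It assumes each structure has already been reduced to a canonical key (for example an InChIKey produced by whatever cheminformatics toolkit ends up being used); generating that key is outside the sketch. The class names, the `REG-` identifier prefix and the in-memory storage are all hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    reg_id: str              # identifier assigned by the registry (item 2)
    structure_key: str       # canonical structure key, e.g. an InChIKey
    registered_at: datetime  # UTC date-time stamp (item 2)


class DuplicateStructure(Exception):
    """Raised when a structure key has been registered before (item 1)."""


class Registry:
    """Hypothetical in-memory registry; a real one would need persistent storage."""

    def __init__(self, prefix="REG"):
        self._prefix = prefix
        self._by_key = {}

    def register(self, structure_key):
        # Normalise whitespace and case so trivially different inputs match.
        key = structure_key.strip().upper()
        if not key:
            raise ValueError("empty structure key")
        if key in self._by_key:
            # Report the identifier of the existing record to the caller.
            raise DuplicateStructure(self._by_key[key].reg_id)
        reg_id = f"{self._prefix}-{len(self._by_key) + 1:06d}"
        record = Record(reg_id, key, datetime.now(timezone.utc))
        self._by_key[key] = record
        return record
```

Registering the same key twice raises `DuplicateStructure` carrying the identifier of the existing record, which is the behaviour that would catch the kind of duplication seen in the Master Tracking Sheet. Structure error checking (the second half of item 1) is not covered here.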

mattodd commented 8 years ago

Thanks for kicking off this discussion Chris.

Items 1-5 yes, those would all be useful.

Point 6 - If what you have in mind is that a chemist manually adds something, then from experience this won’t work. Much better would be a system that can be pointed at a domain and extract any lab notebook URL that contains that molecule. Probably hard, but that’s what we need.

  7. Yes - I imagine @lpatiny’s system can do things of this kind.
  8. Don’t know. Feels like this depends on how #6 is done.
  9. Again, ideally this is not manual, but I guess it would have to be. @cdsouthan would want this locked in.
  10. Probably there would need to be some restriction, but of the kind that we have already with the Master Sheet - people can request access. This is one of the reasons we’ve never gone with a proprietary system, on the assumption that providing access to other people, flexibly, will be a challenge.
  11. Yes, crucial. Also important that the outward-facing contents can be machine-read. This might then take care of some of #9.
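
As a rough illustration of a shareable link plus machine-readable contents (items 9 and 11), a record could be exposed as JSON at a stable URL. This is only a sketch: the base URL, the field names and the function names are assumptions, not part of any existing OSM system.

```python
import json
from urllib.parse import quote

# Hypothetical base address; a real deployment would use its own domain.
BASE_URL = "https://example.org/registry"


def record_url(reg_id):
    """Build a shareable link for a record (item 11)."""
    return f"{BASE_URL}/{quote(reg_id, safe='')}"


def export_record(reg_id, smiles, external_ids=None):
    """Serialise a record as JSON so other tools can read it (item 9)."""
    payload = {
        "id": reg_id,
        "url": record_url(reg_id),
        "smiles": smiles,
        # Cross-references such as ChEMBL, ChemSpider or PubChem identifiers.
        "external_ids": dict(external_ids or {}),
    }
    return json.dumps(payload, sort_keys=True)


# Example with a made-up registry identifier and ethanol as the structure.
print(export_record("REG-000001", "CCO"))
```

If every record is readable this way, mapping to external databases could be done by scripts that read the public output rather than by manual entry.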

In general the bigger picture here is of course that the ELN we use would talk to the registration system, i.e. data funnels from the experiments automatically into the index of molecules, solving the batch number problem and so on. I know we’ve spoken about this before (#285, #319). I’ve been in touch with Brian Marsden/Karen Porter of the SGC, who offered a trial run of their ChemBioHub component ChemReg.

drc007 commented 8 years ago

@MedChemProf I agree. I thought it would be useful to try to capture a superset of feature requests; we can then hopefully define a minimum subset that would be essential. The first aim would be to accommodate that subset, then hopefully add further enhancements. I certainly don't propose we should aim to have an all-singing, all-dancing package from the start.

@mattodd Whilst a link to an ELN would be useful, I guess not all contributors will be using the ELN, so we would also need standalone access to the registration system. The problem with using lab notebook/page number as a unique identifier is that everyone has a different way of naming/numbering.

Darren01 commented 8 years ago

Hi All

I found this piece of software useful when working collaboratively with different people within the same company. https://github.com/KevinLawson/excel-cdk

I hope that this is of interest to yourselves.

cdsouthan commented 8 years ago

A useful set of specs (but it was 4, not 3 I was suggesting for robustness, plus the SD file to make the Holy Heptet). However, AWAK variants of these are baked in to (dozens?) of stand-alone reg systems and customised for every big pharma company. Given that no less than 12 Molecular design and synthesis groups (including the Honorable OSM'ers) are listed for Syd U (http://sydney.edu.au/science/chemistry/research/research-areas.shtml) it seems from the outside that a shared medium-weight commercial solution (from Dotmatics, ChemAxon or whoever) makes more sense than re-invention. The IOs can be as open as anyone likes. Sure the code base is closed, but then so are ChemDraw and Marvin.

Darren01 commented 8 years ago

Point 7 in the original post, "Automatic calculation of physicochemical properties"

Some of these can be calculated with the rcdk package, an add-on for the R statistics environment.

Hope this is of use.

mattodd commented 8 years ago

We have talked about this regularly @cdsouthan, e.g. https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/285#issuecomment-87491509 .

1) The point is not that the software code is proprietary. People can use ChemDraw or Windows or whatever. The point is that whatever platform we use needs to be one to which anyone can add data. If we can run a chemical registration system that will allow me to hand out permissions to enter data to 100 people (in the way that I can with the Google Sheet), then that may be promising and we should look into it. If licences to enter data are controlled, or cost money and need to be set up via the company, then that just won't work.

2) Similarly, if a proprietary system can be completely open, in terms of all of the data (not the underlying code) contained within it being public-facing, downloadable and machine-readable, then that too might work.

3) I also have an eye on long-term sustainability. A proprietary solution needs to be something that will last, and something that if it dies will allow an easy port of all the data out of the sinking ship. Open source code is less vulnerable (not invulnerable) to this. Similarly if we are able to do a deal with a proprietary product (where we are granted some licence arrangement that is suitable for an open community) I don't want that arrangement expiring with a change in CEO, leaving us high and dry.

4) I do like the idea of open source code (desirable, not necessary) that we control since that makes it easier to innovate. I want a system where working on a molecule triggers other things, such as automated processes to find others working on that molecule, or relevant commercial products or relevant papers - a system that understands the content within it and starts forcing connections. Again, if there is a proprietary solution already doing this, we should look at it. But building something will allow us to create what we want.
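
The "working on a molecule triggers other things" idea in point 4 can be sketched as a simple publish/subscribe hook. This is an assumption-laden illustration, not an existing system: the class names are hypothetical, structures are identified by an opaque key, and the only handler shown is one that reports who else has touched the same structure.

```python
from collections import defaultdict


class MoleculeEvents:
    """Hypothetical event hub: handlers run whenever a molecule is worked on."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, structure_key, contributor):
        # Call every handler in subscription order and collect their results.
        return [handler(structure_key, contributor) for handler in self._handlers]


class OverlapFinder:
    """Handler that reports other contributors already seen for a structure."""

    def __init__(self):
        self._seen = defaultdict(set)

    def __call__(self, structure_key, contributor):
        others = sorted(self._seen[structure_key] - {contributor})
        self._seen[structure_key].add(contributor)
        return others


events = MoleculeEvents()
events.subscribe(OverlapFinder())
events.publish("KEY-1", "alice")         # nobody else yet
print(events.publish("KEY-1", "bob"))    # bob learns that alice is on KEY-1
```

Other handlers (searching for relevant papers or commercial availability, say) would plug in through the same `subscribe` call; those are left out because they depend on external services.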

I know of many of the current commercial solutions, such as Dotmatics (widely used by a number of academic groups), but I do not know of one that ticks all of the above boxes, though I freely admit I have not yet spent much time test-driving them. It would be enormously useful for people to report here on any commercial solution that they have themselves test-driven recently that may be suitable.

As things stand, weighing everything up at this point, my view is that the balance is tipped in favour of a new, open source solution, despite this not yet existing. The current Master Sheet is a component of that - not enough, not perfect, not a registration system, not open source code, but it has the level of flexibility, control and public-facing-ness that is desirable and @lpatiny's cheminfo system can make elegant use of it. The balance here will respond to new data, so I'm absolutely open to argument.

cdsouthan commented 8 years ago

Fine, as a good Openness example most of the arguments have now been put politely on the table. Consequently, I have nothing to add. I may pitch in later for testing the public database x-mappings via Master Sheet extractions.

MedChemProf commented 8 years ago

@drc007 Would it help (or make sense) to look for support from something like the Open Science Prize (https://www.openscienceprize.org/) to get a Registry system up and running?

drc007 commented 8 years ago

@MedChemProf Yes I had that (and a couple of other sources) in mind, but I think we need a firm proposal to go with. I think most of this can be achieved using open source tools. It could then be used as a starting point for any project.

mattodd commented 8 years ago

@MedChemProf @drc007 I agree. Just seeking clarification on something, and then I can hopefully kick-start something you might be interested in within 24h or so.

madgpap commented 8 years ago

We're meeting with @drc007 today and we'll discuss it too.

mattodd commented 8 years ago

Hi @madgpap @drc007 @MedChemProf here is a 2-page description of an open source platform I and a few others such as @lpatiny would like to build. A molecule tracker from creation through to biology and archiving. See what you think - I won't preface this much except to say that we adapted the text to a pre-application submission to the Wellcome Trust last week as part of this program. If we're asked to submit a full proposal, we'd ask for a lot more money than would be covered by the Prize, meaning if we submitted something for the Prize the scale and vision of the project would be smaller or more focussed - this would avoid any risk of double-dipping.

My view is that we MUST as a community submit something for the prize along these lines, to make an impact on how we discover and develop new medicines. It's an extraordinarily good fit: "a new prize that will seek to unleash the power of open content and data to advance research and its application for health benefit" and "The resulting Open Science Prize aims to stimulate the development of novel and ground-breaking tools, services and platforms to enable the re-use of digital research outputs relevant to biomedical or health applications."

Question is: what and who? Obviously this group here, and OSM more generally, can be an excellent road-testing community, and one that could/would benefit.

mattodd commented 8 years ago

...and just to reiterate that the Wellcome app and the open science prize idea would be best, from my point of view, as open source development projects, meaning the team should be determined largely by those interested enough to contribute and with the bandwidth to do so. For the prize, we'd need to select one high-impact project we feel we could deliver, which would have a major impact on health research around the world. I like the idea of properly tracking molecules in research projects - and allowing us to make connections between people around the world who may be working on related molecules or chemical reactions without realising it. If Gmail can auto-read my email and send me adverts relevant to what I'm reading, there ought to be something watching my lab notebook and connecting me to the most relevant resources in real time.

cdsouthan commented 8 years ago

Who invites who to the hangout?