OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

How do we search, store, index, annotate molecules? #285

Closed drc007 closed 9 years ago

drc007 commented 9 years ago

I was thinking of building a file containing molecules and the associated biological data; however, I'm finding it difficult to identify all the molecules that have been made on the project, whether there is a unique identifier for every molecule, and then find a link to any data that might have been generated on a molecule. I've seen various web pages that contain SMILES strings but I've no idea if these are comprehensive. Some have identifiers but I don't know if the identifier refers to the parent compound, if it changes with salt form or how we distinguish between batches of compound. How are OSM numbers assigned and how do we check a compound has not been prepared before? As assay data is generated how do we link this to a parent/salt form/batch and how do you decide what data can be averaged? Whilst I could plough through the web pages looking for information it would soon be out of date, so it would be better if there was a sustainable model in place for new molecules and data.

mattodd commented 9 years ago

Thanks for this Chris - this is the most pressing problem we have as a group.

One possibility is that we use a commercial compound registration system. I'm pessimistic about that for multiple reasons, not least because it would be likely that such a solution could only be accessed by a sub-group of people. We'd also be beholden either to continued project funding or the benevolence of a company. We'd have no control of the software. And we'd still be needing to input data to such a system manually.

A better alternative is that we build our own system. We started trying to do this with the compound pages (http://malaria.ourexperiment.org/osm_procedures). But it's not sustainable - too much effort.

Even better is that we tag all our ELN entries with strings and we develop a way to scrape that and auto-construct a CRS. That's hard, but also relies on people manually inserting strings. That's not working at the moment.

I think we need to reconsider how the lab book works. Bill Mills at Mozilla suggested that instead of writing up a lab notebook we fill in forms which are then used to construct a relational database of all the compounds in the project and all the associated experiments. That database then writes the ELN for us, and is used to create what we need - the Structure-Data File (SDF).

I wrote up these possible approaches here: http://openwetware.org/wiki/OpenSourceMalaria:Technical_Operations#Data_Management

This doesn't solve the problem in the short term, but this software, if it could be written, would solve compound management for us and for any open source medchem group. Maybe it's time to make a big push to make it happen. An advantage is that it forces humans to enter information required to make the project completely machine-readable. If the software can automatically generate human-readable ELN entries there would likely be a time saving on that alone.

Does this make sense?

In the interim we could assemble an SDF-lite by adapting the text at the bottom of the wiki (http://openwetware.org/wiki/OpenSourceMalaria:Triazolopyrazine_%28TP%29_Series#Strings_for_Google). I have a student volunteer interested in doing this right now as it happens and I'll loop him in. It'd mean adding in potencies to each of the strings as a first step.

In terms of codes-for-batches - we're not doing this, but we can, fairly trivially I think. It'd mean adding numbers to the OSM codes we've been using. I've written before that I don't know what to do about scalemic compounds (particularly when we don't know if they are scalemic): #172

Looping in @cdsouthan @murrayfold who I know have views on this, but this is open to suggestions from anyone who has used the ELN and thought about data management.

bkatiemills commented 9 years ago

thanks @mattodd - that about sums up what I was suggesting WRT entering data into a database first (via a web form or whatever you like), and then using that database to build notebooks and answer other questions later.

In order to move forward there, we're going to need a clear understanding of what raw information we need an actual human to input into the database to begin with; then we'll need to understand what the process is for turning that raw information into other interesting things (like how to build an SDF out of it).

mattodd commented 9 years ago

OK. @alintheopen do you want to nominate a lab book entry that possesses a decent amount of complexity (long-ish procedure, attached pictures, attached data, clear conclusion and final yield - that's all our entries, right?) and then I can work towards putting the content into a form that consists of data fragments that one could input instead, and from which the entry could be reconstituted by machine?

drc007 commented 9 years ago

Perhaps you need to think first whether you want the compound registration system to include all compounds that are made on the project, including intermediates. Or whether you want to only register final compounds that are sent to biological testing.

If the latter you could have a very simple registration system. Perhaps

- Unique ID (I don't know how these are generated)
- Structure of parent as a SMILES string
- Salt form
- Batch number
- Name of person who registered the compound
- Reference to Lab Notebook (this could be a URL for electronic entries or book number/page number for paper notebooks)

This should be easy to set up and would require little ongoing support, and I'd be concerned that a more complex registration system might put off people. However it would not provide a repository for the experimental procedures etc.
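As a rough illustration of the "lite" registration record described above, the fields could be held in a small Python structure. This is only a sketch: the `OSM-X-n` ID format, the class and function names, and the rule that a repeat registration of the same parent and salt form gets the next batch number are all assumptions made for illustration, not how OSM codes are actually assigned. It also assumes SMILES strings have already been canonicalised by a cheminformatics toolkit, since plain string comparison cannot tell that two different SMILES describe the same molecule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Registration:
    compound_id: str      # hypothetical ID format, e.g. "OSM-X-1"
    parent_smiles: str    # parent structure; assumed already canonical
    salt_form: str        # e.g. "free base", "HCl"
    batch: int            # 1 for the first batch, 2 for the second, ...
    registered_by: str
    notebook_ref: str     # URL of an ELN entry, or book/page for paper


class Registry:
    """In-memory sketch of a 'lite' compound registry."""

    def __init__(self):
        self._ids = {}      # parent_smiles -> compound_id
        self._batches = {}  # (parent_smiles, salt_form) -> last batch number
        self.records = []

    def register(self, parent_smiles, salt_form, registered_by, notebook_ref):
        # Reuse the existing ID if this parent has been registered before.
        if parent_smiles not in self._ids:
            self._ids[parent_smiles] = f"OSM-X-{len(self._ids) + 1}"
        key = (parent_smiles, salt_form)
        self._batches[key] = self._batches.get(key, 0) + 1
        record = Registration(
            compound_id=self._ids[parent_smiles],
            parent_smiles=parent_smiles,
            salt_form=salt_form,
            batch=self._batches[key],
            registered_by=registered_by,
            notebook_ref=notebook_ref,
        )
        self.records.append(record)
        return record


# Placeholder values only; none of these are real project records.
registry = Registry()
first = registry.register("CCO", "free base", "chemist A", "http://example.org/entry/1")
second = registry.register("CCO", "free base", "chemist B", "book 3, page 42")
other = registry.register("CCN", "HCl", "chemist A", "http://example.org/entry/2")
```

Here `second` reuses the ID of `first` and becomes batch 2, which is one possible answer to the question of how to check whether a compound has been prepared before.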

mattodd commented 9 years ago

The "lite" form you mention is certainly enough from the point of view of browsing potencies and identifying SAR data. Even this has potential issues of manual compliance. But it seems to me we could do this using something as simple as the wiki - by adapting this section, for example:

http://openwetware.org/wiki/OpenSourceMalaria:Triazolopyrazine_%28TP%29_Series#Strings_for_Google

The "reference to the lab notebook" is, in the section linked to above, the molecule pages we've been constructing. See for example the link following the code "MMV668958". It could be a link to an individual experiment, but that is less useful.

This acts like an SDF, in being a simple text file, and could presumably easily be morphed into an SDF.

The alternative is to have a manually annotated genuine SDF sitting somewhere. This was not being well updated. Why was that, again, @murrayfold? The wiki is quite easy to edit by anyone. Chris is that all the info that's needed - the items you mention? Include potency? Like I said I think I have some personpower that can help make that section of the wiki more useful for this purpose, so we could play with that too.

But the kind of thing that would do all of this and be more powerful is if there is a system that "understands" the content and allows one to browse all the data about a given molecule - synthetic attempts and biological data beyond potency. Such a thing could easily collect together all the data for different batches, for example. A much bigger project. But one of the drivers behind our writing up the compound pages was to collect together all the disparate synthetic attempts at the molecules in the project. In some cases a molecule was being synthesised by 4-5 different people, and it was easy to lose track of the data given that people were not always 100% reliable in linking between experiments.


MATTHEW TODD | Associate Professor School of Chemistry | Faculty of Science

THE UNIVERSITY OF SYDNEY Rm 519, F11 | The University of Sydney | NSW | 2006 T +61 2 9351 2180 | F +61 2 9351 3329 | M +61 415 274104 E matthew.todd@sydney.edu.au | W http://sydney.edu.au/science/chemistry/research/todd.html | W http://opensourcemalaria.org/

CRICOS 00026A This email plus any attachments to it are confidential. Any unauthorised use is strictly prohibited. If you receive this email in error, please delete it and any attachments.

cdsouthan commented 9 years ago

I could do with a bit more context here before seeing if I can contribute. I have made suggestions in earlier fora but it sounds like Chris 007 is offering to build a "registration lite" for OSM and this dialogue is scoping it out. This would be great of course. However, there is a wider context of possibly meeting broader requirements for any OSDD team. As I see it, this needs a chemistry ELN, a bio ELN, a registration system, SAR analysis pipelines and feeds out to databases such as ChEMBL and PubChem. But for this thread this might be out of scope.

mattodd commented 9 years ago

Well, there are two possibilities. The first is a "lite" manual inclusion of core data in a text file (either on the wiki or in an SDF). This is doable for now, but not a good solution long term. There is then the Master Plan of a way to maintain an ELN that is aware of its contents. Bill suggested we don't maintain an ELN, but instead maintain something that writes the ELN, i.e. the thing we make, when we record our experimental data, is a relational database of objects and properties. This can be "trivially" adapted to make an SDF. It can also, Bill thinks, be used to write the ELN in a way that we can read it.

Just as one example: In a chemistry ELN we often write up our experiments in a long form way that could, if we're honest, be written instead as a series of actions and a series of times, i.e. in a spreadsheet. String X was combined with string Y. Heating 40 minutes. Cooling. Addition of solvent X. And so on. This may, when writing it, feel artificial, but the advantage is that the machine reading it does not have to interpret our lyrical English language in order to hazard its best guess as to what we're doing. And the discipline of having people enter all the required data would help us. Imagine an ELN entry where we enter strings, batch codes and potencies. That relationship can then be used to link automatically the potency with the experiment that made that batch, avoiding the need for the human to create the link.

And yes, any solution needs to be open so that anyone can use it for any chemistry or biology project - that's essential.
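To make the "series of actions and a series of times" idea concrete, here is a minimal sketch of how such steps might be stored and then turned back into readable text. The `Step` fields and the `render` function are hypothetical names chosen for illustration; a real schema would need temperatures, quantities and so on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str                      # e.g. "combine", "heat", "cool", "add"
    materials: tuple = ()            # identifiers (e.g. strings) involved
    minutes: Optional[int] = None    # duration, if one applies


def render(steps):
    """Turn structured steps back into a readable procedure."""
    sentences = []
    for step in steps:
        text = step.action.capitalize()
        if step.materials:
            text += " " + " and ".join(step.materials)
        if step.minutes is not None:
            text += f" for {step.minutes} min"
        sentences.append(text + ".")
    return " ".join(sentences)


procedure = [
    Step("combine", ("string X", "string Y")),
    Step("heat", minutes=40),
    Step("cool"),
    Step("add", ("solvent X",)),
]
print(render(procedure))
# prints: Combine string X and string Y. Heat for 40 min. Cool. Add solvent X.
```

The same list of steps could equally be exported to a spreadsheet or a database row per step, since nothing about the data depends on the English rendering.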


miike commented 9 years ago

@mattodd We could feasibly do some of that using ChemicalTagger (http://chemicaltagger.ch.cam.ac.uk/). I've tried running it across the notebook entries with some success in the past.

As for a registration system, something external is definitely needed that isn't built into Labtrove. It's possible to extract a large number of OSM/MMV IDs out of the current posts though.
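Pulling IDs out of existing posts could be as simple as a regular expression pass. The sketch below is an illustration, not the approach actually used: the MMV pattern follows the codes quoted in this thread (MMV followed by six digits), while the OSM pattern (`OSM-` plus a series letter and a number) is an assumption about the code format.

```python
import re

MMV_ID = re.compile(r"\bMMV\d{6}\b")        # matches e.g. MMV668958
OSM_ID = re.compile(r"\bOSM-[A-Z]-\d+\b")   # assumed format, e.g. OSM-S-106


def extract_ids(text):
    """Return the sorted, de-duplicated compound IDs mentioned in a post."""
    found = set(MMV_ID.findall(text)) | set(OSM_ID.findall(text))
    return sorted(found)


post = "Resynthesis of MMV668958 (see also MMV639565 and MMV668958)."
print(extract_ids(post))
# prints: ['MMV639565', 'MMV668958']
```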

mattodd commented 9 years ago

Yes, the use of tools to digest written information into a machine-understandable form is definitely possible, but this does rely on the human entering the information correctly and fully in the first place. I wonder how well it does with datasets, i.e. in understanding that a dataset contains the NMR spectrum associated with a given SMILES string, for example. Perhaps when we've decided on a sample ELN entry to try this all out on, we can see what ChemicalTagger makes of it.

One of the things that is apparently doable is to have a bot generate lab notebook entries. i.e. the data etc could indeed be external and piped into the ELN. I'd not considered this.


cdsouthan commented 9 years ago

As we know, the direct capture/archiving of spectra and the formalised representation of synthetic schema/protocols are on the ChemSpider road map (but direct from instrument rather than possible mangling in the ELN). OK, granted the team want robust working solutions asap. IMHO bots can certainly do good grunt work with sophisticated heuristics. However, against the 100s of hours the chemists spend on the job, even 1 or 2% spent entering ELN data via structured input forms (tedious but crucial) gives big payoffs in data handling downstream (inc. hooking up with the reg system and the NMR machine outputs and even a ChemSpider feed).

miike commented 9 years ago

Agreed, I think we can automate some of this to make the data entry easier and I wish there was a silver bullet but it seems like structured data entry (guided) is going to give the biggest payoff despite having a larger upfront cost.

drc007 commented 9 years ago

One of the big inducements for using an ELN is that the experimental procedures can then be cut and pasted into publications, theses, patents etc. However at the time of entry the payoff can seem to be in the far distance!

mattodd commented 9 years ago

Question related to the "lite" version. If we assembled the core data on the Series 4 wiki, in such a way that it could be used to make a bare bones SDF, what data should we include? At the moment, here

http://openwetware.org/wiki/OpenSourceMalaria:Triazolopyrazine_%28TP%29_Series#Strings_for_Google

I have, for the first compound (MMV639565) the following: MMV Code, (OSM Code and link to page), SMILES, InChI, InChIKey and Potency in μmol separated by spaces. Can this be easily ported to an SDF? If not, how should the data be recorded?

Alternatively, should we write the data there in the form already needed for the SDF - should we maintain text in an SDF standard, or does that require connection tables, which would be too cumbersome for a wiki page?

drc007 commented 9 years ago

A tab delimited text file as described should be simple to convert into sdf format.
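For what it's worth, the data-field part of that conversion can be sketched with the Python standard library alone. This is a hedged illustration: the function name and column names are made up, and the structure block written here is deliberately an empty V2000 molblock (zero atoms). A real conversion would use a cheminformatics toolkit to generate the connection table from the SMILES or InChI, which is exactly the part that is painful to maintain by hand on a wiki.

```python
import csv
import io

# An empty V2000 molblock: three header lines (title, blank, blank), a counts
# line with zero atoms and bonds, then the terminator. A real conversion would
# replace this with a connection table generated from the SMILES or InChI.
EMPTY_MOLBLOCK = "\n".join([
    "{title}",
    "",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
])


def tsv_to_sdf(tsv_text, title_field):
    """Convert tab-delimited text (with a header row) to SDF-style records."""
    records = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        lines = [EMPTY_MOLBLOCK.format(title=row[title_field])]
        for name, value in row.items():
            lines.append(f"> <{name}>")
            lines.append(value)
            lines.append("")          # blank line ends each data item
        lines.append("$$$$")          # record separator
        records.append("\n".join(lines))
    return "\n".join(records) + "\n"


# Placeholder rows only; these are not real project compounds or potencies.
table = "ID\tSMILES\tPotency\nEXAMPLE-1\tCCO\t1.5\nEXAMPLE-2\tCCN\t0.2\n"
sdf_text = tsv_to_sdf(table, title_field="ID")
```

Each column becomes a `> <name>` data item, so adding a column to the text file adds a field to every record without anyone editing the SDF directly.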

drc007 commented 9 years ago

In my experience, manually editing SDF files in a text editor is fraught with errors.

mattodd commented 9 years ago

Sure. If you look at the legacy SDF in GitHub it makes one think that maintaining a simpler version of a text file on the wiki page (easy to edit) could be useful, i.e. not the full thing, just the core data needed to browse structures/potencies plus links out to more data.

(SDF is here https://github.com/OpenSourceMalaria/OSM_compounds - can click directly on the file and view on the page)


bkatiemills commented 9 years ago

Just want to reiterate that manually maintaining SDFs is exactly what we want to avoid, because of the fraught nature @drc007 points out.

To add a little technical clarity, I'm thinking of leveraging the ELN's API (http://docs.labtrove.org/2.3/lt/Using_the_Rest_API) to build an ELN off of a database - this in principle lets us completely separate data entry from data presentation - a very powerful and convenient design feature.

Big big +1 to @mattodd's suggestion of parsing synthesis into action / time pairs for machine readability; this lets us repackage that information conveniently any way we like in future.
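A minimal sketch of the "database first, notebook second" idea, using SQLite from the Python standard library. The table and column names are assumptions for illustration, and the rendering step only builds plain text; actually posting that text through the ELN's API is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (
        id TEXT PRIMARY KEY,
        smiles TEXT NOT NULL
    );
    CREATE TABLE experiment (
        id TEXT PRIMARY KEY,
        product_id TEXT NOT NULL REFERENCES compound(id),
        chemist TEXT NOT NULL,
        yield_percent REAL
    );
""")
# Placeholder rows only; none of these are real project records.
conn.execute("INSERT INTO compound VALUES (?, ?)", ("EXAMPLE-1", "CCO"))
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?)",
    [("EXP-1", "EXAMPLE-1", "chemist A", 62.0),
     ("EXP-2", "EXAMPLE-1", "chemist B", 71.5)],
)


def render_compound_page(conn, compound_id):
    """Build a plain-text summary of every attempt at one compound."""
    (smiles,) = conn.execute(
        "SELECT smiles FROM compound WHERE id = ?", (compound_id,)
    ).fetchone()
    lines = [f"{compound_id} ({smiles})"]
    rows = conn.execute(
        "SELECT id, chemist, yield_percent FROM experiment "
        "WHERE product_id = ? ORDER BY id",
        (compound_id,),
    )
    for exp_id, chemist, yield_percent in rows:
        lines.append(f"- {exp_id} by {chemist}: {yield_percent:.1f}% yield")
    return "\n".join(lines)


page = render_compound_page(conn, "EXAMPLE-1")
print(page)
```

Because every experiment row points at a compound, collecting all the attempts at one molecule by different people becomes a query rather than a matter of people remembering to link entries.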

mattodd commented 9 years ago

Agree 100% Bill - just thinking of a temporary solution for one of the series of molecules we're looking at while we're writing up the work and thinking of what to do next. This is separate from the ongoing need of the project to automate.

Pinging Alice (who was away) to nominate a good ELN entry to work on - i.e. to deconstruct the entry into machine-understandable data that could be used to reconstruct the ELN entry, and thereby think about what kind of interface would be needed to get a human to input the required machine-understandable data.

Agree with this philosophy that it's desirable to separate the data entry from the data presentation.


cdsouthan commented 9 years ago

Would it be possible for Labtrove to work on this? AWK there must be < 100's of global OSDD teams (not to mention 1000s of SMEs) with essentially identical requirements for chemistry ELN <> structures <> reg system <> bioassay ELN > filtered feeds to major public dbs

mattodd commented 9 years ago

That's exactly the impact, which is why this needs proper funding. I think we show what could be done first in a test case, then extrapolate to an open source ELN that "understands" its contents.


drc007 commented 9 years ago

@cdsouthan @mattodd And there are a number of commercial vendors who supply the software to do exactly that. However whilst you might expect that requirements for a chemistry ELN would be similar, in practice everyone wants it customised to their own needs.

cdsouthan commented 9 years ago

Which brings us back round to the idea that OSM should at least consider a (probably favourably academic-priced) commercial solution, as an interim measure. AWK results from this can be completely open. Given the project has dependencies on Syd Uni-funded SciFinder and ChemDraw anyway, it seems logically inconsistent to need all the other stuff to be OS (but it's of course fine where this fits robustly).

mattodd commented 9 years ago

The "everyone wants it customised" is one of the main arguments for the product being open source, and a key weakness of a proprietary piece of software. My conception here is something that is useful for OSM but also any other open source chem/bio/science project.

Yes, the idea of our temporarily using something commercial is a good one that we've talked about a little before. There's no philosophical inconsistency in using proprietary software if it does not interfere with the ability of people to do the science - we have to use what works, and I don't complain about people contributing to OSM using Windows. On the negative side we'd need to invest time in identifying the product, buying it (or negotiating a deal) and installing it, becoming familiar with it and then establishing a way of allowing people to enter data (single person or group). Ultimately we would want anyone making a compound to be able to register it, meaning we'd ideally have a group login covering any contributing chemist. All of this setting up takes time that we don't have to expend (on this temporary solution) if we just enter minimal data manually for the compounds we currently have, e.g. using text on a wiki, which is inelegant and ridiculous long-term but is something we can do right now.

I think the thing that would tip the balance is if we could identify and source a commercial product that does the minimum we need and which allows a group login for OSM (say, 5 people for now using shared credentials). We would need the output to be publicly accessible. For any company willing to donate such a licence to OSM there would be a good amount of positive PR. We could renegotiate after a year. We need something now - we don't have time to wait for a University-sponsored tender process.

Would anyone be willing to suggest one or two possible solutions, AND to reach out to the companies to discuss? Please forward this thread. If this needs to go through me then it won't get done in the short term since I don't have a good sense of which product might be worth pursuing and I don't speak cheminformatics fluently. I'm thinking for example of whether anyone has existing contacts at companies like ChemAxon.


drc007 commented 9 years ago

There is a review of ELNs here http://www.rsc.org/chemistryworld/2013/05/electronic-lab-notebook-review. I know of at least 40 different ELNs but I don't think ChemAxon has one?

mattodd commented 9 years ago

Hi Chris - I wasn't thinking of an ELN. I'd like to continue to use the system we have. I was thinking of a stand-alone, light compound registration system which we could use to generate an SDF.


bmpvieira commented 9 years ago

Maybe dat could be used as a data versioning backend and REST API for this? /cc @maxogden @mafintosh @karissa

bkatiemills commented 9 years ago

@bmpvieira oh hey, I didn't realize you had your eye on this project! Yes, I agree - Dat could be a great option for data management here.

alintheopen commented 9 years ago

https://db.tt/o4CFayp2


Hi All,

I've put together a spreadsheet featuring the different data that we would want to input and generate in a typical ELN entry. Each entry contains slightly different data depending on the exact nature of the procedure but I've tried to outline the key features.

The idea here is that rather than creating an ELN entry in the way that we do now, the scientist would input the minimum data required into a spreadsheet that would then push the data to the experimental entry and also to an archive page featuring compound index/quantity and associated data. In order for all data to be captured and write-ups to be high quality, I think that ideally we need to design something that has a smaller associated workload for an experimentalist than using a traditional paper lab notebook and one that features alerts/notifications until an entry is 'closed'. By closed, I mean that all data has been uploaded, a conclusion has been made and all data pushed to the index page.
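The 'closed' test could be expressed as a small check that drives the alerts. This is a sketch only; the three field names are hypothetical stand-ins for the three conditions listed above.

```python
# Hypothetical field names for the three closure conditions described above.
REQUIRED_FOR_CLOSURE = ("data_uploaded", "conclusion", "pushed_to_index")


def missing_for_closure(entry):
    """Return the closure requirements an entry has not yet met."""
    return [name for name in REQUIRED_FOR_CLOSURE if not entry.get(name)]


entry = {"data_uploaded": True, "conclusion": "", "pushed_to_index": False}
print(missing_for_closure(entry))
# prints: ['conclusion', 'pushed_to_index']
```

An entry is closed when the returned list is empty; anything left in the list is what a notification would nag about.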

I've picked AEW 215-1 as an example even though the experiment isn't 'closed' yet, as the product isn't completely pure. I thought this might make it a more interesting entry to work on but I can easily change to a completed experiment if preferred. (http://malaria.ourexperiment.org/triazolopyrazine_se/11384/Synthesis_of_R234fluoromethoxyphenyl124triazolo43apyrazin5yloxy1phenylethan1amine_AEW_2151.html)

I've used a colour coded key to explain the different data I have listed in the spreadsheet- here comes the breakdown:

Manual Data Input

I thought I'd start with the absolute minimum required for each entry (please feel free to chip in Mat if you think that I have missed anything out).

InChI string for starting material - the quickest input would really be to just sketch the SM and paste this into the spreadsheet/ELN, but until we have a system that can read structures accurately, an InChI string generated from ChemDraw or another drawing program would suffice. The InChI strings for other reagents and the product would also need to be manually added. The InChI would then generate MW data and potentially (if the compound is commercial) density and MSDS data too, so that the scientist need only add the starting mass and equivalents required for the limiting reagent.

The only other data that would need to be manually added is the actual mass of product obtained, yield could be automatically calculated from this data.

Automatic Generation (spreadsheet and ELN)

This data could be generated automatically from the minimal data requiring manual entry (shown above) and could be included in both spreadsheet and ELN. For example, the mass of any other reagents would be automatically calculated using the MW data extracted from the string and the number of equivalents. SMILES and any other strings could also be generated from the InChI.
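A rough sketch of those automatic calculations, reading the molecular formula from the formula layer of a standard InChI. The function names and the short atomic-mass table are assumptions for illustration; the parser only handles single-component formulas, and the yield calculation assumes 1:1 stoichiometry between limiting reagent and product. Converting InChI to SMILES is not attempted, since that needs a cheminformatics toolkit.

```python
import re

# Average atomic masses (g/mol) for a few common elements; extend as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
               "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904}


def molecular_weight(inchi):
    """Molecular weight from the formula layer of a single-component InChI."""
    formula = inchi.split("/")[1]
    if "." in formula or formula[0].isdigit():
        raise ValueError("multi-component formulas are not handled here")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


def reagent_mass_mg(limiting_mmol, equivalents, inchi):
    """Mass (mg) of a reagent needed for the given number of equivalents."""
    return limiting_mmol * equivalents * molecular_weight(inchi)


def percent_yield(product_mass_mg, limiting_mmol, product_inchi):
    """Isolated yield relative to the limiting reagent (1:1 stoichiometry)."""
    theoretical_mg = limiting_mmol * molecular_weight(product_inchi)
    return 100.0 * product_mass_mg / theoretical_mg


# Ethanol is used purely as a simple, well-known test structure.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
mass_needed = reagent_mass_mg(limiting_mmol=1.0, equivalents=1.2, inchi=ETHANOL)
```

With these in place the only numbers a chemist types are the limiting-reagent amount, the equivalents, and the mass of product obtained.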

Automatic Generation (just ELN)

Previous attempts to include structures in Excel spreadsheets have led to bulky files that are difficult to navigate. Therefore, perhaps the structural information would just appear on the ELN in the form of a reaction scheme for individual experiments and as an image in a tabulated index featuring all data for the compound.

External Link

This section is the 'push' to a GitHub post or to give the entry a DOI, or the 'pull' to embed DOIs of any publications referenced in the post.

Placeholder in ELN

This category denotes 'placeholders' or headings in the ELN so that a reader can easily identify sections of the write-up and the different sections could be extracted and saved in a database.

Uploaded Content

This is the most data-heavy part of the ELN and the thing that makes our lab book so powerful - the raw data. I don't know if it is possible to upload links to a spreadsheet first and then embed them in the ELN or if they would need to be manually uploaded. The selected entry included NMR data, ISOLERA data, hazard and risk assessment and will later include IR, HRMS, optical rotation etc.

alintheopen commented 9 years ago

Link to spreadsheet at top of previous post (originally created in Numbers, but PDFs, Excel and csv are also attached).

cdsouthan commented 9 years ago

Alice, the above looks great - but to what extent are you converging or diverging from the Southampton efforts below?

J Chem Inf Model. 2015 Mar 23;55(3):501-9. doi: 10.1021/ci5005948. Epub 2015 Feb 25. ChemTrove: Enabling a Generic ELN To Support Chemistry through the Use of Transferable Plug-ins and Online Data Sources.

Day AE, Coles SJ, Bird CL, Frey JG, Whitby RJ, Tkachenko VE, Williams AJ.

In designing an Electronic Lab Notebook (ELN), there is a balance to be struck between keeping it as general and multidisciplinary as possible for simplicity of use and maintenance and introducing more domain-specific functionality to increase its appeal to target research areas. Here, we describe the results of a collaboration between the Royal Society of Chemistry (RSC) and the University of Southampton, guided by the aims of the Dial-a-Molecule Grand Challenge, intended to achieve the best of both worlds and augment a discipline-agnostic ELN, LabTrove, with chemistry-specific functionality and using data provided by the ChemSpider platform. This has been done using plug-in technology to ensure maximum transferability with minimal effort of the chemistry functionality to other ELNs and equally other subject-specific functionality to LabTrove. The resulting product, ChemTrove, has undergone a usability trial by selected academics, and the resulting feedback will guide the future development of the underlying ELN technology.

drc007 commented 9 years ago

A couple of thoughts.

Would it be better to have the workflow from left to right mirroring a reaction scheme? Reactants on the left and product on the right. I presume you would be able to add more than two reactants if for example you were doing a Ugi reaction? How do you distinguish between reactants and reagents?

Should the link to the starting material be a field in each of the starting material sections? Will the link be to the actual reaction batch that provided the starting material, or to pages that detail the synthesis whether or not they actually provided that specific batch?

How would you handle a reaction that gave more than one product, e.g. isomers?

Does the calculated data (MWt etc) refer to parent compound or salt form(s) if isolated as such? Is EA elemental analysis? If so will the final product data be adjusted for adventitious water/solvent etc?

I guess a more philosophical question is why do you need another application, isn't this simply a different view of the same data captured in the ELN?

cdsouthan commented 9 years ago

I would just like to round off the earlier discourse on possible commercial solutions relevant to this thread. This is more in respect of the reg system rather than the ELN side, since Open Source momentum seems further advanced for the latter. Matt's reply https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/285#issuecomment-87491509 summarised the issues. Personally I'd have to say that requirement specifying and brokering on OSM's behalf would not be practical (and my main ChemAxon contact has moved on as it happens). I also suggest you absolutely need local support for the OSM core team to a) explain in detail what it says on the box, b) make sure it actually does it out of the box and c) come back in to fix it if it stops doing it. Consequently, having me (or anyone else even) in this loop would simply not work. The other problem is that openly using the term "interim solution" might not get the requisite sales rep out of bed. By all means put a call out (you have more social media reach anyway). However, why not also ping the Southampton crew along the lines of "good job on ChemTrove - so can you knock up a reg sys to go with it?"

mattodd commented 9 years ago

Hi @cdsouthan - as I understand it Chemtrove is a chemical layer on top of Labtrove intended to see chemical content. The system we're talking about here would involve data entry into something else entirely that could be used to build an ELN entry. I think these are fundamentally distinct. If Chemtrove provides a level of understanding that permits aggregation of human-written ELN entries into collections of relevant pages ("all synthetic attempts at molecule X", "Biological data relevant to molecule Y", even an SDF) then that might work. I guess it depends a little on the nature of the output, and what can be done with it. Again, perhaps the best thing to do is some stress-testing with a sample ELN entry, which Alice has now provided. i.e. for the entry as shown, what can Chemtrove do with it vs what are our requirements?

Re the point about trialling a system - yes, we'd need someone willing to roll it out and support it (technically) a little. The upside is that people could see in detail the capabilities of the system in handling a real dataset. I don't expect to drown in offers, but you never know.

drc007 commented 9 years ago

@mattodd @cdsouthan Hi, I've managed to set up an example of how we might view data. After a great deal of thought I decided that the best option would be one where there is no need for a dedicated SQL database, no client software to install, and it is free and open source. The solution is to leverage the superb work by Luc Patiny at EPFL to create a way of viewing your data using just a web browser (requires Chrome at the moment). If you open this link in Chrome you can see an example view of the data: http://www.cheminfo.org/Chemistry/Parsing%20data/Tab_delimited_Parallel_Coordinates.html?tsvURL=http%3A%2F%2Fgoogledocs.cheminfo.org%2Fspreadsheets%2Fd%2F1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc%2Fexport%3Fgid%3D0%26format%3Dtsv

The data is read automatically from a Google Docs spreadsheet into which I have entered data by hand, taken from the web pages:

https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=0
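For anyone wanting to point the viewer at a different sheet, the viewer link is just the viewer page plus a percent-encoded `tsvURL` parameter holding the sheet's tab-separated export address. A minimal sketch of building it is below; the function names are made up for the example, and the URL pattern is inferred only from the link posted in this thread, so it may not hold for other servers or future versions of the viewer.

```python
from urllib.parse import quote

# Both addresses are copied from the link in this thread.
VIEWER = "http://www.cheminfo.org/Chemistry/Parsing%20data/Tab_delimited_Parallel_Coordinates.html"
PROXY = "http://googledocs.cheminfo.org"


def tsv_export_url(sheet_id, gid=0):
    """Address that returns one worksheet of a Google spreadsheet as tab-separated text."""
    return f"{PROXY}/spreadsheets/d/{sheet_id}/export?gid={gid}&format=tsv"


def viewer_url(sheet_id, gid=0):
    """Viewer link with the export address percent-encoded into the tsvURL parameter."""
    encoded = quote(tsv_export_url(sheet_id, gid), safe="")
    return f"{VIEWER}?tsvURL={encoded}"


link = viewer_url("1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc")
```

Passing `safe=""` makes `quote` encode the slashes, `?`, `=` and `&` of the inner address so that they are not mistaken for parts of the outer URL.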

If we want to proceed with this approach we need to complete the Google Docs spreadsheet, adding missing compounds/data. We can then think about what other data we want to add to the spreadsheet; it would be nice to include the URL to a notebook page describing the synthesis. As more biological data is added we might also want to add links to the lab report page containing the experiment where the data was generated.

At the moment a range of chemical properties are calculated automatically, and there is a table (with search boxes at the top of each column) and a parallel coordinates view, but there is the option to add different types of visualisation, structure-based searching etc., all within the web browser.

mattodd commented 9 years ago

Sounds very interesting. But getting a blank sheet for the link in Chrome (and Firefox).

http://www.cheminfo.org/Chemistry/Parsing%20data/Tab_delimited_Parallel_Coordinates.html?tsvURL=http%3A%2F%2Fgoogledocs.cheminfo.org%2Fspreadsheets%2Fd%2F1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc%2Fexport%3Fgid%3D0%26format%3Dtsv

Anyone else?

drc007 commented 9 years ago

Do you not even see the headers?

drc007 commented 9 years ago

Should look like this. [image: forosm]

mattodd commented 9 years ago

Just this.

[image: screen shot 2015-04-15 at 11 27 55 pm]

What you just posted looks far more interesting. Am using Mac..?

drc007 commented 9 years ago

I'm using a Mac also. Can you see the google docs spreadsheet? (The second link).

mattodd commented 9 years ago

Yes, spreadsheet is fine I think - text from the strings section of the wiki, i.e. the same amount of data.

drc007 commented 9 years ago

Yes, that's correct. Must be an issue with access to the web server. Will check with Luc.

drc007 commented 9 years ago

Looks like it might be an issue with the development web server. Here is a new link.

http://goo.gl/UW4dxT

Remember it currently needs Google Chrome.

mattodd commented 9 years ago

That works - very impressive. I'd not seen this - fantastic stuff. (also works for me in Firefox btw).

Questions: 1) it's any tab-delimited file? 2) We can specify which columns are displayed and in which order? 3) We could add/remove columns from the Gdoc (I'm thinking URL, yes) and that would be reflected in the webpage display? 4) Could we also import an sdf, if we were to make one? 5) Q for @madgpap - if we maintained a file that was read in this way, could the molecules also be displayed at Chembl?

If we pursued this solution we'd need to designate the Gdoc as the primary place for data on the molecules, or we provide a tab delimited file elsewhere, or we generate an SDF.

Historical note - very early in OSM we played with a Google Doc containing all the structures on the project (happy days @incoherentboy ) but the file became too cumbersome because of the images. This solution gets round that issue.

drc007 commented 9 years ago

1) It can be used to display any tab-delimited file but you probably need an ID column followed by a SMILES column with appropriate headers. 2) Would each user want their own display? 3) URL should be fine, what columns do you want to add/remove? 4) SDF import would be possible but manually creating/editing an sdf is a far from trivial task. 5) What other data would you want included?
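On point 1, a quick sanity check of a tab-delimited export before pointing the viewer at it might look like the sketch below. The header names `ID` and `SMILES` are assumptions (the thread only says "appropriate headers"), and the two rows are placeholder molecules, not project compounds.

```python
import csv
import io


def read_compound_table(tsv_text, id_header="ID", smiles_header="SMILES"):
    """Parse tab-delimited text and check that the first two columns are ID then SMILES."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("expected at least an ID column and a SMILES column")
    if header[0].strip() != id_header or header[1].strip() != smiles_header:
        raise ValueError(f"first two headers are {header[:2]}, expected "
                         f"[{id_header!r}, {smiles_header!r}]")
    # Skip completely empty rows, keep everything else as header -> value mappings.
    return [dict(zip(header, row)) for row in reader if any(cell.strip() for cell in row)]


example = "ID\tSMILES\tnotes\nexample-1\tCCO\tethanol\n\nexample-2\tc1ccccc1\tbenzene\n"
rows = read_compound_table(example)
```

Because everything is plain text, the same check works on a file downloaded from the Google Doc or on one kept elsewhere.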

You can download tab-delimited text directly from the Google Doc; I then converted it to an sdf locally without issue. Do you want this to be automated?

I suspect you were using ChemDraw images? They contain all the information for setting up the printer and are huge. I would strongly encourage you to ONLY include alphanumeric data in the Google Doc. Then if the worst happens everything can be recovered using a simple text editor.

lpatiny commented 9 years ago

Hello,

My group is the author of those tools and we are organizing a workshop in Lausanne next Friday on how to use our framework. If you can come ... you are welcome (it is free)!

To answer some of your questions:

--> Would each user want their own display?

Yes. You can see plenty of fancy examples using our framework on http://www.cheminfo.org

--> SDF import

We could convert your SDF to tab-delimited. If you want this we could make a tool online as well. ChemDraw allows you to directly copy the SMILES for a structure; there is even a shortcut for it. From your table we could generate an SDF to download. This could be done directly from the resulting view and the SDF would be generated in the browser. We have all the technology to do it ... this means I guess we just need half a day of work.

drc007 commented 9 years ago

We need to think about the column headers, "Potency" really is not much help describing the target/assay.

Can we have a systematic naming convention, perhaps

Target measurement units

PfaI IC50 uM

PfATP4 IC50 nM

HERG pIC50

THP1 IC50 uM
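One benefit of a "Target measurement units" convention is that headers become machine-readable, so values reported in different units can be put on a common scale. The sketch below is only an illustration under that assumption: the function names are hypothetical, it handles just IC50 and pIC50 columns, and it uses the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math

# Conversion factors from the unit label in the header to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def parse_header(header):
    """Split 'Target measurement units' into its parts; units may be absent (e.g. pIC50)."""
    parts = header.split()
    if len(parts) == 3:
        target, measurement, units = parts
    elif len(parts) == 2:
        (target, measurement), units = parts, None
    else:
        raise ValueError(f"header does not follow the convention: {header!r}")
    return {"target": target, "measurement": measurement, "units": units}


def to_pic50(value, header):
    """Convert a value from an IC50 or pIC50 column to pIC50."""
    info = parse_header(header)
    if info["measurement"] == "pIC50":
        return value
    if info["measurement"] == "IC50" and info["units"] in UNIT_TO_MOLAR:
        return -math.log10(value * UNIT_TO_MOLAR[info["units"]])
    raise ValueError(f"cannot convert column {header!r} to pIC50")


parsed = parse_header("PfATP4 IC50 nM")
one_micromolar = to_pic50(1.0, "THP1 IC50 uM")
```

A target name containing a space would break the simple `split()`, so the convention would need to forbid that or use a different separator.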

It would be useful if we had a URL to a page describing the assay.

Also the activity column should only contain numbers, not ">" or "<", if we need to include qualifiers then we need a separate column.
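If existing cells already contain entries such as ">10", they could be split into a qualifier column and a numeric column rather than retyped. A minimal sketch, with a hypothetical function name and the assumption that a bare number means "=":

```python
import re

# Optional qualifier followed by a plain or scientific-notation number.
_ACTIVITY = re.compile(r"^\s*(>=|<=|>|<|=|~)?\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")


def split_activity(cell):
    """Return (qualifier, value) for a cell such as '>10', '< 5' or '0.35'."""
    match = _ACTIVITY.match(cell)
    if match is None:
        raise ValueError(f"not a recognisable activity value: {cell!r}")
    qualifier = match.group(1) or "="   # assumption: no symbol means an exact value
    return qualifier, float(match.group(2))


qualifier, value = split_activity(">10")
```

Cells that do not parse raise an error instead of being silently coerced, so free-text entries get flagged for manual attention.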

Some records have OSM identifiers with links to pages describing the synthesis, e.g. http://malaria.ourexperiment.org/osm_procedures/9868/Preparation_of_OSMS201.html. Is the plan to have this for all compounds?

lpatiny commented 9 years ago

I would suggest having another Google Docs document that describes the columns.

I give it currently in read-only mode here (Chris, you now have access to this whole Google Docs folder; please share it with the people who are interested):

https://docs.google.com/spreadsheets/d/1CyF9O1zu46I-l2eRgRtMK8oBZRJV2um0482j17aVIC4/edit?usp=sharing

In the future we will then be able to join those 2 excel spreadsheets to give all the information on one page
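The join itself could be as simple as looking up each data column in the description sheet. The sketch below assumes the description sheet has one row per data column with fields named `column`, `description` and `assay_url`; those field names and the toy content are hypothetical, since the real layout of the second sheet is not specified here.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-delimited text into a list of header -> value dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def describe_columns(data_headers, description_rows, key="column"):
    """Match each data header to its description row; also report headers with none."""
    lookup = {row[key]: row for row in description_rows}
    described = {h: lookup[h] for h in data_headers if h in lookup}
    missing = [h for h in data_headers if h not in lookup]
    return described, missing


# Toy description sheet; the URL is a placeholder, not a real assay page.
descriptions = read_tsv(
    "column\tdescription\tassay_url\n"
    "PfATP4 IC50 nM\texample assay description\thttp://example.org/assay\n"
)
described, missing = describe_columns(["ID", "SMILES", "PfATP4 IC50 nM"], descriptions)
```

Listing the undescribed headers makes it easy to see which columns still need an entry in the description sheet.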

If you want to go for our “visualizer” approach we can also create a specific “flavor” with various ways to see / browse your data.

We have such a site that includes some specific tools. For example, for my biooriented organic chemistry class I have

http://www.cheminfo.org/flavor/biooriented/index.html

In a similar way we could create a malaria website that would just display the data from the google docs (spreadsheet) files

We can also display related PDB entries if there are any of interest in your field (and I guess there must be a lot).

Another spreadsheet in Google Docs … could hold the PDB codes and some comments about those proteins. You could then browse them with a view like http://www.cheminfo.org/Protein/JSMol/PDB_Selector.html

We can also display any structural analysis data, xray, 3D models, Mass spectra, NMR spectra, ...

drc007 commented 9 years ago

@mattodd Matt, if adopted this would be a significant change for the project. Do you want to organise a meeting to discuss? Do you use Google Hangouts?

drc007 commented 9 years ago

@lpatiny Will the workshop be streamed/recorded? I'm sure lots of people would be interested.

lpatiny commented 9 years ago

We will try to do something for the recording but it is not obvious it will be done the same day