OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

How do we Manage the Size of the Master Compound List? #319

Open mattodd opened 9 years ago

mattodd commented 9 years ago

Advice needed @lpatiny @drc007 @cdsouthan (and anyone else obviously...) I want to make sure we're doing this sheet correctly, meaning a few decisions ought to be made up front. There are a number of people interested in volunteering for OSM for data entry right now.

(Context: We're doing retrospective data entry, moving away from the Experimental Procedures lab notebook as a repository of all OSM data and towards a Google Sheet that can be more easily manipulated.)

Browsing SAR is the first use of this sheet, so obviously the spreadsheet should contain compound strings and codes, as well as potencies. But the sheet will need more.

Question 1: can we extend the spreadsheet limitlessly? What we need to do is include columns for all kinds of biological data, meaning columns for values, units and qualifiers. Won't performance/usability suffer if the sheet has e.g. 100 columns? Or should we treat this as a primary data source, like an SDF, and assume that we will always extract data from it for focussed analysis?

Question 2: I'd like to include URLs for original sources of data (i.e. lab notebook pages). Do we include URLs as text in distinct cells, or can we hyperlink from text/numerical values? e.g. if compound X has potency value Y, and the raw data for that value is on web page Z do we have a separate column containing the value Z or do I hyperlink "Y" with Z.

Question 3: I want to do the same with synthetic chemistry - a way of collecting all the attempts at the synthesis of a given molecule. This was the ultimate purpose of the original "Experimental Procedures" ELN and is very useful. A sheet like this would involve a lot of columns. Again, do we hyperlink on values, or do we paste text-based URLs into cells?

Question 4: Would it be better (from a usability perspective) to have a separate sheet for Synthetic Chemistry, or does it not matter? I'd like to make sure we handle synthesis in a way that allows us to note batch numbers for preparations (which we're not doing well right now). A way of doing that is to have each preparation use the chemist's own code as the text value inserted into a given column, and then hyperlink back to the relevant page containing the prep. But doing this relies on hyperlinks in sheets being OK...

For an idea of the data that needs to be entered for a compound take a look at the page for one of the best compounds from Series 1, OSM-S-5. How best to translate what's on that page to spreadsheet format so that the data are useful but also link back to original data sources to maintain the ability to navigate the project?

Also pinging for info @PaulBarrett79 , involved in ChemReg

drc007 commented 9 years ago

Welcome to my world! As the project data gets larger the question of organisation becomes critical.

  1. You could extend the spreadsheet (the max number allowed is 256). However much of the information will be redundant, I presume the units will be the same for all results from the same assay. I created a second spreadsheet https://docs.google.com/spreadsheets/d/1CyF9O1zu46I-l2eRgRtMK8oBZRJV2um0482j17aVIC4/edit#gid=0 that contains descriptions of the columns, this can be used to capture repetitive information.
  2. Not sure what you mean, if you have an experiment number for the biological data can you link to that?
  3. Is the ELN structure searchable? If so, use the SMILES/InChI in the spreadsheet to construct a search query.
  4. See above.

pinging @madgpap to give a ChEMBL perspective

drc007 commented 9 years ago

Just as a bit of background, this is why most companies use a relational database. You can have one table per assay, linked via compound id, then build a summary view which allows rapid searching of key project data, and allow drill down into repeats, individual batches etc. However you would probably need a paid up SQL expert to create it for you, and that is certainly not me!
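As a rough illustration of the layout described above, here is a minimal sketch using Python's built-in `sqlite3` module. It is not an agreed OSM schema: the table and column names, the compound code `CMPD-1`, the batch labels and the `example.org` URLs are all placeholders, and a single `assay_result` table with an `assay_name` column stands in for "one table per assay" to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    compound_id TEXT PRIMARY KEY,   -- placeholder for a project compound code
    smiles      TEXT
);
CREATE TABLE assay_result (
    result_id   INTEGER PRIMARY KEY,
    compound_id TEXT NOT NULL REFERENCES compound(compound_id),
    assay_name  TEXT NOT NULL,      -- stands in for "one table per assay"
    batch       TEXT,
    value       REAL,
    units       TEXT,
    source_url  TEXT                -- page holding the raw data
);
-- Summary view: one row per compound and assay, averaging the repeats.
CREATE VIEW summary AS
SELECT compound_id, assay_name, COUNT(*) AS n_repeats, AVG(value) AS mean_value
FROM assay_result
GROUP BY compound_id, assay_name;
""")

# Hypothetical data: one compound tested twice in the same assay.
conn.execute("INSERT INTO compound VALUES (?, ?)", ("CMPD-1", "CCO"))
conn.executemany(
    "INSERT INTO assay_result "
    "(compound_id, assay_name, batch, value, units, source_url) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("CMPD-1", "potency_lab_a", "batch-1", 20.0, "nM", "https://example.org/page1"),
        ("CMPD-1", "potency_lab_a", "batch-2", 26.0, "nM", "https://example.org/page2"),
    ],
)

# Rapid search over the summary, then drill down into the individual repeats.
summary = conn.execute("SELECT * FROM summary").fetchall()
repeats = conn.execute(
    "SELECT batch, value, source_url FROM assay_result "
    "WHERE compound_id = ? ORDER BY batch",
    ("CMPD-1",),
).fetchall()
print(summary)
print(repeats)
```

The summary view is what people would browse day to day; the underlying rows keep each repeat, batch and source page available for drill-down.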

mattodd commented 9 years ago

1. Yes, units should be the same for each assay. There will be multiple labs performing slightly different assays - hence the desire to capture that and link to raw data, obviously.

2. If a compound has a potency value of 23 nM, and that value arises from an experiment that is described in full on website X, do I include a column that says 23 and a column that says X or do I just have one column where the value "23" possesses a link to webpage X? Does that matter? I'm working on the assumption that the use of hyperlinks in spreadsheets is suboptimal.

3. No, not searchable, which is a major issue. Options are to fix manually by pasting strings in the synthetic ELNs (don't want to do this) or pasting links in the Google sheet (less arduous) or admitting defeat here and requiring that we use an ELN that is structure searchable.

drc007 commented 9 years ago

I tend to err on the side of paranoia: I'd have a new column entitled ExperimentPage containing the name of the page and the hyperlink. That way if all else fails you can read and retype ;-)

Sounds like pasting links in the Google sheet is the only option at the moment, this will at least capture the data. Perhaps think about someone building a lookup service in the future?

wvanhoorn commented 9 years ago

1. From Google help: Size: Up to 2 million cells. There is no explicit limitation on number of rows or columns. When I download the current version as text it's 48 kb, a very long way from being large.
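A back-of-envelope check of that limit, as a small sketch. The 2 million cell figure is the one quoted above and the 100 columns come from the original question; the 250 rows is an assumed order of magnitude for the compound list, not a measured value.

```python
CELL_LIMIT = 2_000_000   # "Up to 2 million cells", as quoted from Google help
n_columns = 100          # column count used as the example in Question 1
n_compounds = 250        # assumed number of rows (order of magnitude only)

cells_used = n_compounds * n_columns
max_rows_at_this_width = CELL_LIMIT // n_columns

print(cells_used)              # cells occupied by the assumed sheet
print(max_rows_at_this_width)  # rows that would fit at 100 columns
```

Under these assumptions the sheet would use 25,000 cells, and the cell limit would only be reached at 20,000 rows of 100 columns.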

2/3/4. I favour that the gsheet(s) remains machine-readable, a value that is a hyperlink may diverge from that. Separate value and hyperlinks seems the safest way to do this.
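A minimal sketch of why separate columns help, assuming the sheet is exported as CSV. The column names `potency_nM` and `potency_source_url`, the compound code and the URL are placeholders, not the real sheet headers.

```python
import csv
import io

# Hypothetical CSV export: the value and its source link live in separate columns.
sheet_export = io.StringIO(
    "compound_id,potency_nM,potency_source_url\n"
    "CMPD-1,23,https://example.org/notebook/page-z\n"
)

records = []
for row in csv.DictReader(sheet_export):
    records.append(
        {
            "compound_id": row["compound_id"],
            # The value column stays purely numeric, so it parses directly.
            "potency_nM": float(row["potency_nM"]),
            # The link is ordinary text and survives any export format.
            "source": row["potency_source_url"],
        }
    )

print(records)
```

A hyperlink attached to the cell "23" would generally be lost in a plain-text export like this one, whereas a URL typed into its own cell is just another column.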

Separate sheets with links between them, one containing the repeated information of the other: this is so much better implemented as a relational database as already mentioned by @drc007. I have some experience in building relational databases to store screening data, in short it's easy to cover 80% of what you want and much more work covering nearly 100% of what you want. And it would be reinvention of the wheel, there are off the shelf solutions that combine compound registration & storing of assay results (and more): Collaborative Drug Discovery and ScienceCloud.

lpatiny commented 9 years ago

Question 1: can we extend the spreadsheet limitlessly? What we need to do is include columns for all kinds of biological data, meaning columns for values, units and qualifiers. Won't performance/usability suffer if the sheet has e.g. 100 columns? Or should we treat this as a primary data source, like an SDF, and assume that we will always extract data from it for focussed analysis?

When we started the Google Docs project, a second Google Doc was created to discuss the content of each column:

https://docs.google.com/spreadsheets/d/1CyF9O1zu46I-l2eRgRtMK8oBZRJV2um0482j17aVIC4/edit?usp=sharing

The idea in this spreadsheet is that the first column contains the exact column names of the data spreadsheet.

We can then have an unlimited amount of information about the analytical methods. This could include DOI, Units, …

Of course in the data spreadsheet a column should only contain values and always in the same unit. Otherwise it would be difficult to do any data mining on it. We will be able to combine those 2 spreadsheets (data and definition of columns) to display everything on the same page in cheminfo.
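Combining the two spreadsheets could look roughly like the sketch below, assuming both are exported as CSV. The header names (`column_name`, `units`, `lab`), the assay column `potency_lab_a` and the values are placeholders; this is not how cheminfo does it internally.

```python
import csv
import io

# Definitions sheet: first column holds the exact column names of the data sheet.
definitions_csv = io.StringIO(
    "column_name,units,lab\n"
    "potency_lab_a,nM,Lab A\n"
)
# Data sheet: values only, always in the unit given by the definitions sheet.
data_csv = io.StringIO(
    "compound_id,potency_lab_a\n"
    "CMPD-1,23\n"
    "CMPD-2,\n"
)

definitions = {row["column_name"]: row for row in csv.DictReader(definitions_csv)}

annotated = []
for row in csv.DictReader(data_csv):
    for column, raw in row.items():
        # Only columns described in the definitions sheet are treated as data;
        # empty cells mean "not measured" and are skipped.
        if column in definitions and raw != "":
            meta = definitions[column]
            annotated.append(
                (row["compound_id"], column, float(raw), meta["units"], meta["lab"])
            )

print(annotated)
```

Because the units and lab are looked up by column name, they are stored once in the definitions sheet instead of being repeated on every row.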

Question 2: I'd like to include URLs for original sources of data (i.e. lab notebook pages). Do we include URLs as text in distinct cells, or can we hyperlink from text/numerical values? e.g. if compound X has potency value Y, and the raw data for that value is on web page Z do we have a separate column containing the value Z or do I hyperlink "Y" with Z.

Question 3: I want to do the same with synthetic chemistry - a way of collecting all the attempts at the synthesis of a given molecule. This was the ultimate purpose of the original "Experimental Procedures" ELN and is very useful. A sheet like this would involve a lot of columns. Again, do we hyperlink on values, or do we paste text-based URLs into cells?

Question 4: Would it be better (from a usability perspective) to have a separate sheet for Synthetic Chemistry, or does it not matter? I'd like to make sure we handle synthesis in a way that allows us to note batch numbers for preparations (which we're not doing well right now). A way of doing that is to have each preparation use the chemist's own code as the text value inserted into a given column, and then hyperlink back to the relevant page containing the prep. But doing this relies on hyperlinks in sheets being OK...

I have the impression that the “data” spreadsheet should mainly contain the information used for data mining. However we could create other spreadsheets where the first column is the OSM molecule ID and the other columns hold lab notebook references or any other information, without limit.

A molecule ID could be repeated many times as well, depending on the need.

Something like:

https://docs.google.com/spreadsheets/d/1CqauhewG2UEZejq_ITA7n4Aix-v1tbQSiyI0arpFyaE/edit?usp=sharing

Again, we can always combine all this information later so that from a molecule you get all the related information directly.

drc007 commented 9 years ago

@lpatiny has highlighted an important point, the original idea behind the google doc was to provide a place where people could go to find data rather than having to transcribe from multiple web pages that may or may not be indexed. This looks like it will be very successful, with volunteers able to enter data in a familiar environment.

The sort of tasks you are now describing are more project data management. This tends to be less exciting but it is very important. Having multiple spreadsheets is an option but you may find getting people to enter what may be the same data multiple times into multiple places a challenge to enforce.

Perhaps you need to think about a database and user-friendly web-based front end. I would have thought this would be an ideal project for a suitable final year or summer student. Especially if they were thinking of this sort of thing as a career, they would be able to point to a publicly accessible demonstration of their talents!

mattodd commented 9 years ago

Thanks for this guys, very useful.

The most pressing objective is a database of biological values, for data mining and visualisation. The secondary objective is one with synthetic data, which would aid data management and paper writing. I'm happy to focus on the first objective for now, i.e. manual entry of all the biological data, by which I mean potency but also things like metabolic clearance data. The secondary objective would be mostly solved if we used a lab book that was structure searchable.

Thanks for the reminder about the second sheet that contains the details of the assays, units etc. This brings up an important issue that must have a generic solution. What if three labs are running a similar assay? How do we capture that the data are clearly related but have a different provenance? As an example, here's a screenshot from the first OSM paper we're about to submit, where the data from different labs are shown and footnotes are used to direct the reader to the different assay methods and raw data files (which are in the SI and in the ELN).

[Screenshot, 2015-07-15: table from the first OSM paper showing data from different labs with footnotes]

To capture this, should there be an additional column in the primary Google sheet that reads, for example, "Pfal EC50 Assay Number", with values 1, 2, 3 or somesuch, where the second sheet captures the details of the nature of the assay? This is important (and is managed I think in ChEMBL this way, @madgpap ?) because different labs running the same assay can generate very different data, obviously.

I'm about to input a bunch of metabolic clearance data in advance of the meeting on Monday, and ought to try my best to capture this, so it might be a chance for us to decide this "live".

Relational databases - yes. Want something open source, and I'm sure @lpatiny knows well what is available or in development. Am assuming that chemreg is aiming for this, but I'm not sure. As you've said above, a huge advantage of the current use of a Google sheet as the front end for data entry is that the data entry is simple and therefore crowdsourcable.

lpatiny commented 9 years ago

Thanks for the reminder about the second sheet https://docs.google.com/spreadsheets/d/1CyF9O1zu46I-l2eRgRtMK8oBZRJV2um0482j17aVIC4/edit#gid=0 that contains the details of the assays, units etc. This brings up an important issue that must have a generic solution. What if three labs are running a similar assay? How do we capture that the data are clearly related but have a different provenance? As an example, here's a screenshot from the first OSM paper we're about to submit, where the data from different labs are shown and footnotes are used to direct the reader to the different assay methods and raw data files (which are in the SI and in the ELN).

https://cloud.githubusercontent.com/assets/4386101/8697799/6c2e8146-2b3b-11e5-919b-61f1baf83ba8.png

3 different labs making the same test should yield 3 different columns. It is never exactly the same experiment …

And 3 different columns means that in the description of the column (the other spreadsheet) there should be the lab name.

Afterwards we can always decide if we want to take an average, max, min or whatever if it is the same test.

In the spreadsheet of the tests it would be nice to have a generic name so that we can automatically combine those experiments for data mining.
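A sketch of how a generic name could be used to combine per-lab columns after the fact. The column names, the generic name `pfal_ec50` and the numbers are invented for illustration; only the idea of one column per lab plus a shared generic name comes from the discussion above.

```python
from statistics import mean

# Taken from the (hypothetical) assay sheet: data column -> generic assay name.
column_to_generic = {
    "pfal_ec50_lab_a": "pfal_ec50",
    "pfal_ec50_lab_b": "pfal_ec50",
    "pfal_ec50_lab_c": "pfal_ec50",
}

# One row of the data sheet; None marks "not tested in this lab".
row = {
    "compound_id": "CMPD-1",
    "pfal_ec50_lab_a": 20.0,
    "pfal_ec50_lab_b": 30.0,
    "pfal_ec50_lab_c": None,
}


def combine(row, column_to_generic):
    """Group per-lab values by generic assay name and summarise each group."""
    grouped = {}
    for column, generic in column_to_generic.items():
        value = row.get(column)
        if value is not None:
            grouped.setdefault(generic, []).append(value)
    return {
        generic: {
            "n": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for generic, values in grouped.items()
    }


combined = combine(row, column_to_generic)
print(combined)
```

The per-lab columns stay untouched, so the choice between average, max or min can be changed later without re-entering anything.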

Relational databases - yes. Want something open source, and I'm sure @lpatiny https://github.com/lpatiny knows well what is available or in development. Am assuming that chemreg https://chembiohub.ox.ac.uk/blog/2015/03/10/introducing-chemreg-compound-registration-system.html is aiming for this, but I'm not sure. As you've said above, a huge advantage of the current use of a Google sheet as the front end for data entry is that the data entry is simple and therefore crowdsourcable.

We are doing many open-source projects going in this direction but they are not ready for production. I don’t want OSM to beta-test ;)

We also have the possibility to save NMR, IR, … information on-line. You may have a look at:

http://www.cheminfo.org/Spectra/NMR/Predictions/1H_Prediction.html http://www.cheminfo.org/Spectra/NMR/Predictions/1H_Prediction.html?viewURL=http://couch.cheminfo.org/cheminfo/eea0ba081ea2cc99da5c1aed2f29a0a8/view.json?rev=38-67e37694187cb13c3456e1a28eb6209b&v=v2.17.5

That shows the possibility to display / predict NMR spectra. You may also have a look at the automatic assignment on-line:

http://www.cheminfo.org/Spectra/NMR/Peak%20picking/1D_peak_picking_and_assignment.html

However it is still a little bit early for production.

Now as you know all our projects are open-source (MIT or BSD license) and if you are a computer geek and would like to spend 3 months in my group for an internship to contribute to the development there are possibilities. We could make this database together.

drc007 commented 9 years ago

This is looking more and more like a relational database. Ideally you would have a different spreadsheet/table for each assay with an experiment ID linked to the lab notebook page so you can go back to look at the data for the compound (or compounds) that were tested in a specific experiment. You need to think about how much granularity you want to implement.

mattodd commented 9 years ago

If by Relational Database you mean that each datum has provenance, then that's what I mean. The nature of OSM, that data can come from multiple sources, makes this almost inevitable, I think. Given that we're talking about hundreds of datapoints not hundreds of thousands I think the bio database here is manually manageable at the outset, at this stage. Let me start on the metabolic data and see how we go. If it's like walking through treacle we may have to rethink.

MATTHEW TODD | Associate Professor School of Chemistry | Faculty of Science

THE UNIVERSITY OF SYDNEY Rm 519, F11 | The University of Sydney | NSW | 2006 T +61 2 9351 2180 | F +61 2 9351 3329 | M +61 415 274104 E matthew.todd@sydney.edu.au | W http://sydney.edu.au/science/chemistry/research/todd.html | W http://opensourcemalaria.org/

CRICOS 00026A This email plus any attachments to it are confidential. Any unauthorised use is strictly prohibited. If you receive this email in error, please delete it and any attachments.

drc007 commented 9 years ago

How do you plan to accommodate occasions when a compound is tested multiple times in an assay, and what happens if one result is an outlier? Would you be able to tell if other compounds tested in the same experiment had anomalous results? Will you include internal controls from the metabolic data? If not, how do you compare assays run in different labs? How will you differentiate between different batches of a compound?

mattodd commented 9 years ago

5 x (Don't know). I think we can tread a pragmatic line in the database between perfect knowledge and summary data. The answers to these questions will be captured in the primary datasets that are posted (e.g. controls used in potency evaluations), and we include as much as we can in the database without it becoming overwhelming. So I would include in the bio database something about the source of the data (since different labs have very different habits) but would not include (for now) more granular data such as batch numbers in the bio database (though clearly that is important). Perfect knowledge would involve time as a variable too, but we just can't do everything. Compromise: include a description of assay. Reasonable?

mattodd commented 9 years ago

Tester: added in two new columns (J and K) for data from one lab, and added descriptions in the descriptor sheet that include a URL for where the data came from. Adding this in made much easier by having these compounds in the primary sheet in the first place, thank you Chris! More metabolic data entry tomorrow.

drc007 commented 9 years ago

Just remember it is easier to have extra data in the spreadsheet that you don't use rather than have to go back and add historical data ;-)

mattodd commented 9 years ago

To follow up on this issue of ballooning of the Google Sheet take a look at it now - lots of columns for various biological data, with headers used that correspond to specific assays in the Assay Sheet.

OK?

Shouldn't we merge the sheets into the same location, i.e. Assay sheet becomes sheet 2 of the main Google Sheet, or is it easier to extract data if they are separate? @lpatiny

drc007 commented 9 years ago

At the moment I'd keep them in different sheets, keeping the original sheet as the summary table. At some point you may want to have individual sheets for specific assays or a sheet that only has chemistry specific information (mp, spectra etc.).

cdsouthan commented 9 years ago

Latest Google data sheets look good, with a few caveats. I did some preliminary checks

1) http://www.ncbi.nlm.nih.gov/sites/myncbi/christopher.southan.1/collections/48338932/public/ From the 250 InChIs the url above shows the 167 that were mapped by the PubChem identifier mapping service https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi . The rest are (by absence of exact InChI match) thus not in PubChem. Of the 167, 155 have a ChEMBL match (but have not checked concordance with activity values)

2) Need to turn off the "AuxInfo" in InChI string generation as it screws up the PubChem mappings and the chemicalize.org conversions

3) I would have used the InChIKey column for the mapping but it's not complete and seems a bit messed up, with some InChI strings in the same column

4) From this summary sheet I'd dump the % inhibition column

5) Column H is useful for initial SAR look-see but you should trim to three sig figs (arguably only two are meaningful).

6) For some reason chemicalize.org is not processing the full sheet (only 92 rows) but I can try instantiating it differently

8) Suggest a) revise sheet please (then I can repeat and extend the analysis) b) for the structures with summary data suggest submit as a PubChem BioAssay (now easier than before, may be able to help) c) I'd argue for depositing all the rest of the structures in PubChem whether they have been made or are pending designs (but clearly labelled as such). PubChem would have no objection to this small number of virtuals and their instantiation is valuable for global findability by similarity search as well as exact match. Groups being able to openly see each other's design phases months in advance of eventual data surfacing could be useful (and AWK it's difficult to search for something you don't even know exists)
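Two of the clean-up steps in the list above (dropping AuxInfo and trimming to three significant figures) can be sketched in a few lines. This assumes the AuxInfo block follows the InChI in the same cell and starts with the literal text `AuxInfo=`; the helper names are made up, the AuxInfo content below is a placeholder, and the better fix is still to switch AuxInfo off where the strings are generated.

```python
from math import floor, log10


def strip_auxinfo(cell: str) -> str:
    """Keep only the InChI part of a cell that may also carry an AuxInfo block."""
    return cell.split("AuxInfo=", 1)[0].strip()


def round_sig(value: float, sig: int = 3) -> float:
    """Round a number to `sig` significant figures."""
    if value == 0:
        return 0.0
    decimals = sig - 1 - floor(log10(abs(value)))
    return round(value, decimals)


# Methane InChI followed by a placeholder AuxInfo block on the next line.
cell = "InChI=1S/CH4/h1H4\nAuxInfo=<auxiliary block>"
clean = strip_auxinfo(cell)

print(clean)
print(round_sig(23.4567))      # three significant figures
print(round_sig(1234.5, 2))    # two significant figures
```

Cells that contain no AuxInfo pass through `strip_auxinfo` unchanged, so the helper can be applied to the whole column.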

drc007 commented 9 years ago

How do we get the synonyms for structures included e.g. MMV639565

Cheers

Chris

lpatiny commented 9 years ago

Synonyms should be ideally placed in different columns … Synonym 1, Synonym 2, …

If you don’t want to add many columns, because in 99% of the cases there are no synonyms, you could put all the names in the same column and separate them by “; ”. A semicolon is never present in an IUPAC name, and therefore it is easy to split this field to find all the names.
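Splitting such a field is a one-liner. A small sketch, with made-up names standing in for real compound codes:

```python
def split_synonyms(cell: str) -> list:
    """Split a '; '-separated synonym cell into a list of clean names."""
    return [name.strip() for name in cell.split(";") if name.strip()]


def join_synonyms(names: list) -> str:
    """Write the names back in the same '; '-separated form."""
    return "; ".join(names)


names = split_synonyms("NAME-A; NAME-B;NAME-C")
print(names)
print(join_synonyms(names))
```

Splitting on the bare semicolon and stripping whitespace tolerates entries where the space after the separator was forgotten.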

drc007 commented 9 years ago

At the moment a search of PubChem for MMV639565 yields no results.

cdsouthan commented 9 years ago

Synonyms are a big issue that we need to keep on top of - before ending up with cross-mapping spaghetti that you can no longer unravel (you might try to retrospectively but any mismappings will have propagated around the globe in the meantime) but I will keep these comments brief.

1) There are MMV, OSM, ChEMBL and PubChem IDs you need to x-map (ChemSpider optional perhaps).

2) In the first instance lock these x-mappings down internally but try to minimise the external surfacing (sounds paradoxical I know, but there are reasons), e.g. suggest don't surface OSM numbers into PubChem SIDs.

3) ChEMBL have already surfaced many MMV numbers as synonyms in PubChem (I could count them probably) but not MMV639565 - I guess the primary mappings were from OSM?

4) The usual name-to-structure complexities arise vis synthesis batch IDs and/or preps with enantiomeric enrichment (e.g. the story for http://cdsouthan.blogspot.se/2015/05/antimalarial-dot-joining-for-mmv008138.html)
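One way to keep the cross-mappings locked down internally is a flat mapping table keyed on structure, checked for codes that point at more than one structure. The sketch below is only an illustration: the structure keys, identifiers and source labels are placeholders, and in practice the key would be something like a full InChIKey.

```python
from collections import defaultdict

# (structure key, ID type, identifier, where the mapping came from).
# All values are placeholders for illustration.
mappings = [
    ("KEY-AAA", "OSM", "OSM-PLACEHOLDER-1", "internal sheet"),
    ("KEY-AAA", "MMV", "MMV-PLACEHOLDER-1", "internal sheet"),
    ("KEY-BBB", "MMV", "MMV-PLACEHOLDER-1", "external database"),
]


def find_conflicts(mappings):
    """Return identifiers that have been mapped to more than one structure."""
    structures_by_id = defaultdict(set)
    for key, id_type, identifier, _source in mappings:
        structures_by_id[(id_type, identifier)].add(key)
    return {
        ident: sorted(keys)
        for ident, keys in structures_by_id.items()
        if len(keys) > 1
    }


conflicts = find_conflicts(mappings)
print(conflicts)
```

Running a check like this before anything is surfaced externally would catch a mismapping while it is still cheap to fix.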

drc007 commented 9 years ago

Thanks, your inside knowledge is invaluable. Regarding OSM identifiers, we may need them visible because they are probably the best way to link chemical information especially for intermediates that don't get tested in bioassays?

cdsouthan commented 9 years ago

Fine - don't want to be overly pedantic about this. So, if any particular home-grown ID is useful for ROW (Rest of World) by all means get it out there. But I'd advise only doing this via a carefully OSM-provenanced and x-checked set of PubChem SIDs. Crucially, you can then do any retro fixing (e.g. new names needed when enantiomers are purified and the activity splits significantly) but re-submit/refresh that de-facto mapping file in PubChem (and leave the CID naming heuristics to do their thing). AWK it will only be a matter of time before a Chinese vendor picks up an OSM ID that then becomes Google indexed (but they might even have the n2s correct until it's revised...)

cdsouthan commented 9 years ago

JFTR we should check Google findability e.g. for InChIKeys and code names. A quick pop with UYWFTHNNGLIMAY indicates OSM surfacing but not directly from that Google sheet. Not sure re github. What we are writing here is not being scraped.

cdsouthan commented 9 years ago

Another JFTR but on-topic. How many structures from your current 250 are at least potentially enantiomerically splittable or enrichable by experimental techniques (i.e. preps could be assayed)?

mattodd commented 9 years ago

Anything chiral could in theory have any ee. #172

cdsouthan commented 9 years ago

So which algorithm can you run to count the chiral centres in the 250? Gotta be something on your Mac for this Chris ?

alintheopen commented 8 years ago

Hi, can we add an extra ID column so that compounds with both internal chemist codes and MMV numbers can have both listed? Happy for me to add?

mattodd commented 8 years ago

Well, I've been adding all the codes into the same column. Not sure if that's a problem or not, provided we separate the values with something consistent like commas or semicolons. See my last point in the opening post of #354


alintheopen commented 8 years ago

ok, just think it is cleaner with MMV numbers separate, but whatever everyone prefers.

mattodd commented 8 years ago

I've no preference - if there's a good reason for one rather than the other in terms of @lpatiny 's system, then we go with that.


madgpap commented 8 years ago

I'd keep them separate.

George


lpatiny commented 8 years ago

Indeed, it is never good to put several pieces of information in the same column. Better to add a column with a specific header and an explanation of that header in the other document.
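As a rough illustration of moving from one combined code column to separate columns, here is a minimal Python sketch. It assumes that MMV numbers always look like "MMV" followed by digits (as they do in this thread) and that codes in a cell are separated by commas or semicolons. The function name `split_codes` and the example cell are made up for illustration and do not assert that these codes belong to the same compound.

```python
import re

# Assumption: an MMV number is "MMV" followed only by digits.
MMV_PATTERN = re.compile(r"^MMV\d+$")


def split_codes(cell):
    """Split one combined ID cell into (internal_codes, mmv_numbers)."""
    # Accept either commas or semicolons as separators and drop empty pieces.
    codes = [c.strip() for c in re.split(r"[;,]", cell) if c.strip()]
    mmv = [c for c in codes if MMV_PATTERN.match(c)]
    internal = [c for c in codes if not MMV_PATTERN.match(c)]
    return internal, mmv


# Hypothetical cell content, for illustration only.
internal, mmv = split_codes("PMY46; PMY58, MMV670767")
print(internal)  # ['PMY46', 'PMY58']
print(mmv)       # ['MMV670767']
```

Anything that does not match the MMV pattern lands in the internal list, so unusual codes would still need a manual look.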

alintheopen commented 8 years ago

I've added the Internal ID column. So we also need to add IUPAC name as a column too?

wvanhoorn commented 8 years ago

There is some redundancy:

lpatiny commented 8 years ago

You may easily test SMILES on http://www.cheminfo.org/Chemistry/Cheminformatics/Smiles/index.html

SMILES are not canonical, so indeed 2 different SMILES may represent the same structure and you may pick either of them. (It seems the Google doc was changed, because now both SMILES are identical.)
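Because two different SMILES strings can describe the same structure, a duplicate check is safer on a canonical identifier such as the InChIKey than on the SMILES text. The sketch below is only an illustration using the Python standard library. It assumes the sheet has been exported to rows with "ID" and "InChIKey" fields (hypothetical column names), and the key values shown are placeholders, not real InChIKeys.

```python
from collections import defaultdict


def find_duplicates(rows, key="InChIKey"):
    """Return {key value: [IDs]} for every key value shared by 2+ rows."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get(key, "").strip()
        if not value:
            continue  # rows without a key cannot be compared
        groups[value].append(row["ID"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}


# Placeholder keys for illustration; "EXAMPLE-ID" is a made-up row.
rows = [
    {"ID": "OSM-S-220", "InChIKey": "PLACEHOLDER-KEY-A"},
    {"ID": "OSM-S-317", "InChIKey": "PLACEHOLDER-KEY-A"},
    {"ID": "EXAMPLE-ID", "InChIKey": "PLACEHOLDER-KEY-B"},
]
print(find_duplicates(rows))
```

The same function can be pointed at any other column by changing `key`, for example to look for repeated MMV numbers.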

cdsouthan commented 8 years ago

1) Any progress on the compound registration system as a robust replacement for the Master Sheet (MS)?

2) I don't want to give anyone unnecessary work, but I do suggest using the Holy Quintet (SMILES, both InChIs and the IUPAC) in the MS and checking/locking down the round-tripping between them.

3) As @wvanhoorn points out, keep a close eye on referential integrity/consistency within the MS, of which duplicate checking is the minimum.

4) I suggest you document your processes of "mappings", which generally fall into three types:
a) systematic structural specifications that (on a good day) should algorithmically interconvert (i.e. the quintet) and be able to generate SD files and render images (i.e. to complete the Holy Heptet)
b) public database IDs (e.g. PubChem, ChEMBL, ChemSpider)
c) local database IDs (e.g. OSM and MMV).

When you line these up across your MS rows it's important to record how you are making the x-mappings. For example, on what basis do you decide OSM-x = MMV-y? You don't have MMV's registration system db from which to extract their local structural specifications (maybe ask them for secure WebServices access to do exactly this?). The public databases provide various mapping services you can use to generate and check the IDs (and to submit if they are not there), but it's best also to record exactly how this was done. Manual checking is OK but will be more error prone as you move beyond 500 cpds.
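One possible way to record how each cross-mapping was made is to keep a small provenance record next to every ID pair. The following Python sketch is only an assumption about what such a record could hold. The class `IdMapping`, its field names and the example values ("OSM-x", "MMV-y", the basis text) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdMapping:
    """One claimed equivalence between an OSM ID and another identifier."""
    osm_id: str
    other_id: str
    source: str      # where the other ID comes from, e.g. "MMV" or "PubChem"
    basis: str       # how the match was made, in free text
    checked_by: str  # who made or verified the match


def unexplained(mappings):
    """Return the mappings that have no recorded basis."""
    return [m for m in mappings if not m.basis.strip()]


# Hypothetical entries, for illustration only.
mappings = [
    IdMapping("OSM-x", "MMV-y", "MMV", "structure supplied by MMV, InChIKey compared", "someone"),
    IdMapping("OSM-x", "some-public-id", "PubChem", "", "someone"),
]
for m in unexplained(mappings):
    print(f"{m.osm_id} = {m.other_id} ({m.source}) has no recorded basis")
```

In a spreadsheet the same idea would simply be one extra "basis" column per mapped ID column.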

lpatiny commented 8 years ago

We are currently working on the lab notebook instead, because it seems to me that the spreadsheet is a straightforward way to process the data without any specialist knowledge, while many cheminformatics properties are calculated directly on cheminfo when loading the data. Actually InChI and InChIKey could also be calculated from the SMILES. What would be the main advantage of another system?

cdsouthan commented 8 years ago

I'm sure the team want as few systems as possible, but it's usual for a chem-bio ELN to hand off to a chemical registration system (e.g. https://www.chemaxon.com/products/compound-registration/) since the utilities diverge. However, we have chewed over the requirements at the beginning of this post already (and it looks embarrassingly like I have repeated myself on occasions...). If the Google Docs sheet delivers the team's needs, that's fine, but try to bake in the Quintet automatically. There will still be some columns that will have to be filled manually (e.g. MMV nos).

mattodd commented 8 years ago

Thanks for spotting the duplicates @wvanhoorn . Very useful to have this kind of integrity check.

@alintheopen is OSM-S-317 a very recent assignment? Can you check this out please, or ID the assigner? Could you also please verify that the four MMV compounds below have duplicates and delete one of each?

"OSM-S-220 and OSM-S-317 are duplicates: SMILES are different, InChIKey and structures are the same. Double entries for: MMV670767, MMV669543, MMV669850 and MMV669849"

alintheopen commented 8 years ago

OSM-S-317 replaced with new compound: InChI=1S/C12H6ClF3N4O/c13-9-5-17-6-10-18-19-11(20(9)10)7-1-3-8(4-2-7)21-12(14,15)16/h1-6H

ClC1=CN=CC2=NN=C(N21)C3=CC=C(C=C3)OC(F)(F)F

Ta @wvanhoorn

alintheopen commented 8 years ago

MMV670767, MMV669543, MMV669850 and MMV669849 are all indeed double entries. Duplicates have been deleted.

cdsouthan commented 8 years ago

Apols as obvious, but have you run the Excel commands for duplicate counting and highlighting on every column? N.b. synonym duplication, viz. OSM-S-62 PMY46; PMY58, OSM-S-63 PMY58
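The synonym case is harder to catch with plain column highlighting, because a cell can hold several codes. As a minimal sketch (Python standard library only, assuming synonyms in a cell are separated by semicolons or commas), the check below reports any synonym that appears under more than one OSM ID. The function name `shared_synonyms` is made up for illustration, and the two rows are the ones quoted above.

```python
import re
from collections import defaultdict


def shared_synonyms(rows):
    """Return {synonym: [OSM IDs]} for synonyms listed under 2+ OSM IDs."""
    owners = defaultdict(set)
    for osm_id, cell in rows:
        for synonym in re.split(r"[;,]", cell):
            synonym = synonym.strip()
            if synonym:
                owners[synonym].add(osm_id)
    return {s: sorted(ids) for s, ids in owners.items() if len(ids) > 1}


rows = [
    ("OSM-S-62", "PMY46; PMY58"),
    ("OSM-S-63", "PMY58"),
]
print(shared_synonyms(rows))  # {'PMY58': ['OSM-S-62', 'OSM-S-63']}
```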

alintheopen commented 8 years ago

No I haven't yet.
