OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

Meeting Discussion Point 6: Interacting with Data #393

Open mattodd opened 8 years ago

mattodd commented 8 years ago

Background to Meeting on May 24th 2016 (#386).

Question: The OSM consortium uses a number of different tools to disseminate and interact with project data - are these all effective?

Question: What new methods of collaborating on project data could be tried that would be accessible to the widest number of OSM participants? How can we eliminate barriers?

Resources:

OSM Series 4 summary wiki

Lab books: Labtrove, Labarchives (Ed Tse, Chase Smith). See the Biological Evaluation ELN in particular for this Issue.

Compound Database: All OSM Compounds in Google Sheet, interactive view on Cheminfo, direct download of Series 4 SDF.

Comments/suggestions below are welcome, as well as (briefly please) during the main webinar meeting on the 24th and its follow-up on June 8th (#394).

Relevant background:

Series 4 Summary (a wiki). Can be edited by anyone, and has revision history. Disadvantage: no public Dropbox of associated files, so people can't easily tweak.

OSM Compound Spreadsheet. Google Sheet. Advantage: can be edited by many people and used to generate other outputs, e.g. visualization on the fly by Cheminfo. A sister sheet describes the assays employed. Cheminfo can be used to generate an SDF (a downloadable version is available). See also Chris Swain's slides on data in OSM, including how to use Vortex and iPython to visualise the data.
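As a minimal sketch of how a CSV export of the sheet could feed other outputs, the snippet below parses CSV text into per-compound records using only the Python standard library. The column names (`OSM Code`, `SMILES`) and the sample rows are hypothetical placeholders, not the actual sheet layout.

```python
import csv
import io


def load_compounds(csv_text, id_col="OSM Code", smiles_col="SMILES"):
    """Parse CSV text into a list of dicts, skipping rows with no structure.

    The default column names are assumptions made for illustration only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    compounds = []
    for row in reader:
        smiles = (row.get(smiles_col) or "").strip()
        if not smiles:
            continue  # no structure recorded for this row
        compounds.append({"id": (row.get(id_col) or "").strip(), "smiles": smiles})
    return compounds


# Made-up sample standing in for a CSV export of the spreadsheet.
SAMPLE = """OSM Code,SMILES
EX-1,CCO
EX-2,
EX-3,c1ccccc1
"""

if __name__ == "__main__":
    for compound in load_compounds(SAMPLE):
        print(compound["id"], compound["smiles"])
```

Records in this shape could then be handed to whichever tool generates the SDF or the visualisation; that step is not shown here.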

To Do List and discussion area. Powerful platform that is open and well-supported, but no daily summary option yet? People do not always update this with the most current activity. Need an account to comment.

Labtrove Lab Books: Series 4 synthesis, biological evaluation. Fairly old platform, but itself open source and simple. Can't be structure-searched, and it is onerous to add cheminformatic strings to each entry. Is Google-indexed.

Labarchives Lab Books: (Ed Tse, Chase Smith). More advanced, still no structure searching. Not indexed by Google?

Luc Patiny trial ELN: being evaluated (OSM needs volunteers). Would "understand" chemical content.

OSM Landing Page: http://opensourcemalaria.org/ - auto-populated content only, intended to bring people in from Google searches.

Online Meetings: Recorded and placed on YouTube.

Lighter aspects of dissemination: Twitter used a lot, Facebook less so, LinkedIn not at all. GitHub has taken over from G+ as a discussion forum. There is a last-resort email address, and T-shirts can be bought, with funds going to the project.

drc007 commented 8 years ago

A quick comment. The Google doc spreadsheet is not a database, it is a simple spreadsheet. The big advantage is that it was trivial to set up and populate. There are of course several potential advantages to having a proper online, chemically intelligent database. If anyone would like to set one up that would be great, perhaps an honours student project? Would be a publicly accessible entry on your CV ;-)

cdsouthan commented 8 years ago

As mentioned before, no other route of chemical structure and data surfacing has the speed, global connectivity and mineability of PubChem BioAssay. Submissions to ChEMBL that eventually surface in PubChem are good but have disadvantages. So far, only 125/240 of the OSM structure list have exact matches (https://cdsouthan.blogspot.se/2015/05/entity-resolution-for-antimalarial.html). Some of the 115 missing structures are either being synthesised or waiting for data. However, there is a discoverability argument for having all structures in PubChem, even the virtual designs (but tagged as such). We could discuss this and, just for the record, we would be welcomed by PubChem.
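The coverage figure above amounts to a set comparison between the project's structure list and what is already registered. A minimal sketch, assuming each structure is keyed by some canonical identifier (an InChIKey, for example); the keys below are made-up placeholders, and how the list of registered keys is obtained is left out.

```python
def coverage(project_keys, registered_keys):
    """Split the project's structure keys into matched and missing lists."""
    project = set(project_keys)
    matched = project & set(registered_keys)
    missing = project - matched
    return sorted(matched), sorted(missing)


# Hypothetical identifiers, for illustration only.
project_keys = ["KEY-A", "KEY-B", "KEY-C", "KEY-D"]
registered_keys = ["KEY-B", "KEY-D", "KEY-Z"]

matched, missing = coverage(project_keys, registered_keys)
print(f"{len(matched)}/{len(project_keys)} matched, {len(missing)} missing")
```

Applied to the numbers quoted above, 240 structures with 125 exact matches leaves the 115 missing structures.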

mattodd commented 8 years ago

Yes, to my mind the Google Doc is an excellent primary source of data. Easy to input, easy to edit and is backed up. I think we're all making the assumption that the data can be ported into any number of other places, such as a proper database. But of course we all want there to be a single place where we enter data. @cdsouthan yes, I know you've made this good point before, and it remains a good point. My issue is just the word "submission", implying a manual process. I'm sure you've answered this before, but is there a way we can auto-submit from the Google Doc (or something derived automatically from the Google Doc that has the right format) into PubChem or ChEMBL? @madgpap That would be powerful.

cdsouthan commented 8 years ago

OK, let me know when first-activity results go into the GD sheet for newly synthesised Series 4 strucs that are PubChem -ve. I will then try to check out the improved assay submission system. This will allow us to assess the feasibility of auto-parsing (more or less Excel reformatting) for piping direct to PubChem BioAssay. However, in the first instance I will see how I get on with manual input.

https://pubchem.ncbi.nlm.nih.gov/upload/docs/upload_help_complete.html
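The "Excel reformatting" step mentioned above could look roughly like the following: select and rename spreadsheet columns into the layout a deposition template expects, and skip rows that have no activity data yet. Both the source column names and the target field names here are hypothetical; the real template fields would need to be taken from the upload help page linked above.

```python
import csv
import io

# Hypothetical mapping: sheet column -> column name in the target template.
COLUMN_MAP = {
    "OSM Code": "compound_id",
    "SMILES": "structure",
    "IC50 (uM)": "activity_value",
}


def reformat_rows(rows, column_map=COLUMN_MAP, required="IC50 (uM)"):
    """Rename columns per column_map, dropping rows with no value in `required`."""
    out = []
    for row in rows:
        if not (row.get(required) or "").strip():
            continue  # no activity result yet, nothing to submit
        out.append(
            {target: (row.get(src) or "").strip() for src, target in column_map.items()}
        )
    return out


def to_csv(records, fieldnames):
    """Serialise the reformatted records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Made-up rows standing in for the spreadsheet contents.
rows = [
    {"OSM Code": "EX-1", "SMILES": "CCO", "IC50 (uM)": "0.5"},
    {"OSM Code": "EX-2", "SMILES": "c1ccccc1", "IC50 (uM)": ""},
]
records = reformat_rows(rows)
print(to_csv(records, list(COLUMN_MAP.values())))
```

Keeping the mapping in one dictionary means the sheet stays the single place where data is entered, and only the mapping changes if the target format does.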