OpenSourceMalaria / OSM_To_Do_List

Action Items in the Open Source Malaria Consortium

Series 4 paper #502

Closed cdsouthan closed 7 years ago

cdsouthan commented 7 years ago

Any plans?

mattodd commented 7 years ago

Lots. Specifically: I think I concluded, after speaking about this in some detail with @greglandrum last week over a beer, that we need to write the paper on Github, by actually using Github properly. It seems to solve many problems we currently face in putting the paper together effectively (file sharing, managing edits, handling images and raw image files), while creating a slight barrier (we have to learn a couple of basic operational techniques around pull requests and what they mean). More on this as soon as I can. Taking the broader view, the impediment to this paper happening is my lack of time, and I need to take steps to remove that as the impediment. @edwintse is, in the background, assembling the experimental part, but (because of the nature of some of the operational issues we have) has not yet shared the file. That too needs fixing ASAP.

cdsouthan commented 7 years ago

Fine, no panic on the Titanic. I can engage by refreshing PubChem checks as and when (assuming I might have the honour and privilege of joining the merry band of authors again). I could even submit structures not already in there, even to BioAssay as well as just SIDs. As mentioned before, this should be arranged at least a month in advance of submitting the paper so that PubChem completes all its heavy-lifting relationship pre-computing (i.e. our leads against the rest of the 91 million, including 2D and 3D) that we can then just browse and report.

greglandrum commented 7 years ago

If/when you guys go down this route, feel free to ask if you want help on the git/github mechanics.

I'm planning on trying a similar experiment for a (still hypothetical) RDKit paper in the not-too-distant future.

cdsouthan commented 7 years ago

Not so sure about grappling with github as a team editing platform (who do we know who has already used it happily for this?) when there are simpler options such as http://f1000.com/work/ and others.

MFernflower commented 7 years ago

sorta unrelated but my real name is Anthony Sama

I'm so excited my name might be in a paper!

mrwns commented 7 years ago

I recently had the pleasure to participate in writing a review on github, with PullRequests, CI to build the pdf and check format violations, and the contributions written to a blockchain for timestamping. Maybe some of the techniques could also be useful here!

this is the repo: https://github.com/greenelab/deep-review

holeung commented 7 years ago

Has anyone tried ACS ChemWorx (https://hp.acschemworx.acs.org)? It's free and supposedly designed for collaborative writing. Being from ACS, I would hope it would have nice support for chemistry schemes and figures.

mattodd commented 7 years ago

Yes @mrwns - I saw the same thing when I met Casey at a conference. This, plus other conversations with @greglandrum convinced me we need to use Github better, as it seems to possess all the functionality we need. Previously I'd been unconvinced because the file-sharing functionality seemed to require someone to hold a master repo of all files on their local machine, to which everyone ultimately needed to sync, but it seems like that is no longer the case and we ought to be able to use Github as a kind of communal Dropbox, sharing files for the paper (e.g. .cdx files etc). Versioning is taken care of, and there is an automatic tracking of who has contributed what. Unless I'm missing something, it has everything we need. Anyone with lots of Github experience care to comment?

If we did do the move towards Github, we'd likely need some patient experts to shepherd us all a little at the outset, until we can draft up some simple how-to's.

cdsouthan commented 7 years ago

From recent comments I assume there is no impediment for drafting the S4P right now, the old fashioned way (pending a sophisticated Github workflow for S5P etc). Since this directory (OpenSourceMalaria/OSMSeries4Paper1) is 2 years old, suggest purge and re-start. A simple Google doc would do fine to get going with. No strong opinions but pitch towards JMedChem? (they will take sup dat as SMILES > activity file, NOBA gives us a fast track > SciFinder :)

mattodd commented 7 years ago

Knock yourself out at https://github.com/OpenSourceMalaria/OSMSeries4Paper1 if you want to see the current files and the current Word doc. As I mention over at #507 I think there's huge value in making this paper happen on a platform that is better than GDocs. I'm going to try to speak with @greglandrum and @miike about feasibility in the coming days, so wait a short while on this.

However, I see an easy compromise solution, too. One of the biggest components of the paper is the chemistry experimental file, consisting of molecules we've made as a community, and molecules we've inherited. It's very large. @edwintse and @alintheopen have been working on this off and on, but have not shared the file anywhere, I don't think. A possibility here is that we go Old School on this file, as a tester to see if this helps resolve remaining issues. Either sharing a Word file plus other things via Dropbox, or importing the document into GDocs. There will be issues with both these approaches for a document as large and complex as this for which there will be so many little iterations, special characters and embedded images. But it means we're working on the largest of our headaches, and of course as the most data-rich part of the paper it means we're also working directly on the chemical data management early on, re thinking about how we make sure the data are most discoverable. Thoughts please on this, and how best we collaborate on this bit in this way.

cdsouthan commented 7 years ago

Fine - but - our mooted 21st-century result collation/writing platform will still have to I/O into the 19th-century journal publication system (i.e. converted to a hamburger and not an outlink in sight). I suggest we start with a good old 19th-century to-do list on GD for prioritising what goes in, looking at it more from the outside (i.e. what would the readership and likely follow-on/up community most need; run a poll even?). It sounds like we have a lorra data, which means filtering and/or splitting. I took a look at the draft mentioned above, so that's at least a starting point. I guess the compound sheet is being updated and eventually the S4 bits could be dropped into figshare as sup dat with all the PubChem links live.

miike commented 7 years ago

Jumping in a little late here but I think I've caught up.

The first thing that jumps out is a need for a different format for the paper that's not a Microsoft Word document. There's a few issues with it at the moment

@mrwns has posted an example of writing a paper using Markdown. There's a lot of advantages of this (Github rendering, easy diff changes, being able to render and host on the web) but there may be some disadvantages:

For large files, Github supports Git Large File Storage (LFS). It's typically used for graphics, audio files etc. (quite popular for game development) but could be extended for something like this. It can help a lot in keeping repositories small so they can be cloned/downloaded quickly.

drc007 commented 7 years ago

I suspect that two of the most desirable requirements are, first, the ability to embed chemical structures within the document in a format such that they can later be imported back into the chemical drawing package and edited, and second, the ability to manage references. I had a look a while back and did not find anything suitable, but if anyone has done a recent review I'd be really interested.

cdsouthan commented 7 years ago

Who is in charge of the Crown Jewels these days (a.k.a. the Master List)?

https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618

There are various optimisations/enhancements we could contemplate as we move forward on the S4P (we could open this up as a new topic or leave it here). Three were already tweeted round: a) put a simple version date right up in the title, b) set up Google Analytics to count accesses as a findability metric, and c) add an auto-call-out to Google for the inner layer Key (e.g. https://www.google.com/search?q=ISCYIQSGKXBNIC) as an extra row in the sheet. This one nicely whacks Kimberly's lab book, which gives the @MedChemProf team structures at least some findability (pending a PubChem submission). The bad news is that this means the Master List is not being crawled by Google.
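A minimal sketch of how the Google call-out in c) could be generated, assuming the sheet's keys are available as plain strings. The function name `google_lookup_url` is hypothetical; it only builds the same kind of URL as the example above and does not touch the sheet itself.

```python
from urllib.parse import urlencode


def google_lookup_url(inchikey: str) -> str:
    """Build a Google search URL for the first block of an InChIKey.

    Accepts either a full InChIKey or just its first (inner layer) block.
    """
    # Everything before the first hyphen is the first block of the key.
    inner_key = inchikey.strip().split("-")[0]
    return "https://www.google.com/search?" + urlencode({"q": inner_key})


if __name__ == "__main__":
    # Key taken from the example URL in this thread.
    print(google_lookup_url("ISCYIQSGKXBNIC"))
```

Each generated URL could then be pasted (or computed) alongside its compound so a reader can check findability with one click.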

For the record, today's pop (3rd July) of 298 S4 InChIKeys from the Master List gave 154 matches in PubChem (live list at https://www.ncbi.nlm.nih.gov/sites/myncbi/christopher.southan.1/collections/52883952/public/). Only 120 of these are over 300 Mw, so I assume the rest are intermediates rather than from the SAR series. This ties in with the 118 coming in via ChEMBL in 2014 (https://www.ebi.ac.uk/chembl/doc/inspect/CHEMBL3137547) and eventually as a PubChem BioAssay (https://pubchem.ncbi.nlm.nih.gov/bioassay/1079930). There are aspects of all this we could discuss, but let's get the basic enhancements above going first (including getting the list crawled).

drc007 commented 7 years ago

@cdsouthan "Who is in charge of the Crown Jewels these days (a.k.a. the Master List)?" We are a very egalitarian group, so no-one is in charge. Although it does appear there are Google doc preferences somewhere that limit some users.

drc007 commented 7 years ago

@cdsouthan I'm not an expert with Google docs but the top row should now show when the sheet was last updated. There may be better ways to do it.

drc007 commented 7 years ago

Just came across this. "Google and other search engine bots are completely blocked (via robots.txt) from indexing content hosted on Google Docs even if the owner has made his or her documents public." Any thoughts?
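One way to test a claim like this is to run a URL against a set of robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the rules in `EXAMPLE_ROBOTS_TXT` and the example URLs are hypothetical placeholders, not the actual file served by Google, and `is_crawlable` is a name made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, NOT the real Google Docs robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /spreadsheets/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Placeholder URL on the same host as the Master List.
    print(is_crawlable(EXAMPLE_ROBOTS_TXT,
                       "https://docs.google.com/spreadsheets/d/example/edit"))
```

To check the real situation, the text of the live robots.txt for the host would be passed in instead of the placeholder rules.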

cdsouthan commented 7 years ago

In terms of "who's in charge" I expected an answer on those lines! The egalitarianism here is of course a very positive and important aspect of the OSM enterprise. Nonetheless, I think a central result master list for an increasingly complex project should have governance in the form of one or at most a few experienced individuals who make the effort and take responsibility (and credit of course) for a) updates, b) cross-checking of collated data entries (e.g. by two people) and c) various other housekeeping duties such as round-tripping checks, assigning new IDs where necessary, checking for gaps, requesting assay re-tests, highlighting front-runners etc. Logically, whoever is chosen as esteemed first (or equal-first) author(s) on the S4P could be the ones "in charge". AWAK none of this will completely prevent errors, but experience tells us the likelihood of these creeping in scales approximately with how many folk have edit access to such a sheet.

mattodd commented 7 years ago

@cdsouthan - we have to make the most of what resources we have. If things need doing, people should propose these here for checking and then go ahead and do them themselves if they have the capacity. In the absence of salaried staff dedicated to the task, this is the best solution. If someone repeatedly demonstrates proficiency and has repeated capacity to do those things, they are de facto manager of that aspect of things. I know it's unusual. Wikipedia has recognition of community super-users through various mechanisms. We could adopt something like this in theory if it simplifies OSM's structure and expectations, but titles are like cheese - they can become toxic if not refreshed through regular re-evaluation.

@drc007 - if the Google Master sheet is not indexed by Google, then that's surprising and unfortunate. Does it need to be mirrored somewhere that is indexed?

mattodd commented 7 years ago

But further to your excellent points above, @cdsouthan , I'd suggest splitting each into a single manageable task and posting a separate issue that can be resolved through specific actions. Very useful. Yes, there are synthetic intermediates in the Master List.

cdsouthan commented 7 years ago

OK, issues mutually understood (and ta for the compliment) but you raised an extra factor. It's none of our business as to exactly who is on the payroll for what in Sydney, but if there are funders involved (even for collaborators) they should be expecting (mandating even?) a data management plan, of which the master-sheet has become a de facto core. Thus, putting a salaried person "in charge" of that (with due kudos) makes sense.

On more pragmatic aspects, the simplest fix for getting crawled is to replicate the sheet as a web page anywhere on the OSM site (and then Google test it). I can also put the S4s up on my blog sometime where they get indexed before I have finished posting.
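As a rough sketch of the "replicate the sheet as a web page" idea, assuming the sheet has been exported as CSV text: the function below turns it into a plain HTML table that could be dropped into any crawlable page. The function name `csv_to_html_table` and the column names in the demo are hypothetical, not the Master List's actual layout.

```python
import csv
import html
import io


def csv_to_html_table(csv_text: str) -> str:
    """Convert CSV text (first row = header) into a simple HTML table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"
    header, body = rows[0], rows[1:]
    lines = ["<table>"]
    # Escape every cell so characters such as < or & cannot break the markup.
    lines.append("  <tr>" + "".join(f"<th>{html.escape(c)}</th>" for c in header) + "</tr>")
    for row in body:
        lines.append("  <tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder columns and ID; only the key comes from this thread.
    print(csv_to_html_table("ID,InChIKey\nEXAMPLE-1,ISCYIQSGKXBNIC\n"))
```

Plain text in a static table like this is the kind of content a crawler can read directly, which is the point of the mirror.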

Yes, I can collate issues for the master list (shall I do that here or move to a new topic?) but it would be good if the putative senior authors for the S4 paper could start fleshing out their outlines in parallel, since we must obviously synch the requirements of the former to the latter.

drc007 commented 7 years ago

@mattodd @cdsouthan I suggest we move issues about indexing, master list etc to a new topic and leave this topic to the Series 4 paper.

cdsouthan commented 7 years ago

Fine

mattodd commented 7 years ago

Closing since paper writing can begin as described in #532