Open miike opened 7 years ago
I'm not sure that is the solution, see https://www.labnol.org/internet/your-google-docs-in-search-engines/9590/
"because Google and other search engine bots are completely blocked (via robots.txt) from indexing content hosted on Google Docs even if the owner has made his or her documents public. "
I guess we want public read access, but project members read/write/edit?
I think that link is just out of date. The current robots.txt doesn't have any explicit disallows for sheets.
Yes on the public read access (which will fall under 'Public on the web - Anyone on the Internet can find and view') and then adding write access to project members.
OK, I've set the document to..
@cdsouthan can you see if it is now indexable?
I don't think the "Last Updated" field is working the way we want, as far as I can tell it is updated everytime anyone accesses the sheet, not when it is updated. I think a better solution would be to add an updated date column, and then have a script that checks when a row is edited. something like https://webapps.stackexchange.com/questions/37408/auto-updating-column-in-google-spreadsheet-showing-last-modify-date
However, I've never done this sort of scripting...
Arguably the most important parts of the sheet we need the Google bot to scrape are the InChIKeys (see https://www.ncbi.nlm.nih.gov/pubmed/23399051)
This can be used as a test query https://www.google.com/search?q=ISCYIQSGKXBNIC
its from
The one and only Google pop result is
As of now (11.17 CET) its not hitting the sheet ipso facto its not crawled yet
JFTR who are you @miike ?
@drc007 Does the revision history (File => See revision history) do this already? I think one caveat is that it appears that only people with certain permissions (>view?) can view revision history.
@miike I think the desire was that anyone could see if there had been any updates, and ideally what had been updated? The suggestion was to get people to amend the updated date but I can see people forgetting to do this.
@miike @cdsouthan I've linked the Google doc on a page on my website and then submitted the page for Google indexing, let's see if it turns up.
@drc007 Great, hopefully it'll appear in the search index within 24-48 hours. I'm not sure how Google works out the priority in terms of indexing docs.google.com
content.
Re: logging updates - this is definitely possible using Google Apps scripts but can be a little awkward as you've got to determine the logging format etc. It may be easier to cheat and use Tools => Notification rules to send an email to an inbox that posts automatically somewhere (maybe even back to the repository?).
@cdsouthan I did some work with @mattodd on OSM around Labtrove and some automation components many moons ago during my undergrad chemistry at Sydney. I now work in the software/data analytics space.
OK @miike (I was hoping you were still on the payroll so @mattodd could appoint you as Master List Master :) JFTR I was not advocating anything sophisticated on the versioning side. Simply that anyone viewing and downloading the sheet for local analysis (or paper writing) would just be able to see, if they came back a month later, that there were a few new compound rows, rather that flagging exactly which cells had changed. For the privileged editors this could be simply getting into the habit (i.e. not forgetting) of writing a new date in the title when new structures or results are entered.
@cdsouthan I've got a test script that I'll upload into the spreadsheet shortly. It currrently outputs
edit time (UTC) | edit time (local, Sydney) | cell edited | value added
anytime anyone makes any modifications in the first column. Is there anything else that needs to be added? Caveat: it doesn't seem like it's possible to get the edit user easily due to sandbox permission issues.
@miike Sounds like just what was wanted
Good, @miike lets see when @alintheopen and the other team chemists pop the next strucs in
BTW - any eventual chance of an inner InChIKey Google call-out row? e.g.
https://www.google.com/search?q=ISCYIQSGKXBNIC
While this is still not self-hitting the latest sheet (over 24 hours on) note this older lead hits an older version of a GD sheet as well as PubChem and my blog (but beware the lower hits are dark-side re-cycling of my blog posts)
This can be used as a test query https://www.google.com/search?q=ISCYIQSGKXBNIC
@cdsouthan @miike I'm not seeing any hits?
Well, whadaya know @cdsouthan @miike - yup the result from Wed https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/511#issuecomment-312824575 has dissapeared ! Only thing I can think of is that the Chase lab have (now) blocked the bot
I've added in the function for Edit History on the spreadsheet so any edits in the first column of the first spreadsheet get logged to the second spreadsheet 'Edit History'. There seems to be a known problem with double triggering so I may need to look into implementing locks.
@cdsouthan I have not made any recents edits or changes to sheet. No changes to my ELN also. https://mynotebook.labarchives.com/share/MCPHS%2520MedChem/MzgzLjV8MzY3Ny8yOTUvVHJlZU5vZGUvMjYyNTI2OTc5OXw5NzMuNQ==
Thanks @MedChemProf - so that's just another example of good old Google capriciousness ........
This appears to be more consistent (this week at least..)
https://www.google.com/search?q=PMIWBIXSAYKRGF
and so indicates the InChIKey Google pops still have general utility
But just to bamboozle us (Friday 13:00)
@cdsouthan @miike @mattodd I've been doing a few experiments. I looked at a handful of the records and I can't find any hits using Google. However using Bing I get a number of hits, including google docs sources.
I find more hits when I use InChI e.g.
InChI=1S/C20H16F2N4O3S/c1-30(27,28)15-5-3-14(4-6-15)20-25-24-18-11-23-12-19(26(18)20)29-9-8-13-2-7-16(21)17(22)10-13/h2-7,10-12H,8-9H2,1H3
search using Bing finds many hits including https://docs.google.com/spreadsheets/d/1VwAvTA0VfRVBMZ9WqEMPaG3YF7-SqBAgjS95NkXJkaE/edit#gid=0
Google finds only single hit.
How do we tell Bing to explicitly index OSM pages?
@cdsouthan @miike @mattodd OK, found out how to submit to Bing, let's see how it turns out.
I think we ( @miike @drc007 @mattodd et al) need to step back a little to consider what we are trying to achieve in the specific case here of getting (ideally only the latest version) of the OSM Master List (ML) crawled/indexed by the major search engines. This is related to the wider issues of getting ELNs globally exposed and at least partially interconnected, as we wrote about in our SCINDR article https://riojournal.com/article/9995/ but we unfortunately now have to pursue in its absence :(
The basic utility is to be surfaced/findable/discoverable by human interested parties who may not know what we do but could contribute, even in small ways such as just getting the word around. While Bing has some advantages over Google (including being "cleaner") we have to focus on the former as its the default for most folk. For discoverability AWAK the problem is the restriction to string match. This means interested parties have to know exactly they are looking for in the first place, e.g. in terms of an OSM ID, MMV ID, or a connectivity layer InChIKey (given that SMILES and InChI strings give only partial unreliable matches most of the time and full InChIKeys can bring back false-positives). Thus de-novo discoverability via Google is somewhat illusory and global chemical similarity discoverability is only realistically achievable via PubChem (for which we have a lag problem).
Notwithstanding, there is utility for the project in getting the ML crawled by the Googlebot and I think is worth aiming for (and credit to the efforts being made above). The main utility is probably for those inspecting the ML to quickly get synthetic details without have to know where to navigate to/dig it out of the different team ELNs. The connectivity keys from these recent @MedChemProf outputs are good examples.
ISCYIQSGKXBNIC (G -ve but Bing +ve!)
WXNSEGHVFXDZGF (G+ve)
SMRNZXNRFZNBAW (G +ve)
There are also informative paradoxes such as YFUIAROYVNJDHR having 2 OSM-only hits. The corresponding MMV670763 ID has 27 hits but also OSM-only, indicating not only that some chemists are not adding InChIKeys to their molecular definitions but also that the MMV primary repositories are not being surfaced/crawled (presumably because they include proprietary structures). In addition the hits in this case indicate most surfacings are secondary in coming from http://onsnetwork.org/about-onsnetwork-org/ via the recycling of OSM pages. NOBA, PubChem is negative for the ID and the InChIKey
@cdsouthan @miike As far as I can tell the google doc spreadsheet is not being indexed at present. I can find multiple pages that link to the document using both Google and Bing, but no hits for the actual spreadsheet itself.
Quickest solution is for @drc007, myself, or anyone in fact, to paste the latest version into a (crawled) blog page with an associated back-pointer. Its not ideal to make "secondary surfacings" in this way but we have to live with this anway, including old, bot-cloned and different lists (I plead guilty from my old blog posts)
@miike @cdsouthan @mattodd Should we think about having a regular monthly/weekly export to a web page?
Good idea @drc007. Monthly sounds about right (with skipping if no new data)
@miike @cdsouthan @mattodd OK I've created a page by hand, we can look at automating if this solves the issue with indexing.
If you could give it some publicity it might help.
Quick work @drc007. Yes I can get word around but lets monitor for a while first. Looks like they are just coming though :) Could I ask for QPQ if you could be so kind as to discretely mention/link PMID: 23399051 in your post? I'll probably tweet/LinkedIN next wed (peak time...)
@cdsouthan Done ;-)
Appreciated, @drc007 but to exceed the beer-owing threshold any chance of migrating the mention to the top of the page rather that somewhat buried below the big list?
Can do later. BBQ takes priority at present :-)
@cdsouthan Now edited
Interestingly the section down the bottom of the wiki pages was meant to achieve what is being described above. Woefully manual, but it ought to be picked up? If we migrate wiki content to Github might it be easier to export the Master List strings to a page there, automatically?
It does seem curious to have to do this. Is there not a way to "mirror" a spreadsheet somewhere else that is indexed?
I confess to being bamboozled by, at least for InChIKey matches, what Google surfaces, from where and when or why. However, despite these grumbles nothing else can tell us exactly what's where in 0.3 secs (PubChem Entrez times out almost as often as it returns a result these days ). For example, we can pop the first Key from the wiki pages and see the below that includes an older GD sheet. N.b given your work to make the wiki pages nicely readable @mattodd please could you just put the date right at the top when you last edited it?
But wouldn't it be nice if we had a small - 'erhem - database - where we put all the structures and results with date stamps (even just manually) when they were entered and appropriate URLs to metadata?
JFTR
But wouldn't it be nice if we had a small - 'erhem - database - where we put all the structures and results with date stamps (even just manually) when they were entered and appropriate URLs to metadata?
Sorry, don't know what you mean? If we adopted an ELN like @lpatiny 's then searchability of structures would be solved, and presumably there would then be a way of compiling all the data into i) a compendium of synthesis data ("Show me all the attempts at making molecule X") and ii) a compendium of biological data (RDF triples, or something, of compound linked to property Y).
Is that what you mean by "database"? If not, then the following is off-topic:
The second of those (compendium of biological data) is harder. At the moment, manual entry of the biological data into the Master List seems to be the only way to handle things. But I wonder if we shouldn't be making it easier to do this by creating a Google Form so that anyone can:
a) Add a molecule to the Master List b) If the molecule exists in the Master List already, one is alerted to that fact and one can add data to the entry c) If the molecule does not exist already, an entry is created, along with the relevant strings and an OSM identifier can be created. Biological data can be added. (manual entry of strings, and creation of an OSM number, are trivial yet still a headache).
The reason why this is hard is that all the biological data that we might want to enter is super-context-specific, by which I mean each time bio data is entered, there also needs to be a link to a relevant assay and a page containing the raw data. No easy solve for that.
Happy to defer discussion of this to another time and place, and revert back to what we're meant to be discussing, which is how to ensure the Master List is crawled.
It does seem curious to have to do this. Is there not a way to "mirror" a spreadsheet somewhere else that is indexed?
Am I missing something - in Google Sheets, don't you just go "File -> Publish to the Web" and keep "Published content & settings -> Automatically republish when changes are made" checked? Could set up a URL redirect or similar from a more reader friendly URL and make sure that is crawled?
@mcoster That is what we have done, and the page that directs to the Google sheets has been indexed by Google. However it may be that Google does not index google docs themselves.
Wait, now I'm sitting in the same confused train that @mcoster is on. If there is a public page that is mirroring the sheet, and if that is being indexed, as you say, then...? Why are we not all happy? Is it because a visitor to the page that was indexed by Google cannot navigate back to the source? (In answer to an earlier observation re Bing. We need to aim for Google. If Google's not indexing our molecules we have an issue to solve).
@mattodd The issue is that whilst we can have a public page that mirrors the google docs spreadsheet it currently has to be created by hand. This raises a couple issues.
I used Bing to show that pages were visible to a search engine, even if Google did not find them.
I think we can sleep at night, as exploratory Google pops now indicate a high incidence of findability somewhere or other. . However @drc007 may not be blissfully happy in that that he now has to remember/be most kindly prompted to refresh the mirror sheet now and again. Given the currently modest acquisition rate of new structures and results this would not seem to onerous (a monthly check?). Future automation of structure/results/data capture/PubChem submissions etc etc .. is a different topic
OK. And if someone does find that page via a Google search is it obvious where to go to for more? i.e. to find the source? We can include the need for a periodic refresh in any future "Tech-Ops" page.
@mattodd The page is here http://www.cambridgemedchemconsulting.com/news/index_files/c0543b419406d9e3eb6c6396a816012f-278.html#unique-entry-id-278
It will be automatically reindexed by Google and Bing if updated.
It still doesn't look like Google has indexed the document on Gdocs which is shame. What do people think about mirroring the master list on Github as a markdown table? (we'd have this setup to be automated so it'd pull every 24 hours or so) This would have limited interactivity but will provide something visually similar and has a much higher likelihood of getting a higher priority from Google for indexing.
OSM Number | Internal ID | PubChem CID | MMV ID | SMILES | InChI | IChI Key | Series | Assays | PfaI EC50 (Inh) | Pfal IC50 (GSK) | Pfal IC50 (Syngene) | Pfal IC50 (Dundee) | Pfal IC50 (Avery) | Pfal (K1) IC50 (Avery) | Pfal IC50 (Ralph) | Pfal IC50 (Guy) | Pfal (K1) IC50 (Guy) | Pfal IC50 (Batra) | Pfal (K1) IC50 (Batra) | Pfal (3D7) IC50 (Broad) | Pfal (Dd2) IC50 (Broad) | PfaI EC50 uMol (Mean) Qualifier | PfaI EC50 uMol (Mean) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OSM-A-1 | 1233035 | O=C(/C(S/1)=C/C2=C(C)N(C(C)=C2)C3=CC=C(S(=O)(N)=O)C=C3)NC1=N\C4=CC=CC=C4 | InChI=1S/C22H20N4O3S2/c1-14-12-16(15(2)26(14)18-8-10-19(11-9-18)31(23,28)29)13-20-21(27)25-22(30-20)24-17-6-4-3-5-7-17/h3-13H,1-2H3,(H2,23,28,29)(H,24,25,27)/b20-13- | ODRSSOQWOHNABY-MOSHPQCFSA-N | 1 | 3.05 | 4.379 | 3.7145 | |||||||||||||||
OSM-A-10 | O=C([H])C1=C(C)N(C(C)=C1)C2=CC(OC)=CC=C2 | InChI=1S/C14H15NO2/c1-10-7-12(9-16)11(2)15(10)13-5-4-6-14(8-13)17-3/h4-9H,1-3H3 | XTPVHYRCFNPCRL-UHFFFAOYSA-N | 1 | #DIV/0! |
JFTR @mcoster your excellent video https://www.youtube.com/watch?v=X_tREXdHRE4 is relevant to a lot of above
@miike I suspect the reason for G not indexing (or at least taking a long time to) what OSM puts on GD is the same as for anything else. There is very little traffic, inlinks or outlinks. By comparison @drc007 's http://www.cambridgemedchemconsulting.com/ gets a lot of traffic, and has many links, ipso facto gets crawled more. NOBA https://cdsouthan.blogspot.se/ also gets crawled for new posts very quickly
But, my experiment with updating an older post with an extract from the ML took much longer (and CMCC outranks me)
Note also a certain irony with the latest surfacings from the pop below. The activity right here on this issue has not only been crawled but (presumably because of the traffic and outlinks) the mention at the top of this thread now gets the highest ranking!
I would therefore not bother doing much else just now, beyond promising @drc007 a beer for the occasional mirroring.
The answer to @mattodd 's Q "if someone does find that page via a Google search is it obvious where to go to for more? i.e. to find the source?" strictly speaking, is no, since there are no metadata outlinks on the master sheet. Notwithstanding, there are for most InChIKey or code number pops enough matches to various ELNs for someone be able to drill down, or at least find out who to call. Note also, since all this depends on exact matches, the person doing the pop, by definition, has already found a starting point.
JFTR @mcoster your excellent video https://www.youtube.com/watch?v=X_tREXdHRE4 is relevant to a lot of above
Thanks, I've been loosely following this Issue and been loaded with teaching the last few days. As a newcomer, it is taking me a while to get to grips with how current issues fit in the overall context, and I haven't been involved in the Master list yet. From the standpoint of training my project students how to write their ELN entries, I will be asking them to paste in InChI strings, InChIKeys and SMILES - seems like having all three will be good for discoverability.
@miike what does this master sheet edit history mean? Lots of cell changes (what type?) in early Aug but duplicated additions on 4th? So this is the Aug 4th 15.46 version right?
As discussed in #502
@mattodd @drc007 I believe this is as the spreadsheet in question has sharing settings set to 'Anyone with the link can view' when it should be 'Public on the web' instead so that Google will index the document.
Some sheets that have this setting at the moment (public on the web) can be found on Google by searching for
"OSM-S-109" site:docs.google.com
including the quotes - should yield a single result.