Google does not index OSM Number Google Sheet

miike commented 7 years ago

As discussed in #502

@mattodd @drc007 I believe this is as the spreadsheet in question has sharing settings set to 'Anyone with the link can view' when it should be 'Public on the web' instead so that Google will index the document.

Some sheets that have this setting at the moment (public on the web) can be found on Google by searching for "OSM-S-109" site:docs.google.com including the quotes - should yield a single result.

drc007 commented 7 years ago

I'm not sure that is the solution, see https://www.labnol.org/internet/your-google-docs-in-search-engines/9590/

"because Google and other search engine bots are completely blocked (via robots.txt) from indexing content hosted on Google Docs even if the owner has made his or her documents public. "

I guess we want public read access, but project members read/write/edit?

miike commented 7 years ago

I think that link is just out of date. The current robots.txt doesn't have any explicit disallows for sheets.

Yes on the public read access (which will fall under 'Public on the web - Anyone on the Internet can find and view') and then adding write access to project members.

drc007 commented 7 years ago

OK, I've set the document to..

@cdsouthan can you see if it is now indexable?

drc007 commented 7 years ago

I don't think the "Last Updated" field is working the way we want, as far as I can tell it is updated everytime anyone accesses the sheet, not when it is updated. I think a better solution would be to add an updated date column, and then have a script that checks when a row is edited. something like https://webapps.stackexchange.com/questions/37408/auto-updating-column-in-google-spreadsheet-showing-last-modify-date

However, I've never done this sort of scripting...

cdsouthan commented 7 years ago

Arguably the most important parts of the sheet we need the Google bot to scrape are the InChIKeys (see https://www.ncbi.nlm.nih.gov/pubmed/23399051)

This can be used as a test query https://www.google.com/search?q=ISCYIQSGKXBNIC

its from

The one and only Google pop result is

As of now (11.17 CET) its not hitting the sheet ipso facto its not crawled yet

JFTR who are you @miike ?

miike commented 7 years ago

@drc007 Does the revision history (File => See revision history) do this already? I think one caveat is that it appears that only people with certain permissions (>view?) can view revision history.

drc007 commented 7 years ago

@miike I think the desire was that anyone could see if there had been any updates, and ideally what had been updated? The suggestion was to get people to amend the updated date but I can see people forgetting to do this.

drc007 commented 7 years ago

@miike @cdsouthan I've linked the Google doc on a page on my website and then submitted the page for Google indexing, let's see if it turns up.

miike commented 7 years ago

@drc007 Great, hopefully it'll appear in the search index within 24-48 hours. I'm not sure how Google works out the priority in terms of indexing docs.google.com content.

Re: logging updates - this is definitely possible using Google Apps scripts but can be a little awkward as you've got to determine the logging format etc. It may be easier to cheat and use Tools => Notification rules to send an email to an inbox that posts automatically somewhere (maybe even back to the repository?).

@cdsouthan I did some work with @mattodd on OSM around Labtrove and some automation components many moons ago during my undergrad chemistry at Sydney. I now work in the software/data analytics space.

cdsouthan commented 7 years ago

OK @miike (I was hoping you were still on the payroll so @mattodd could appoint you as Master List Master :) JFTR I was not advocating anything sophisticated on the versioning side. Simply that anyone viewing and downloading the sheet for local analysis (or paper writing) would just be able to see, if they came back a month later, that there were a few new compound rows, rather that flagging exactly which cells had changed. For the privileged editors this could be simply getting into the habit (i.e. not forgetting) of writing a new date in the title when new structures or results are entered.

miike commented 7 years ago

@cdsouthan I've got a test script that I'll upload into the spreadsheet shortly. It currrently outputs

edit time (UTC) | edit time (local, Sydney) | cell edited | value added

anytime anyone makes any modifications in the first column. Is there anything else that needs to be added? Caveat: it doesn't seem like it's possible to get the edit user easily due to sandbox permission issues.

drc007 commented 7 years ago

@miike Sounds like just what was wanted

cdsouthan commented 7 years ago

Good, @miike lets see when @alintheopen and the other team chemists pop the next strucs in

BTW - any eventual chance of an inner InChIKey Google call-out row? e.g.

https://www.google.com/search?q=ISCYIQSGKXBNIC

While this is still not self-hitting the latest sheet (over 24 hours on) note this older lead hits an older version of a GD sheet as well as PubChem and my blog (but beware the lower hits are dark-side re-cycling of my blog posts)

https://www.google.com/search?q=PMIWBIXSAYKRGF

drc007 commented 7 years ago

This can be used as a test query https://www.google.com/search?q=ISCYIQSGKXBNIC

@cdsouthan @miike I'm not seeing any hits?

cdsouthan commented 7 years ago

Well, whadaya know @cdsouthan @miike - yup the result from Wed https://github.com/OpenSourceMalaria/OSM_To_Do_List/issues/511#issuecomment-312824575 has dissapeared ! Only thing I can think of is that the Chase lab have (now) blocked the bot

miike commented 7 years ago

I've added in the function for Edit History on the spreadsheet so any edits in the first column of the first spreadsheet get logged to the second spreadsheet 'Edit History'. There seems to be a known problem with double triggering so I may need to look into implementing locks.

MedChemProf commented 7 years ago

@cdsouthan I have not made any recents edits or changes to sheet. No changes to my ELN also. https://mynotebook.labarchives.com/share/MCPHS%2520MedChem/MzgzLjV8MzY3Ny8yOTUvVHJlZU5vZGUvMjYyNTI2OTc5OXw5NzMuNQ==

cdsouthan commented 7 years ago

Thanks @MedChemProf - so that's just another example of good old Google capriciousness ........

This appears to be more consistent (this week at least..)

https://www.google.com/search?q=PMIWBIXSAYKRGF

and so indicates the InChIKey Google pops still have general utility

But just to bamboozle us (Friday 13:00)

capture

drc007 commented 7 years ago

@cdsouthan @miike @mattodd I've been doing a few experiments. I looked at a handful of the records and I can't find any hits using Google. However using Bing I get a number of hits, including google docs sources.

I find more hits when I use InChI e.g.

InChI=1S/C20H16F2N4O3S/c1-30(27,28)15-5-3-14(4-6-15)20-25-24-18-11-23-12-19(26(18)20)29-9-8-13-2-7-16(21)17(22)10-13/h2-7,10-12H,8-9H2,1H3

search using Bing finds many hits including https://docs.google.com/spreadsheets/d/1VwAvTA0VfRVBMZ9WqEMPaG3YF7-SqBAgjS95NkXJkaE/edit#gid=0

Google finds only single hit.

How do we tell Bing to explicitly index OSM pages?

drc007 commented 7 years ago

@cdsouthan @miike @mattodd OK, found out how to submit to Bing, let's see how it turns out.

cdsouthan commented 7 years ago

I think we ( @miike @drc007 @mattodd et al) need to step back a little to consider what we are trying to achieve in the specific case here of getting (ideally only the latest version) of the OSM Master List (ML) crawled/indexed by the major search engines. This is related to the wider issues of getting ELNs globally exposed and at least partially interconnected, as we wrote about in our SCINDR article https://riojournal.com/article/9995/ but we unfortunately now have to pursue in its absence :(

The basic utility is to be surfaced/findable/discoverable by human interested parties who may not know what we do but could contribute, even in small ways such as just getting the word around. While Bing has some advantages over Google (including being "cleaner") we have to focus on the former as its the default for most folk. For discoverability AWAK the problem is the restriction to string match. This means interested parties have to know exactly they are looking for in the first place, e.g. in terms of an OSM ID, MMV ID, or a connectivity layer InChIKey (given that SMILES and InChI strings give only partial unreliable matches most of the time and full InChIKeys can bring back false-positives). Thus de-novo discoverability via Google is somewhat illusory and global chemical similarity discoverability is only realistically achievable via PubChem (for which we have a lag problem).

Notwithstanding, there is utility for the project in getting the ML crawled by the Googlebot and I think is worth aiming for (and credit to the efforts being made above). The main utility is probably for those inspecting the ML to quickly get synthetic details without have to know where to navigate to/dig it out of the different team ELNs. The connectivity keys from these recent @MedChemProf outputs are good examples.

ISCYIQSGKXBNIC (G -ve but Bing +ve!)
WXNSEGHVFXDZGF (G+ve) SMRNZXNRFZNBAW (G +ve)

There are also informative paradoxes such as YFUIAROYVNJDHR having 2 OSM-only hits. The corresponding MMV670763 ID has 27 hits but also OSM-only, indicating not only that some chemists are not adding InChIKeys to their molecular definitions but also that the MMV primary repositories are not being surfaced/crawled (presumably because they include proprietary structures). In addition the hits in this case indicate most surfacings are secondary in coming from http://onsnetwork.org/about-onsnetwork-org/ via the recycling of OSM pages. NOBA, PubChem is negative for the ID and the InChIKey

drc007 commented 7 years ago

@cdsouthan @miike As far as I can tell the google doc spreadsheet is not being indexed at present. I can find multiple pages that link to the document using both Google and Bing, but no hits for the actual spreadsheet itself.

cdsouthan commented 7 years ago

Quickest solution is for @drc007, myself, or anyone in fact, to paste the latest version into a (crawled) blog page with an associated back-pointer. Its not ideal to make "secondary surfacings" in this way but we have to live with this anway, including old, bot-cloned and different lists (I plead guilty from my old blog posts)

drc007 commented 7 years ago

@miike @cdsouthan @mattodd Should we think about having a regular monthly/weekly export to a web page?

cdsouthan commented 7 years ago

Good idea @drc007. Monthly sounds about right (with skipping if no new data)

drc007 commented 7 years ago

@miike @cdsouthan @mattodd OK I've created a page by hand, we can look at automating if this solves the issue with indexing.

http://www.cambridgemedchemconsulting.com/news/index_files/c0543b419406d9e3eb6c6396a816012f-278.html#unique-entry-id-278

If you could give it some publicity it might help.

cdsouthan commented 7 years ago

Quick work @drc007. Yes I can get word around but lets monitor for a while first. Looks like they are just coming though :) Could I ask for QPQ if you could be so kind as to discretely mention/link PMID: 23399051 in your post? I'll probably tweet/LinkedIN next wed (peak time...)

capture

drc007 commented 7 years ago

@cdsouthan Done ;-)

cdsouthan commented 7 years ago

Appreciated, @drc007 but to exceed the beer-owing threshold any chance of migrating the mention to the top of the page rather that somewhat buried below the big list?

drc007 commented 7 years ago

Can do later. BBQ takes priority at present :-)

drc007 commented 7 years ago

@cdsouthan Now edited

mattodd commented 7 years ago

Interestingly the section down the bottom of the wiki pages was meant to achieve what is being described above. Woefully manual, but it ought to be picked up? If we migrate wiki content to Github might it be easier to export the Master List strings to a page there, automatically?

It does seem curious to have to do this. Is there not a way to "mirror" a spreadsheet somewhere else that is indexed?

cdsouthan commented 7 years ago

I confess to being bamboozled by, at least for InChIKey matches, what Google surfaces, from where and when or why. However, despite these grumbles nothing else can tell us exactly what's where in 0.3 secs (PubChem Entrez times out almost as often as it returns a result these days ). For example, we can pop the first Key from the wiki pages and see the below that includes an older GD sheet. N.b given your work to make the wiki pages nicely readable @mattodd please could you just put the date right at the top when you last edited it?

But wouldn't it be nice if we had a small - 'erhem - database - where we put all the structures and results with date stamps (even just manually) when they were entered and appropriate URLs to metadata?

JFTR

capture

mattodd commented 7 years ago

But wouldn't it be nice if we had a small - 'erhem - database - where we put all the structures and results with date stamps (even just manually) when they were entered and appropriate URLs to metadata?

Sorry, don't know what you mean? If we adopted an ELN like @lpatiny 's then searchability of structures would be solved, and presumably there would then be a way of compiling all the data into i) a compendium of synthesis data ("Show me all the attempts at making molecule X") and ii) a compendium of biological data (RDF triples, or something, of compound linked to property Y).

Is that what you mean by "database"? If not, then the following is off-topic:

The second of those (compendium of biological data) is harder. At the moment, manual entry of the biological data into the Master List seems to be the only way to handle things. But I wonder if we shouldn't be making it easier to do this by creating a Google Form so that anyone can:

a) Add a molecule to the Master List b) If the molecule exists in the Master List already, one is alerted to that fact and one can add data to the entry c) If the molecule does not exist already, an entry is created, along with the relevant strings and an OSM identifier can be created. Biological data can be added. (manual entry of strings, and creation of an OSM number, are trivial yet still a headache).

The reason why this is hard is that all the biological data that we might want to enter is super-context-specific, by which I mean each time bio data is entered, there also needs to be a link to a relevant assay and a page containing the raw data. No easy solve for that.

Happy to defer discussion of this to another time and place, and revert back to what we're meant to be discussing, which is how to ensure the Master List is crawled.

mcoster commented 7 years ago

It does seem curious to have to do this. Is there not a way to "mirror" a spreadsheet somewhere else that is indexed?

Am I missing something - in Google Sheets, don't you just go "File -> Publish to the Web" and keep "Published content & settings -> Automatically republish when changes are made" checked? Could set up a URL redirect or similar from a more reader friendly URL and make sure that is crawled?

drc007 commented 7 years ago

@mcoster That is what we have done, and the page that directs to the Google sheets has been indexed by Google. However it may be that Google does not index google docs themselves.

mattodd commented 7 years ago

Wait, now I'm sitting in the same confused train that @mcoster is on. If there is a public page that is mirroring the sheet, and if that is being indexed, as you say, then...? Why are we not all happy? Is it because a visitor to the page that was indexed by Google cannot navigate back to the source? (In answer to an earlier observation re Bing. We need to aim for Google. If Google's not indexing our molecules we have an issue to solve).

drc007 commented 7 years ago

@mattodd The issue is that whilst we can have a public page that mirrors the google docs spreadsheet it currently has to be created by hand. This raises a couple issues.

The mirror page could be out of date
The user has to click on a link to get to the actual master list
We had hoped to have a real-time, automatic cross-referencing system "SCINDER" and I was hoping that Google might be a component. Unfortunately it does not look reliable enough.

I used Bing to show that pages were visible to a search engine, even if Google did not find them.

cdsouthan commented 7 years ago

I think we can sleep at night, as exploratory Google pops now indicate a high incidence of findability somewhere or other. . However @drc007 may not be blissfully happy in that that he now has to remember/be most kindly prompted to refresh the mirror sheet now and again. Given the currently modest acquisition rate of new structures and results this would not seem to onerous (a monthly check?). Future automation of structure/results/data capture/PubChem submissions etc etc .. is a different topic

mattodd commented 7 years ago

OK. And if someone does find that page via a Google search is it obvious where to go to for more? i.e. to find the source? We can include the need for a periodic refresh in any future "Tech-Ops" page.

drc007 commented 7 years ago

@mattodd The page is here http://www.cambridgemedchemconsulting.com/news/index_files/c0543b419406d9e3eb6c6396a816012f-278.html#unique-entry-id-278

It will be automatically reindexed by Google and Bing if updated.

miike commented 7 years ago

It still doesn't look like Google has indexed the document on Gdocs which is shame. What do people think about mirroring the master list on Github as a markdown table? (we'd have this setup to be automated so it'd pull every 24 hours or so) This would have limited interactivity but will provide something visually similar and has a much higher likelihood of getting a higher priority from Google for indexing.

OSM Number	Internal ID	PubChem CID	MMV ID	SMILES	InChI	IChI Key	Series	Assays	PfaI EC50 (Inh)	Pfal IC50 (GSK)	Pfal IC50 (Syngene)	Pfal IC50 (Dundee)	Pfal IC50 (Avery)	Pfal (K1) IC50 (Avery)	Pfal IC50 (Ralph)	Pfal IC50 (Guy)	Pfal (K1) IC50 (Guy)	Pfal IC50 (Batra)	Pfal (K1) IC50 (Batra)	Pfal (3D7) IC50 (Broad)	Pfal (Dd2) IC50 (Broad)	PfaI EC50 uMol (Mean) Qualifier	PfaI EC50 uMol (Mean)
OSM-A-1		1233035		O=C(/C(S/1)=C/C2=C(C)N(C(C)=C2)C3=CC=C(S(=O)(N)=O)C=C3)NC1=N\C4=CC=CC=C4	InChI=1S/C22H20N4O3S2/c1-14-12-16(15(2)26(14)18-8-10-19(11-9-18)31(23,28)29)13-20-21(27)25-22(30-20)24-17-6-4-3-5-7-17/h3-13H,1-2H3,(H2,23,28,29)(H,24,25,27)/b20-13-	ODRSSOQWOHNABY-MOSHPQCFSA-N	1									3.05	4.379						3.7145
OSM-A-10				O=C([H])C1=C(C)N(C(C)=C1)C2=CC(OC)=CC=C2	InChI=1S/C14H15NO2/c1-10-7-12(9-16)11(2)15(10)13-5-4-6-14(8-13)17-3/h4-9H,1-3H3	XTPVHYRCFNPCRL-UHFFFAOYSA-N	1																#DIV/0!

cdsouthan commented 7 years ago

JFTR @mcoster your excellent video https://www.youtube.com/watch?v=X_tREXdHRE4 is relevant to a lot of above

cdsouthan commented 7 years ago

@miike I suspect the reason for G not indexing (or at least taking a long time to) what OSM puts on GD is the same as for anything else. There is very little traffic, inlinks or outlinks. By comparison @drc007 's http://www.cambridgemedchemconsulting.com/ gets a lot of traffic, and has many links, ipso facto gets crawled more. NOBA https://cdsouthan.blogspot.se/ also gets crawled for new posts very quickly

capture

But, my experiment with updating an older post with an extract from the ML took much longer (and CMCC outranks me)

capture

Note also a certain irony with the latest surfacings from the pop below. The activity right here on this issue has not only been crawled but (presumably because of the traffic and outlinks) the mention at the top of this thread now gets the highest ranking!

capture

I would therefore not bother doing much else just now, beyond promising @drc007 a beer for the occasional mirroring.

The answer to @mattodd 's Q "if someone does find that page via a Google search is it obvious where to go to for more? i.e. to find the source?" strictly speaking, is no, since there are no metadata outlinks on the master sheet. Notwithstanding, there are for most InChIKey or code number pops enough matches to various ELNs for someone be able to drill down, or at least find out who to call. Note also, since all this depends on exact matches, the person doing the pop, by definition, has already found a starting point.

mcoster commented 7 years ago

JFTR @mcoster your excellent video https://www.youtube.com/watch?v=X_tREXdHRE4 is relevant to a lot of above

Thanks, I've been loosely following this Issue and been loaded with teaching the last few days. As a newcomer, it is taking me a while to get to grips with how current issues fit in the overall context, and I haven't been involved in the Master list yet. From the standpoint of training my project students how to write their ELN entries, I will be asking them to paste in InChI strings, InChIKeys and SMILES - seems like having all three will be good for discoverability.

cdsouthan commented 7 years ago

@miike what does this master sheet edit history mean? Lots of cell changes (what type?) in early Aug but duplicated additions on 4th? So this is the Aug 4th 15.46 version right?

capture

OpenSourceMalaria / OSM_To_Do_List

Google does not index OSM Number Google Sheet #511