Documentation doesn't show up well in Google

ragerdl commented 10 years ago

From rage...@gmail.com on January 15, 2014 10:48:38

It seems that the fancy xdoc manuals for 6.3 and beyond aren't accessible via search engines (at least, not the version hosted at centaur). We can see topics from 6.2 and earlier via the following search: https://www.google.com/search?q=site%3Afv.centtech.com+hide

But if we restrict the search to 6.3, we see no results: https://www.google.com/search?q=site%3Afv.centtech.com+hide#q=site:fv.centtech.com+hide+6.3

Now, Google is notoriously bad for searching ACL2 documentation (often presenting older topics first, and not providing complete results [try searching for "6.2" and "hide" in the same query]). That being said, it'd be good if we could somehow make these topics findable via Google. Here is a possible suggestion:

(1) Bring back the HTML directory, and have every page contain the content it used to, AND

(2) In big bold letters at the top, present the user with a link to the fancy xdoc page with the topic set to the appropriate name.

This would allow users to find the page and then also easily redirect to its fancy counterpart.
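
A minimal sketch of what such a page could look like, written in Python purely for illustration. The function name `render_static_page` and the page layout are assumptions, not part of XDOC; the `?topic=` URL shape follows the fancy-viewer URLs quoted in this thread. The :short and :long text is treated as plain text here, which is a simplification.

```python
from html import escape
from urllib.parse import quote

# Base URL of the fancy viewer, as quoted in this thread.
FANCY_BASE = "http://fv.centtech.com/acl2/latest/doc/"


def render_static_page(key: str, short: str, long: str) -> str:
    """Return a plain HTML page for one topic, with a bold link to the fancy viewer on top."""
    fancy_url = f"{FANCY_BASE}?topic={quote(key)}"
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(key)}</title></head><body>\n"
        # (2) the prominent link to the fancy counterpart
        f'<p><b><a href="{escape(fancy_url)}">View this topic in the full manual</a></b></p>\n'
        # (1) the content search engines can index
        f"<h1>{escape(key)}</h1>\n"
        f"<p>{escape(short)}</p>\n"
        f"<div>{escape(long)}</div>\n"
        "</body></html>\n"
    )
```

Because the page carries its own text and only links (rather than redirects) to the fancy viewer, a crawler that ignores JavaScript would still see the content.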

We might be able to get away with leaving out the content and automatically redirecting them, but I kind of doubt it. It's surprising that Google doesn't automatically crawl http://fv.centtech.com/acl2/latest/doc/ , so if this isn't actually broken, please close the issue.

Original issue: http://code.google.com/p/acl2-books/issues/detail?id=132

ragerdl commented 10 years ago

From rage...@gmail.com on January 15, 2014 08:49:15

Technically the HTML pages probably don't need to be pretty -- they just need to have both :short and :long sections.

ragerdl commented 10 years ago

From matthew.j.kaufmann@gmail.com on January 15, 2014 09:34:29

For what it's worth, if I put the following into google I get a few hits -- but not nearly enough. So something is working, but only a little. Hmmmm.

book site: http://fv.centtech.com/acl2/latest/doc/

ragerdl commented 10 years ago

From matthew.j.kaufmann@gmail.com on January 15, 2014 09:37:18

In case this is relevant: The way I now search the documentation is typically in the ACL2-Doc Emacs browser, with the "s" (or, for regular expressions, "S") command for the first occurrence, and "n" for subsequent occurrences. (I know, I know, not everyone wants to use Emacs like this....)

ragerdl commented 10 years ago

From matthew.j.kaufmann@gmail.com on January 15, 2014 09:42:03

Sorry about the multiple messages -- stuff keeps occurring to me....

If anyone would use it, I would probably be happy to add a new command to ACL2-Doc (and this should be easy to do) that displays a list of all topics that contain a given string or regular expression.

ragerdl commented 10 years ago

From jared.c....@gmail.com on January 15, 2014 10:01:20

I'm removing myself as the owner of this issue. If other folks want to pursue SEO for XDOC that's just fine, but I have very little interest in this.

Owner: ---

ragerdl commented 10 years ago

From jared.c....@gmail.com on January 17, 2014 06:22:09

Summary: Documentation doesn't show up well in Google (was: [xdoc] Accessibility from google)
Labels: -Priority-Medium Priority-Low

ragerdl commented 10 years ago

From rage...@gmail.com on January 23, 2014 10:36:12

It's important [to me but I believe also to others] that we can search the :long sections of the documentation outside of emacs, so I'm bumping this up to medium priority.

That being said, the ACL2 homepage reminds me that it might take a bit post 6.4 release for Google to update, so I'll follow up on this later. Maybe we don't really have a problem.

Owner: rage...@gmail.com
Labels: -Priority-Low Priority-Medium

ragerdl commented 10 years ago

From jared.c....@gmail.com on January 23, 2014 11:21:31

For what it's worth...

Matt's ACL2-Doc tool offers some kind of substring-based search feature. You seem to already know about this, but maybe you should use it when name/short searches aren't giving you what you want.
For the offline (non server-supported) version of the fancy viewer, it would be straightforward to add dumb substring searching of the :long sections, because you have the full :long data available. Along with importance ordering, this might not be too bad.
For the server-supported fancy manual, this isn't as easy to implement because the browser doesn't have the data loaded, and you ideally don't want to download the 20+ MB xdata file. We could do something lame like make you download the whole thing if you want to search it. Or we could add more server-side code to support searching by the server, but I don't really want to write any more server-side code, because it complicates deployment.
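
For illustration, the "dumb substring search with importance ordering" idea might look roughly like the Python sketch below. The `Topic` record and its `importance` field are hypothetical stand-ins; the real viewer is JavaScript and its data layout is not described here.

```python
from typing import NamedTuple


class Topic(NamedTuple):
    key: str
    long: str
    importance: int  # hypothetical ranking score; higher sorts first


def search_long(topics: list[Topic], needle: str) -> list[str]:
    """Return keys of topics whose :long text contains needle, most important first."""
    needle = needle.lower()
    hits = [t for t in topics if needle in t.long.lower()]
    # Sort by descending importance, breaking ties alphabetically by key.
    hits.sort(key=lambda t: (-t.importance, t.key))
    return [t.key for t in hits]
```

This is a linear scan over every topic, which is why it only fits the offline case where all of the :long data is already loaded.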

In the long run I want to do something smarter than simple substring-based searching. I have some preliminary code (not committed) for creating word tables, etc., and some thoughts about how to encode these efficiently for use in both the online and offline versions of the manual. But I haven't worked on this in a couple of months and I don't know when I'll get back to it. (The dumb :short string search seemed useful enough that I'm not sure it's worth investing a lot of effort in :long string searching.)
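
To make the "word table" idea concrete, here is a generic inverted-index sketch in Python. It is not the uncommitted code mentioned above, and the tokenization rule and function names are assumptions made only for this example.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def build_word_table(topics: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of topic keys whose text contains it."""
    table: dict[str, set[str]] = defaultdict(set)
    for key, text in topics.items():
        for word in WORD_RE.findall(text.lower()):
            table[word].add(key)
    return dict(table)


def lookup(table: dict[str, set[str]], query: str) -> set[str]:
    """Return the keys of topics that contain every word in the query."""
    words = WORD_RE.findall(query.lower())
    if not words:
        return set()
    result = set(table.get(words[0], set()))
    for word in words[1:]:
        result &= table.get(word, set())
    return result
```

A table like this can be built once at manual-build time, so a lookup does not need the full :long text at query time.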

I'm of course happy to share what little I have with anyone who wants to implement a search feature.

ragerdl commented 10 years ago

From matthew.j.kaufmann@gmail.com on January 23, 2014 13:00:41

I'm pretty sure that the google-based doc search would be working by now, if it would ever work. I just don't think it's set up to search the xdoc-based manual.

I've updated the ACL2 home page to remove the google-based instructions that were there, instead clarifying what can be searched where. (Thanks to David and Jared for feedback, though responsibility for perhaps over-clarifying is mine.)

ragerdl commented 10 years ago

From jared.c....@gmail.com on May 30, 2014 21:55:59

I don't know anything about any of this, but I messed around with trying to get Google to index xdoc manuals tonight. I was able to use their webmaster tools, https://www.google.com/webmasters/tools/home , to claim ownership of fv.centtech.com (which involved adding a file that they gave me to the web server's root, to prove that I had access to the web server, I guess).

Once that was done, I was able to tell the "googlebot" to fetch URLs such as http://fv.centtech.com/acl2/latest/doc/?topic=ACL2____TOP , and it showed me a picture of the site it had rendered. It looks like they understand the javascript well enough to fetch the content, etc., at least when given the explicit topic to look up.

It doesn't seem like it has tried to index the site yet. I don't see any indication of when I can expect that to happen, if ever. But I did mess around with as much as I found---I told it about URL parameter settings, and tried (unsuccessfully) to use the "data highlighter" to help inform their algorithm about how things work, but it wouldn't work because apparently the page hasn't been indexed yet.

One thing I did stumble upon that seems like it might lead to a solution is their Sitemaps feature. I tweaked XDOC ( r2780 ) so that it now generates a (preliminary) sitemap.xml file. This file will need a quick search/replace to insert the right URL, but in principle it seems like these simple site maps might be enough to help Google understand what pages are available.
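
As a rough picture of what a simple sitemap contains, here is a Python sketch that emits one `<url>` entry per topic key. It is not the XDOC code from r2780; the function name and the choice to pass the base URL in as a parameter are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, topic_keys: list[str]) -> str:
    """Return sitemap XML listing one ?topic= URL per topic key."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for key in topic_keys:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}?topic={quote(key)}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

For example, `build_sitemap("http://fv.centtech.com/acl2/latest/doc/", ["ACL2____TOP"])` yields a single `<loc>` pointing at the topic URL quoted earlier in this comment.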

I don't know if any of this will work. The next step is to build a proper sitemap for a current manual and try to get Google to take it. At best, this might get it to index the fv.centtech.com version of the pages. Maybe if that works, someone else can do the same thing to get the "official" UTCS versions working, too.

ragerdl commented 10 years ago

From rage...@gmail.com on June 01, 2014 14:13:20

fwiw, I tried a search today for "set-guard-checking" under site: http://fv.centtech.com/acl2/latest/doc/ and it came back with nothing. I looked at sitemaps a long time ago and came to a similar conclusion, that Google's already pretty good at parsing stuff. Yet, they fail on our site, so thanks for your investigation and development thus far.

ragerdl commented 10 years ago

From rage...@gmail.com on July 15, 2014 17:57:00

I looked at http://fv.centtech.com/acl2/current/doc/sitemap.xml after reading up on sitemaps, and it looks just fine.

I also ran it through a couple sitemap xml validators and did not receive any warnings or errors.

The only idea I have right now is to try adding the "index.html" before the question mark.

ragerdl commented 10 years ago

From rage...@gmail.com on July 15, 2014 18:19:05

So I'm using utcs to host a mirror of the sitemap, with the URL replaced with "index.html" included. That was easy, thanks!

Now that I've uploaded the sitemap using Google webmaster tools, it says that there are 8,907 pages to crawl (which it hasn't yet). This is consistent with Jared's story, so I guess we'll see whether I encounter the same problem.

Here's a nice topic describing how to use sed to perform a search and replace on sitemap.xml: http://www.cyberciti.biz/faq/unix-linux-replace-string-words-in-many-files/
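
The same search and replace can be sketched in Python instead of sed. The function names are made up for this example, and the old and new URL prefixes are left as parameters because the exact strings in the generated file are not spelled out here.

```python
from pathlib import Path


def retarget_urls(text: str, old_prefix: str, new_prefix: str) -> str:
    """Replace every occurrence of old_prefix with new_prefix."""
    return text.replace(old_prefix, new_prefix)


def retarget_file(path: Path, old_prefix: str, new_prefix: str) -> None:
    """Rewrite the sitemap file at path in place."""
    text = path.read_text(encoding="utf-8")
    path.write_text(retarget_urls(text, old_prefix, new_prefix), encoding="utf-8")
```

Inserting "index.html" before the question mark is then a matter of passing a prefix ending in `/?topic=` as `old_prefix` and one ending in `/index.html?topic=` as `new_prefix`.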

ragerdl commented 10 years ago

From rage...@gmail.com on July 16, 2014 09:22:58

Here's a moderately interesting error message that I get when running the Google "Structured Data Testing Tool" that's part of Google labs.

XDOC (Loading) www.cs.utexas.edu/users/ragerdl/acl2-manual/index.html The excerpt from the page will show up here. The reason we can't show text from your webpage is because the text depends on the query the user types.

ragerdl commented 10 years ago

From rage...@gmail.com on July 17, 2014 03:49:16

This search result may be indicative of what's going on. Basically, my hypothesis is that Google sees the "loading" messages, thinks the page is rendered, and then saves that in its database. Thus, none of the links to topics are saved in the Google database. https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=site%3Ahttp%3A%2F%2Fwww.cs.utexas.edu%2Fusers%2Fragerdl%2Facl2-manual%20loading Screenshot attached

We may need to print out ugly html files that contain all of the text for keyword-based indexing but function as redirects to the appropriate index.html?topic=foo.

Attachment: loading.png

ragerdl commented 10 years ago

From rage...@gmail.com on July 17, 2014 03:52:45

Here's an article that describes our problem and some solutions: http://www.smashingmagazine.com/2011/09/27/searchable-dynamic-content-with-ajax-crawling/

ragerdl commented 10 years ago

From rage...@gmail.com on July 20, 2014 10:41:57

So that someone else doesn't pick this up and decide to knock it out: I have a patch to commit that will make this work. I just haven't had time to commit it yet.

ragerdl commented 10 years ago

From rage...@gmail.com on July 21, 2014 18:53:50

It appears that my patches (which aren't linked here for some reason) still allow the manual to build without an error. I plan to test xdata2html.pl and how its results interact with Google later (tomorrow?).

ragerdl commented 10 years ago

From rage...@gmail.com on July 22, 2014 08:11:29

This is mostly done. I plan:

(1) To verify that the javascript redirection trick works

(2) To email kaufmann a command so that we can see if this works in current

After the 7.0 release (whenever that may be), we will need to figure out whether we still want current to contain HTML files. Our instinct for the moment is "no," but we can decide then. We might just make a decision based upon how google pageranks 7.0 vs current (if they're the same, then we would unlink current HTML files)

Clop suggests that it might be best to make the search-engine-optimized (HTML) docs match the released version, on the grounds that users who search for docs should find the docs for the version they are most likely to be using, and for most users that's probably the released version.

A counterpoint, though, is that google is probably going to be a substitute for :long searching for a while. But, those of us who do development have our ways of getting around the :long issue anyway.

ragerdl commented 10 years ago

From rage...@gmail.com on July 23, 2014 21:15:56

@Sebastian, do you have some time to help debug this issue? It appears that even though I've got a 5 second delay before redirecting in the javascript, Google still refuses to index the html pages (which contain that redirecting javascript).
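
For anyone trying to reproduce the setup, a page of the kind described (indexable text plus a delayed javascript redirect) might be generated along these lines. This is a Python sketch under assumptions: the function name and page layout are invented for illustration and are not what xdata2html.pl emits.

```python
import json
from html import escape


def render_redirect_page(key: str, body_text: str, target_url: str,
                         delay_seconds: float = 5.0) -> str:
    """Return an HTML page with indexable text that redirects after a delay."""
    delay_ms = int(delay_seconds * 1000)
    # json.dumps produces a properly quoted JavaScript string literal.
    script = (
        f"setTimeout(function () {{ window.location.href = "
        f"{json.dumps(target_url)}; }}, {delay_ms});"
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(key)}</title>\n"
        f"<script>{script}</script></head><body>\n"
        f"<p>{escape(body_text)}</p>\n"
        "</body></html>\n"
    )
```

The delay is the only knob here; whether a crawler indexes the text or follows the redirect depends on how long it waits, which is the behavior in question.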

ragerdl commented 10 years ago

From rage...@gmail.com on July 23, 2014 22:19:35

Maybe I should make the javascript redirect occur on a trigger (move the mouse?)... that might cheese the search engine according to this page: http://www.thoughtspacedesigns.com/blog/post/whats-this-googlebot-processes-javascript/ Alternatively, I might get rid of the redirect and just put a big link at the top that says "access in the full manual"

ragerdl commented 10 years ago

From rage...@gmail.com on July 28, 2014 06:45:14

Moving this to 6.6 as a signal to Matt that I don't plan to have this done by the 6.5 release.

FWIW, I think I'd have something worth running for 6.5 if I had either (1) made a backup copy of the working file before editing it again or (2) been using a git checkout of the books (not beating a dead horse -- just making a note to myself as a "real" example).

Labels: Milestone-Release6.6

ragerdl commented 10 years ago

From rage...@gmail.com on July 31, 2014 19:54:34

Linking revision for update since I didn't use the right magic phrase to auto-link: https://code.google.com/p/acl2-books/source/detail?r=2923

ragerdl commented 10 years ago

From rage...@gmail.com on August 04, 2014 10:14:55

Updating the status in hopes of getting some help from someone who knows about SearchEngineOptimization:

-- The current implementation can be viewed at http://www.cs.utexas.edu/users/ragerdl/acl2-manual

-- If one does a search for "site: http://www.cs.utexas.edu/users/ragerdl/acl2-manual bind free", one receives a reasonable result

-- However, if one searches for "site: http://www.cs.utexas.edu/users/ragerdl/acl2-manual set guard checking", the best search result is missing. This is because only 3k of the 9k pages have been indexed by google. I've just now submitted the acl2-manual/HTML/ page and its direct links to the search index, but I don't expect that to help much

-- I don't remember for sure, but I think I chose a 5 second redirect delay, because when I "fetched as Google", 3s resulted in seeing the redirected page, but 5s resulted in Google not waiting for the redirect (and seeing the current page)

-- Searching for "acl2 6.4 bind free" doesn't return my page. Is this because my version of the manual doesn't have as many cross-references since "subtopics" aren't directly linked? I doubt this is the problem, because the old manuals didn't show the subtopics. Maybe it's just that my pagerank is bad (although Google states that this is an outdated concept).

If I knew what was necessary to fix this problem, I'd be willing to do the programming to make it work. If anyone else wants to play with it, you'll want to modify books/xdoc/fancy/xdata2html.pl and build doc/top (which copies that perl script to books/doc/manual).

ragerdl commented 10 years ago

From rage...@gmail.com on August 06, 2014 10:19:33

FYI, @sjcjoosten has agreed to have a look at this in a couple weeks.

sebastian commented 10 years ago

Heads up, you got the wrong @sebastian.

ragerdl commented 10 years ago

Hi,

Thanks for letting me know!

David


ragerdl commented 10 years ago

We've got a work-around now that results in a php version of index.html that should be parsable by search engines. It's in my inbox -- if someone else has the desire to muck with getting it into our build system, I can forward it. Else, I'll take care of it at some point.

Adding to 7.0 milestone since I think the hardest part is now done.

sjcjoosten commented 10 years ago

It is running (and being indexed) right here: http://cs.ru.nl/~bjoosten/acl2db/index.php and can be generated from the xindex.js and xdata.js files. It should also be possible to get the index.html and the index.php running side by side, using the same images and stylesheets.

ragerdl commented 9 years ago

This still seems to be broken. Moving back to 7.1 milestone.

Example search:

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=acl2+%22set-guard-checking%22+6.5

sjcjoosten commented 9 years ago

Hi David, it seems that google has an old version cached, namely: http://webcache.googleusercontent.com/search?q=cache:FTc1UahWH0AJ:www.cs.utexas.edu/users/moore/acl2/manuals/current/manual/index-seo.php/COMMON-LISP____DOCUMENTATION%3Fpath%3D828+&cd=1&hl=en&ct=clnk . On January 4th, at 19.39 European time, all the links on the left were to a cs.ru.nl address. I'm guessing that this is an old error, since the current version of the page seems to work just fine. We may just have to wait for google to re-index this page.

For those who are site-managed, try giving Google this URL to index, including direct sublinks, as it contains a link to every page: http://www.cs.utexas.edu/users/moore/acl2/manuals/current/manual/index-seo.php/* (in fact, putting this link here may also help to get it crawled :-P)

ragerdl commented 9 years ago

Ah ha, indeed, it does seem that the cached page is out-dated... odd. So, we shall bide our time some more. At some point we'll want to take down the page hosted in cs.ru.nl, but I don't think it's harming anything right now?

sjcjoosten commented 9 years ago

I added several links from the cs.ru.nl domain to the cs.utexas.edu page. Most importantly, I'm now telling search engines that the canonical version is hosted at utexas. This should boost the utexas pages, but it is most important that those pages get fixed (it currently shows a database error).
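
One common way to declare a canonical version is a `rel="canonical"` link element in the page head; the comment above does not say which mechanism was actually used, so the Python sketch below is only an illustration, and the helper name is made up.

```python
from html import escape


def canonical_link(url: str) -> str:
    """Return a <link rel="canonical"> element pointing at url."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'
```

A mirrored page would place the returned element inside its `<head>`, with `url` set to the matching page on the preferred host.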

As a final stage of phasing out the cs.ru.nl pages, I can add an http redirect, but I don't feel comfortable redirecting people to a broken page.

ragerdl commented 9 years ago

Thanks for the heads up about the database error. We had it working at some point... now just to get it working again....

ragerdl commented 9 years ago

FYI, http://www.cs.utexas.edu/users/moore/acl2/manuals/current/manual/index-seo.php/* is working again. I've resubmitted it to the google index. I'm going to leave this issue open until there's evidence of it actually working on the UTCS servers.

ragerdl commented 9 years ago

17 months later:

http://www.cs.utexas.edu/users/moore/acl2/manuals/current/manual/index-seo.php

Rejoice!

acl2 / acl2

Documentation doesn't show up well in Google #133