Closed amoeba closed 6 years ago
I think one other reason that Google may not index is lack of a sitemap.xml file that lists all of the dataset landing pages. Google could in theory follow the search results page links, but it may not properly parse those. Not sure. Metacat has a sitemap generator which probably needs to be updated to reflect our new landing page URI formats, but is likely adequate to the task of generating a full suite of dataset lists. See edu.ucsb.nceas.metacat.Sitemap.java for details.
Thanks for looking into this. I agree it would be a nice improvement. We should make sure we do all the development for this in a branch.
We are using Backbone 1.3.3 in MetacatUI right now, which I believe is the latest version; so no upgrade needed.
When we first started MetacatUI, there was not enough support for the HTML5 pushState()
method for the browser history, so we punted and left the hashtag in the URL. I think there's plenty of support now, sow we can use the Router.history.start({pushState: true})
and manage client side routes. Of course, figuring out the redirects for consistency like @amoeba said above is the trick.
Google Bot ignores the fragments by default Source
The Googlebot is responsible for crawling sites to find content and embedded links that will become part of the Google search index. It fetches and parses HTML, but it’s not a full blown browser and doesn’t have a JavaScript engine. As a consequence, it will normally ignore fragment identifiers and just look at the resource returned from the web server. Any JavaScript used by your page to load or build content will not be executed.
To use this indexing scheme you would first need to change all your (UGLY URL
) fragment identifiers to start with a !
symbol:
`http://www.example.com/ajax.html#mystate`
would need to change to a `PRETTY URL`:
`http://www.example.com/ajax.html#!mystate`
The presence of the leading `!` indicates to Google that you support this scheme.
Also, your page needs to be able supply the HTML for a given state in response to a query string parameter named _escaped_fragment_
. When the Googlebot detects a PRETTY URL
, (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form as shown below:
`http://www.example.com/ajax.html?_escaped_fragment_=mystate`
Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.
Official deprecation of AJAX Crawling (above discussed approach) since October 2015 (c) Google
History API pushState()
to ensure accessibility for a wider range of browsers (and Google Bots).
Thanks for the notes @rushirajnenuji. So does it seem pretty feasible to switch URLs like
to
?
What do you think about making the change on a branch (maybe just to the #view route), deploying it somewhere, and then pointing Google's Webmaster Tools at it to check that it works before we get much further along?
@amoeba yes, sure. I think we should give it a try. I'll create a branch and start working on it.
I compiled a list of the browser versions our visitors use on arcticdata.io and search.dataone.org. I cut off after 1%, so this is not a full list.
This should be helpful for checking if we can support the majority of our users with the pushState()
change.
Browser | Browser Version | Sessions | % Sessions | New Users |
---|---|---|---|---|
Chrome | 60 | 902 | 9.49% | 445 |
Chrome | 56 | 898 | 9.44% | 443 |
Chrome | 61 | 827 | 8.70% | 438 |
Chrome | 59 | 757 | 7.96% | 368 |
Chrome | 58 | 633 | 6.66% | 335 |
Chrome | 55 | 546 | 5.74% | 283 |
Chrome | 57 | 463 | 4.87% | 229 |
Chrome | 62 | 444 | 4.67% | 210 |
Firefox | 52 | 281 | 2.96% | 151 |
Internet Explorer | 11 | 280 | 2.94% | 235 |
Firefox | 54 | 207 | 2.18% | 112 |
Firefox | 53 | 194 | 2.04% | 120 |
Firefox | 51 | 193 | 2.03% | 104 |
Firefox | 50 | 170 | 1.79% | 78 |
Safari | 10 | 170 | 1.79% | 105 |
Firefox | 56 | 156 | 1.64% | 102 |
Firefox | 55 | 154 | 1.62% | 100 |
CrossrefEventDataBot | (not set) | 143 | 1.50% | 142 |
Chrome | 63 | 116 | 1.22% | 70 |
Firefox | 57 | 100 | 1.05% | 67 |
Internet Explorer | 8 | 97 | 1.02% | 97 |
Browser | Browser Version | Sessions | % Sessions | New Users |
---|---|---|---|---|
Chrome | 56 | 2,108 | 11.24% | 533 |
Chrome | 58 | 1,770 | 9.44% | 357 |
Chrome | 57 | 1,613 | 8.60% | 241 |
Chrome | 59 | 1,499 | 7.99% | 231 |
Chrome | 60 | 1,448 | 7.72% | 298 |
Chrome | 61 | 1,289 | 6.87% | 308 |
Chrome | 55 | 997 | 5.32% | 232 |
Safari | 10 | 768 | 4.10% | 371 |
Chrome | 62 | 732 | 3.90% | 177 |
Firefox | 51 | 425 | 2.27% | 164 |
Internet Explorer | 11 | 414 | 2.21% | 280 |
Firefox | 54 | 395 | 2.11% | 120 |
Firefox | 52 | 385 | 2.05% | 193 |
Firefox | 53 | 292 | 1.56% | 146 |
Chrome | 63 | 250 | 1.33% | 58 |
Firefox | 57 | 245 | 1.31% | 78 |
Firefox | 56 | 210 | 1.12% | 105 |
Firefox | 55 | 188 | 1.00% | 93 |
Firefox | 50 | 177 | 0.94% | 102 |
Full spreadsheet here: https://docs.google.com/spreadsheets/d/1QbcngOLWGLB58Nisa9RJM7sblCQeN2DSHRTkVdd11e0/edit?usp=sharing
Thanks for compiling that list, @laurenwalker. That looks quite promising for us to use pushState
, and showing users without that functionality enabled/available to the app a notice.
@rushirajnenuji and I are still working on this testing whether it's feasible to make this change in the router at all. He made the initial change to the codebase on a branch and he and I are currently figuring out the necessary Apache config.
I did some testing and it seems like nothing (so far) in MetcatUI is problematic but we are struggling to find the right Apache config to make stateless entry work (e.g., be able to enter the app at /view/{PID}
. I think we just have conflicting Apache config directives so I don't expect this to be too hard to address.
I think I figured out the Apache config issues preventing part of this from working, though my understanding of Apache configs is apparently not good enough to know why.
Now I'm struggling with Require.js and its interaction with the rewrite rules. When this code is called,
require(['views/MetadataView']...
the request that's sent is to
https://handy-owl.nceas.ucsb.edu/view/js/views/MetadataView.js?v=2.0.0RC4
while the path the file exists on is just /js/views/MetadataView.js?v=2.0.0RC4
. Require.js has configuration and it looks like this might be overridable.
Great progress: I figured out that the issues was simple in that Require.js assumes relative URLs are relative to your URL (duh, right?). So I went through and made every URL referencing JSS/CSS/etc I could find absolute (js/
=> /js
) and so on. Also had to make a tweak to the MetadataView but otherwise this appears to be sorta working: https://handy-owl.nceas.ucsb.edu/view/urn:uuid:3ec5f245-ebe8-4870-bc0c-73d5f2ec7a87
This is pretty close to where it can be discussed with the group.
The relative URLs are relative to the baseUrl
attribute stated in the call to requirejs.config()
. Maybe the docs can help with this, but it sounds like sticking with relative URLs helps when you use their optimizer, which we have a ticket for.
Anyway, is the Router
matching the /view/{pid}
correctly now that you're using pushState
in that branch?
The relative URLs are relative to the baseUrl attribute stated in the call to requirejs.config().
That's not quite what I was seeing, but we could take a look. They ended up being relative to the current URL path. Since we are artificially putting the user on /view/{pid}
or /profile/{profile}
, a Require.js call like require(['components/x.js'])
results in a call in the browser to /views/components/requires.js
which is not where the JS file is. Changing everything to absolutes is the only fix I've found works so far.
That level of monkeying around worries me so we still have more to do see if this would work and should be implemented.
Anyway, is the Router matching the /view/{pid} correctly now that you're using pushState in that branch?
Yes, that works (had to change code in the route). I'll push my work up so we can see the diff.
So we can look together and get some feedback, I've pushed a commit to @rushirajnenuji 's support-shareable-urls
branch with what I think is the minimum required set of changes to make this work: https://github.com/NCEAS/metacatui/commit/059bb0332592a77f8b2e642839cc776d2bbae966
Note: Making of some URLs absolute paths was unavoidable. This presents a liability in that we either:
We talked about this on our dev call today and it was mentioned that I never shared the Apache config. We took a basic Apache config and I had to add:
<Directory "/var/www/handy-owl.nceas.ucsb.edu">
Options Indexes MultiViews FollowSymLinks
AllowOverride None
Require all granted
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.html$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.html [L]
</IfModule>
</Directory>
I thought I had to have a RewriteCond that ignored the metacat context but I'm not seeing that in my config and it works. To be investigated further.
Re: bullet points 1 and 2 in the second above comment, we discussed strategies for this on our last dev call and one of those strategies was to factor our the deployment path into an app-wide config and use that everywhere the app tries to request static files (images, js, css).
I..
baseUrl
, which is what we'd modify for deployments that are not at the web root./components/...
to <%=MetacatUI.baseUrl%>./components/...
and this worked nicelySo this appears to be a go. I know @csjx is refactoring this part of of the app so I'll hold off on this for now.
Great news: Googlebot can render MetacatUI MetadataView
pages perfectly AFAICT.
Wrote this work and the work on #432 up into a Google Doc so we could discuss on a dev call and also get early feedback from Dave V: https://docs.google.com/document/d/1yU-d-aFdtiSB91Wk0sFj1xthW8skBPof9qKwx00bjxE/edit?usp=sharing
Got some really good feedback, but no comments indicating a change in course here.
I was doing some searching around while continuing this work today and it looks like Apache 2.2+ has FallbackResource
which does all of the work of the ModRewrite chunk I had in the test config before.
Before:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /metacatui
RewriteRule ^index\.html$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /metacatui/index.html [L]
</IfModule>
After
FallbackResource /index.html
🎉
Awesome! Can it be restricted to only the url path in which MetacatUI is deployed? (So that 404 errors are returned for other nonexistent resources.)
Lauren
On Wed, Jun 6, 2018, 6:33 PM Bryce Mecum notifications@github.com wrote:
I was doing some searching around while continuing this work today and it looks like Apache 2.2+ has FallbackResource http://httpd.apache.org/docs/trunk/mod/mod_dir.html#fallbackresource which does all of the work of the ModRewrite chunk I had in the test config before.
Before:
<IfModule mod_rewrite.c> RewriteEngine On RewriteBase /metacatui RewriteRule ^index\.html$ - [L] RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule . /metacatui/index.html [L] </IfModule>
After
FallbackResource /index.html
🎉
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NCEAS/metacatui/issues/431#issuecomment-395235025, or mute the thread https://github.com/notifications/unsubscribe-auth/AGVeFtBaP5oTkRquvd12bdbdpJZa_dtPks5t6FixgaJpZM4RSZff .
Yeah. I'm testing locally with metacatui deployed in a subdir like:
<Directory "/usr/local/var/www/metacatui">
FallbackResource /metacatui/index.html
</Directory>
and it all seems to work. The config will still need to hard-code where MetacatUI is deployed here and also at least once within the app itself. Still figuring out how this will work and work well.
Closing as there's no work left to be done. Implementation work is being done in other Issues under the https://github.com/NCEAS/metacatui/labels/routing%20refactor label.
We've talked about this as a group numerous times over the years and the need to reconsider this has cropped up recently due to some GeoLink work @rushirajnenuji and I are doing. We think MetacatUI dataset landing pages (
#view/{PID}
) are not indexed by Google (at least in part) due to the use of hash fragments in our URLs. Another possibility may have to do with the use of a client-side web app though my understanding is that Google has sufficient ability to index client-side webapps as of a few years ago.Concerns:
Changing our URL structure from
.../#view/{PID}
to.../view/{PID}
will break many URLs (mostly DOIs) so the solution needs to properly redirectThere may be some browser compatibility issues
From: http://backbonejs.org/#Router:
Are we using a new enough version of Backbone.js? If not, can we migrate up?
Specific tasks here include: