LisaGee closed this issue 8 years ago.
@shawnbot - got thoughts?
One glaring issue here is that without at least some subset of the school data columns in this repo, we can't statically generate a sitemap for Google to crawl. If we had the school names and IDs only, we could at least create one that would list the URLs of each school.
The only benefit I can see of a sitemap here is ensuring all of the school-specific URLs are listed someplace so that they get indexed. Like @shawnbot says, without access to the data (at least enough to generate the URLs), we can't create a static page for this. Some thoughts:
I also would need to know what the school "permalink" is. Is it just https://college-choice.18f.gov/school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health (with the expectation that the hostname will be different)?
Yeah @dnesting, I think we'd need to check in a CSV with just the name and id columns of the schools we care about. The permalink URLs could then be generated in Jekyll with something like:
/school/?{{ school.id }}-{{ school.name | replace: ' ', '-' }}
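For reference, the same permalink scheme sketched as a plain JavaScript helper (the `schoolUrl` name is hypothetical; the Liquid template above is what would actually run in Jekyll):

```javascript
// Hypothetical helper mirroring the Liquid template above:
//   /school/?{{ school.id }}-{{ school.name | replace: ' ', '-' }}
function schoolUrl(school) {
  // Replace every space in the name with a hyphen, as the Liquid
  // `replace` filter does, then join with the numeric id.
  return "/school/?" + school.id + "-" + school.name.replace(/ /g, "-");
}

// Example:
// schoolUrl({ id: 117803, name: "Los Angeles County College of Nursing and Allied Health" })
//   → "/school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health"
```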
I think you want a query like: https://ccapi-dev.18f.gov/v1/schools?fields=id,school.name&per_page=10
@dnesting Do you have what you need? If you need it, @hollyallen can create a .csv file for you with all the school names.
I chatted with @ultrasaurus about this yesterday and I think the approach we're going to run with looks like this:
@dnesting Can you point @meiqimichelle to where the file is so she can deploy it into our repo? This should be done today if at all possible. Thanks.
The file is generated by the make_sitemap.rb script in pull request #3. What we can do is finalize that PR as it stands (defer the copy-to-s3 logic until a later PR) just so that we have something checked in. I'll do that now. We also need a robots.txt change to point search engines to the sitemap.xml URL that we end up using. Those last two pieces (what bucket/name to use for the file, and what the user-visible URL is corresponding to that) were all that we were waiting on for this.
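The actual generator is the make_sitemap.rb script in PR #3; just to illustrate the shape of the file it produces (the function name and the id/name input format here are assumptions, not taken from the real script), a sitemap is essentially a list of `<url><loc>` entries inside a `<urlset>`:

```javascript
// Hypothetical sketch of the sitemap.xml structure; the real
// generator is make_sitemap.rb (Ruby) from PR #3.
function buildSitemap(baseUrl, schools) {
  var urls = schools.map(function (school) {
    // Same permalink scheme as the school pages: id-Hyphenated-Name.
    var loc = baseUrl + "/school/?" + school.id + "-" + school.name.replace(/ /g, "-");
    return "  <url><loc>" + loc + "</loc></url>";
  });
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    urls.join("\n") +
    "\n</urlset>"
  );
}
```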
@dnesting Does it matter where we put the sitemap.xml, or do we just need a place to host a file? If we just need a place, can we use the S3 bucket that we're using to host the full data dump and the small sample CSVs? If yes, @diego- knows how to put files there (where it is, etc.).
I can change the robots.txt after we have the URL. Do you know the proper robots.txt syntax or should I investigate?
Generally, I’ve always placed the robots.txt file in the root directory of the website, which I still believe will also be in an S3 bucket. Can we do that?
@LisaGee I'm not worried about the robots.txt placement -- it will go with the rest of the site, and it is already in root. I'm confused about the XML file, then -- I was thinking we wanted it somewhere else because it is big, and maybe we wouldn't want to push it with the rest of the app every time we update the site. I suppose that doesn't mean we can't place it in the same S3 bucket. In any case, for now I'd rather stick it somewhere so we can close this issue. Putting it with the data files seems OK to me. Maybe that's the S3 bucket we're intending to use for the site in the long run anyway.
Got it. I think I've typically put the sitemap.xml in root too, but I defer to anyone else who has a strong POV.
Agree. What do you think, @dnesting? Can we just put sitemap.xml at root?
Ack, sorry for not responding on this earlier. The placement of sitemap.xml doesn't matter because robots.txt will contain a link to it. So put it wherever it's convenient. I sent the robots.txt directive in e-mail but I should capture it here:
Sitemap: https://whatever/sitemap.xml
The robots.txt file must be placed at the root of the final web site (http://collegescorecard.ed.gov/robots.txt).
You can search and find the site, but Google can't find specific pages.
The URL appears to be rendered incorrectly by the Google search robot, as seen here
@ultrasaurus How did you get to that point? As near as I can tell the sitemap does not result in a request for the erroneous https://collegescorecard.ed.gov/https://collegescorecard.ed.gov/school/?166027-Harvard-University. Did you arrive at that test from a sitemap link or is it possible you just pasted a full URL into the "render as google" form that is expecting a path?
As to the actual problem (pages aren't appearing in the Google index), I suspect this is due to the need for an API call to render the content of the page; a naive crawl of the page will see the same content regardless of query string. I thought Google was doing a more complex rendering of pages so as to see this content. It's possible this more expensive rendering process hasn't happened yet. I'm going to try reading up some more about what we should expect to happen here.
Also note that for Webmaster Tools we probably want to add the HTTPS URL, not (just) the HTTP URL. Google considers them wholly different sites. @meiqimichelle
@dnesting good catch on http.
@meiqimichelle can you add us to the https version of the site?
@dnesting you are correct -- copy/paste error on my part, now I'm seeing the re-direct to https
@LisaGee It seems like all the stuff in this issue has been done, but I'm not 100% sure. Can you verify and (hopefully) close the issue? Thanks!
I don't think this is working as expected yet. Google has only indexed the (~empty) /school/ page and has not indexed the individual schools beneath it. I believe this is chiefly because the page content is generated entirely with Javascript, which is something search engines have traditionally eschewed. Google effectively sees them all as identical, because the page itself is identical.
However, I am aware that search engines have started rendering Javascript-based pages and I'm hoping we can still make this work. This morning I went in and re-submitted the sitemap under the https:// site in the hopes that this will get Google to try harder with these URLs, and I used the "Fetch as Google" feature to retrieve https://www.google.com/webmasters/tools/googlebot-fetch-details?hl=en&authuser=2&siteUrl=https://collegescorecard.ed.gov/&path=school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health&timestamp=1447092565067, which succeeded and displayed a properly rendered page. I'm hopeful, but we'll need a few days maybe to see any results.
@dnesting ah thanks! effective and informative :+1: :us:
Yeah, I keep hearing about how Google is indexing dynamic content, but this doesn't appear to be the case for our school pages and I don't know how to ensure that it does.
Here's a kind of crazy thought: could we generate static pages for each school with Jekyll using a subset of the CSV data as input? It'd take a long time to build, and it would require some work translating the data, but it's certainly possible.
Another way of putting it would be: If indexing the school pages turns out to be impossible with our current approach, then we might consider moving this to an entirely static model. That would mean eliminating a lot of JavaScript, but I actually kind of see that as a good thing for the long-term health and maintainability of the project.
Discuss! :trollface:
Hypothesis: This line on the /school/ page might be breaking Google's willingness to consider each page distinct:
<link rel="canonical" href="https://collegescorecard.ed.gov/school/">
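If that hypothesis is right, a minimal sketch of the fix (assuming we keep the query-string URLs) would be to derive the canonical URL from the actual query string instead of hard-coding /school/; the `canonicalFor` helper name is hypothetical:

```javascript
// Hypothetical fix: make the canonical URL reflect the query string,
// so each school page declares itself (not the bare /school/) as canonical.
function canonicalFor(origin, pathname, search) {
  // With no query string, the bare /school/ page stays its own canonical.
  return origin + pathname + (search || "");
}

// In the page itself this would drive the <link rel="canonical"> tag:
//   document.querySelector('link[rel="canonical"]').href =
//     canonicalFor(location.origin, location.pathname, location.search);
```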
Hey @meiqimichelle, can we talk about that canonical link? I think @dnesting might be onto something.
This is ready to test once the changes in #1458 get merged to production.
So, school pages aren't being indexed yet. I have a feeling it's because of the way that we're formatting our URLs, e.g.:
/school/?449302-Daymar-College-Madisonville
This style doesn't "name" the URL parameter, and I have a feeling that that's causing Google to skip it.
Google Search Console (FKA "Webmaster Tools"?) has a section for telling the indexer about URL parameters, and I have a feeling that if we switched to using a named URL parameter like id:
/school/?id=449302-Daymar-College-Madisonville
then we could tell Google that this parameter changes the contents of the page, and force it to index each URL independently. @dnesting and @vgvg, does this sound right to you?
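If we did switch, the page would need to accept both URL styles during the transition. A sketch of what that parsing could look like (the function name is hypothetical, not existing site code):

```javascript
// Hypothetical parser accepting both the current unnamed style
// (?449302-Daymar-College-Madisonville) and a named ?id=... style.
function schoolIdFromQuery(search) {
  var query = search.replace(/^\?/, "");
  // Named form: ?id=449302-Daymar-College-Madisonville
  var named = query.match(/(?:^|&)id=([^&]+)/);
  var slug = named ? named[1] : query;
  // The numeric school id is everything before the first hyphen.
  var m = slug.match(/^(\d+)-/);
  return m ? m[1] : null;
}
```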
Another thing worth investigating is whether Google Search Console distinguishes between protocols (HTTP vs. HTTPS). Currently, the site is being "managed" under its http:// URL, and the sitemap console warns about redirects to the corresponding https:// URLs. Might this be an issue too?
@shawnbot This was something I considered too. The way I read things, this is a signal that the URL parameter uniquely identifies a page, not an instruction to consider variants of the page distinct. In other words, I don't know that we have evidence that the URL as it's written today is inadequate in that regard.
I see that the canonical change is in prod now and I'm going to poke around with the sitemap to see if I can't get it to try reindexing. Let me know if I'm wasting time by repeating stuff you've already done.
A search for site:collegescorecard.ed.gov now has a school that I explicitly submitted to the index via the "Fetch as Google" feature. This didn't happen before. I suspect this means our problem may be resolved and we just need to wait for Google to get around to reindexing the rest of the pages. I've resubmitted the sitemap in the hopes that that triggers a reindex.
Thanks, @dnesting! We might also consider having the school page redirect if it doesn't get a URL parameter, because this is confusing to see in the search results:
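A minimal sketch of that redirect, assuming a client-side check on the /school/ page (the helper name and the /search/ destination are guesses, not existing behavior):

```javascript
// Hypothetical guard for the /school/ page: with no school in the
// query string, send the visitor somewhere useful instead of
// showing (and letting Google index) an empty shell.
function redirectTargetFor(pathname, search) {
  var onSchoolPage = pathname === "/school/" || pathname === "/school";
  var hasSchool = !!(search && search.length > 1); // more than just "?"
  // Return null when no redirect is needed.
  return onSchoolPage && !hasSchool ? "/search/" : null;
}

// In the page:
//   var target = redirectTargetFor(location.pathname, location.search);
//   if (target) location.replace(target);
```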
reported as separate issue: https://github.com/18F/college-choice/issues/1479
@LisaGee @barkimedes Can we adjust the title to say something like: "As a Consumer, I should be able to search on Google for a specific school and see College Scorecard results"
acceptance:
@ultrasaurus not quite happening yet (womp womp :disappointed: ) We get some schools showing up under site:collegescorecard.ed.gov, but not all 4k.
I actually think that's a new ticket.
@barkimedes - Would you mind creating a new ticket?
Thanks.
@barkimedes should we close this ticket in favor of #1495?
Oops, thought I had. Yup. Closing.
We need to make sure the site is SEO-friendly:

- robots.txt
- verify that schools show up in Google search results (#1495 takes this over)