RTICWDT / college-scorecard

College Scorecard
https://collegescorecard.ed.gov/

As a Consumer, I should be able to find the PICC tool using a search engine. #496

Closed LisaGee closed 8 years ago

LisaGee commented 9 years ago

We need to make sure the site is SEO-friendly:

LisaGee commented 9 years ago

@shawnbot - got thoughts?

meiqimichelle commented 9 years ago

#132 was meant to address this.

shawnbot commented 9 years ago

One glaring issue here is that without at least some subset of the school data columns in this repo, we can't statically generate a sitemap for Google to crawl. If we had the school names and IDs only, we could at least create one that would list the URLs of each school.

dnesting commented 9 years ago

The only benefit I can see of a sitemap here is ensuring that all of the school-specific URLs are listed someplace so that they get indexed. Like @shawnbot says, without access to the data (at least enough to generate the URLs), we can't create a static page for this. Some thoughts:

  1. Make this dynamically-generated from the data
  2. Make this a static page checked in to GitHub that gets rebuilt somehow whenever the data changes

I also would need to know what the school "permalink" is. Is it just https://college-choice.18f.gov/school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health (with the expectation that the hostname will be different)?

shawnbot commented 9 years ago

Yeah @dnesting, I think we'd need to check in a CSV with just the name and id columns of the schools we care about. The permalink URLs could then be generated in Jekyll with something like:

/school/?{{ school.id }}-{{ school.name|replace:' ','-' }}
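For illustration, a minimal Ruby sketch (not the actual site code) of the same slug-building logic that the Jekyll template above expresses:

```ruby
# Hypothetical helper mirroring the Jekyll/Liquid template above:
# build a school permalink from the id and name columns by replacing
# spaces in the name with hyphens. Not the project's actual code.
def school_permalink(id, name)
  "/school/?#{id}-#{name.gsub(' ', '-')}"
end

puts school_permalink(117803, "Los Angeles County College of Nursing and Allied Health")
# → /school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health
```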

ultrasaurus commented 9 years ago

I think you want a query like: https://ccapi-dev.18f.gov/v1/schools?fields=id,school.name&per_page=10

LisaGee commented 9 years ago

@dnesting Do you have what you need? If you need one, @hollyallen can create a .csv file for you that has all the school names.

dnesting commented 9 years ago

I chatted with @ultrasaurus about this yesterday and I think the approach we're going to run with looks like this:

LisaGee commented 9 years ago

@dnesting Can you point @meiqimichelle to where the file is and she can deploy it into our repo. This should be done today if at all possible. Thanks.

dnesting commented 9 years ago

The file is generated by the make_sitemap.rb script in pull request #3. What we can do is finalize that PR as it stands (defer the copy-to-s3 logic until a later PR) just so that we have something checked in. I'll do that now. We also need a robots.txt change to point search engines to the sitemap.xml URL that we end up using. Those last two pieces (what bucket/name to use for the file, and what the user-visible URL is corresponding to that) were all that we were waiting on for this.
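As a rough illustration of what a make_sitemap.rb-style script does, here is a self-contained Ruby sketch that emits a sitemap.xml from (id, name) pairs. The base URL, field names, and slug format are assumptions for the example, not the actual script from PR #3:

```ruby
require "cgi"

# Assumed final hostname; the real script may target a different bucket/URL.
BASE = "https://collegescorecard.ed.gov"

# Build a sitemap.xml document from an array of [id, name] pairs,
# using the same id-Name-With-Hyphens slug style as the school pages.
def sitemap_xml(schools)
  urls = schools.map do |id, name|
    loc = "#{BASE}/school/?#{id}-#{name.gsub(' ', '-')}"
    "  <url><loc>#{CGI.escapeHTML(loc)}</loc></url>"
  end
  ['<?xml version="1.0" encoding="UTF-8"?>',
   '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
   *urls,
   '</urlset>'].join("\n")
end

puts sitemap_xml([[166027, "Harvard University"]])
```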

meiqimichelle commented 9 years ago

@dnesting Does it matter where we put the sitemap.xml, or do we just need a place to host a file? If we just need a place, can we use the S3 bucket that we're using to host the full data dump and small sample CSVs? If yes, @diego- knows how to put files there (where it is, etc.).

I can change the robots.txt after we have the URL. Do you know the proper robots.txt syntax, or should I investigate?

LisaGee commented 9 years ago

Generally, I’ve always placed the robots.txt file in the root directory of the website, which I still believe will also be in an S3 bucket. Can we do that?

meiqimichelle commented 9 years ago

@LisaGee I'm not worried about the robots.txt placement -- it will go with the rest of the site, and it is already in root. I'm confused about the XML file, then... I was thinking we wanted it somewhere else because it is big, and maybe we wouldn't want to push it with the rest of the app every time we update the site? I suppose that doesn't mean we can't place it in the same S3 bucket. In any case, for now I'd rather stick it somewhere so we can close this issue. Putting it with the data files seems OK to me. Maybe that's the S3 bucket we're intending to use for the site in the long run anyway.

LisaGee commented 9 years ago

Got it. I think I've typically put the sitemap.xml in root too, but I defer to anyone else who has a strong POV.

meiqimichelle commented 9 years ago

Agree. What do you think, @dnesting? Can we just put sitemap.xml at root?

dnesting commented 9 years ago

Ack, sorry for not responding on this earlier. The placement of sitemap.xml doesn't matter because robots.txt will contain a link to it. So put it wherever it's convenient. I sent the robots.txt directive in e-mail but I should capture it here:

Sitemap: https://whatever/sitemap.xml

The robots.txt file must be placed at the root of the final web site (http://collegescorecard.ed.gov/robots.txt).
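Putting those pieces together, a complete robots.txt at the site root could look like the fragment below (the exact sitemap URL is an assumption; substitute whatever final location gets chosen):

```
User-agent: *
Allow: /
Sitemap: https://collegescorecard.ed.gov/sitemap.xml
```

The Sitemap directive may point at any URL, which is why the sitemap.xml itself can live wherever is convenient, while robots.txt itself must sit at the site root.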

ultrasaurus commented 9 years ago

You can search and find the site, but Google can't find specific pages.

ultrasaurus commented 9 years ago

The URL appears to be rendered incorrectly by the Google search robot, as seen here:

[screenshot: Google fetch showing the malformed URL]

dnesting commented 9 years ago

@ultrasaurus How did you get to that point? As near as I can tell the sitemap does not result in a request for the erroneous https://collegescorecard.ed.gov/https://collegescorecard.ed.gov/school/?166027-Harvard-University. Did you arrive at that test from a sitemap link or is it possible you just pasted a full URL into the "render as google" form that is expecting a path?

As to the actual problem (pages aren't appearing in the Google index), I suspect this is due to the need for an API call to render the content of the page; a naive crawl of the page will see the same content regardless of query string. I thought Google was doing a more complex rendering of pages so as to see this content. It's possible this more expensive rendering process hasn't happened yet. I'm going to try reading up some more about what we should expect to happen here.

dnesting commented 9 years ago

Also note for the Webmaster Tools we probably want to add the HTTPS url, not (just) the HTTP URL. Google considers them wholly different sites. @meiqimichelle

ultrasaurus commented 9 years ago

@dnesting good catch on http.

@meiqimichelle can you add us to the https version of the site?

ultrasaurus commented 9 years ago

@dnesting you are correct -- copy/paste error on my part; now I'm seeing the redirect to https.

barkimedes commented 8 years ago

@LisaGee It seems like all the stuff in this issue has been done, but I'm not 100% sure. Can you verify and (hopefully) close the issue? Thanks!

dnesting commented 8 years ago

I don't think this is working as expected yet. Google has only indexed the (~empty) /school/ page and has not indexed the individual schools beneath it. I believe this is chiefly because the page content is generated entirely with Javascript, which is something search engines have traditionally eschewed. Google effectively sees them all as identical, because the page itself is identical.

However, I am aware that search engines have started rendering Javascript-based pages and I'm hoping we can still make this work. This morning I went in and re-submitted the sitemap under the https:// site in the hopes that this will get Google to try harder with these URLs, and I used the "Fetch as Google" feature to retrieve https://www.google.com/webmasters/tools/googlebot-fetch-details?hl=en&authuser=2&siteUrl=https://collegescorecard.ed.gov/&path=school/?117803-Los-Angeles-County-College-of-Nursing-and-Allied-Health&timestamp=1447092565067, which succeeded and displayed a properly rendered page. I'm hopeful but we'll need a few days maybe to see any results.

barkimedes commented 8 years ago

@dnesting ah thanks! effective and informative :+1: :us:

shawnbot commented 8 years ago

Yeah, I keep hearing about how Google is indexing dynamic content, but this doesn't appear to be the case for our school pages and I don't know how to ensure that it does.

Here's a kind of crazy thought: could we generate static pages for each school with Jekyll using a subset of the CSV data as input? It'd take a long time to build, and it would require some work translating the data, but it's certainly possible.

shawnbot commented 8 years ago

Another way of putting it would be: If indexing the school pages turns out to be impossible with our current approach, then we might consider moving this to an entirely static model. That would mean eliminating a lot of JavaScript, but I actually kind of see that as a good thing for the long-term health and maintainability of the project.

Discuss! :trollface:

dnesting commented 8 years ago

Hypothesis: This line on the /school/ page might be breaking Google's willingness to consider each page distinct:

<link rel="canonical" href="https://collegescorecard.ed.gov/school/">

shawnbot commented 8 years ago

Hey @meiqimichelle, can we talk about that canonical link? I think @dnesting might be onto something.

shawnbot commented 8 years ago

This is ready to test once the changes in #1458 get merged to production.

shawnbot commented 8 years ago

So, school pages aren't being indexed yet. I have a feeling it's because of the way that we're formatting our URLs, e.g.:

/school/?449302-Daymar-College-Madisonville

This style doesn't "name" the URL parameter, and I have a feeling that that's causing Google to skip it. Google Search Console (FKA "Webmaster Tools"?) has a section for telling the indexer about URL parameters, and I have a feeling that if we switched to a named URL parameter like id:

/school/?id=449302-Daymar-College-Madisonville

then we could tell Google that this parameter changes the contents of the page, and force it to index each URL independently. @dnesting and @vgvg, does this sound right to you?

Another thing worth investigating is whether Google Search Console distinguishes between protocols (HTTP vs. HTTPS). Currently, the site is being "managed" under its http:// URL, and the sitemap console warns about redirects to the corresponding https:// URLs. Might this be an issue too?

dnesting commented 8 years ago

@shawnbot This was something I considered too. The way I read things, this is a signal that the URL parameter uniquely identifies a page, not an instruction to consider variants of the page distinct. In other words, I don't know that we have evidence that the URL as it's written today is inadequate in that regard.

I see that the canonical change is in prod now and I'm going to poke around with the sitemap to see if I can't get it to try reindexing. Let me know if I'm wasting time by repeating stuff you've already done.

dnesting commented 8 years ago

A search for site:collegescorecard.ed.gov now has a school that I explicitly submitted to the index via the "Fetch as Google" feature. This didn't happen before. I suspect this means our problem may be resolved and we just need to wait for Google to get around to reindexing the rest of the pages. I've resubmitted the sitemap in the hopes that that triggers a reindex.

shawnbot commented 8 years ago

Thanks, @dnesting! We might also consider having the school page redirect if it doesn't get a URL parameter, because this is confusing to see in the search results:

[screenshot: search result for the empty /school/ page]

reported as separate issue: https://github.com/18F/college-choice/issues/1479

ultrasaurus commented 8 years ago

@LisaGee @barkimedes Can we adjust the title to say something like: "As a Consumer, I should be able to search on Google for a specific school and see College Scorecard results"

acceptance:

barkimedes commented 8 years ago

@ultrasaurus not quite happening yet (womp womp :disappointed: ) We get some schools showing up under site:collegescorecard.ed.gov, but not all 4k.

LisaGee commented 8 years ago

I actually think that's a new ticket.

@barkimedes - Would you mind creating a new ticket?

Thanks.

barkimedes commented 8 years ago

#1495 created.

abisker commented 8 years ago

@barkimedes should we close this ticket in favor of #1495?

barkimedes commented 8 years ago

Oops, thought I had. Yup. Closing.