edx / edx-arch-experiments

A plugin to include applications under development by the architecture team at edx
GNU Affero General Public License v3.0

Prevent web crawlers from indexing courses.edx.org and learning.edx.org sub-domains #852

Closed jristau1984 closed 2 days ago

jristau1984 commented 1 week ago

Hello! I'm writing because I'm chasing down the answer to the following question: can we request that account-locked subdomains have a noindex tag? For context, when we've run ScreamingFrog crawls with no depth limits, the subdomains show up and raise indexing flags, particularly https://learning.edx.org/ and https://courses.edx.org/. The URLs that come up are account-locked pages, but they're indexed on the SERP despite being inaccessible. Chris Sim and his manager flagged these in a report she wrote; they recommended ensuring that these subdomains be tagged with noindex, since they shouldn't be rankable.

After discussing this with the Open edX maintenance group, the proposal is to handle this with internal edx.org configuration.

From Chris Sim:

It would be good for this to happen before the 19th, but not enough to spike it into a current sprint if it's too disruptive. It would still need to get done even after the edx website migration, because RV still won't have subdomain control at that point.

timmc-edx commented 1 week ago

For courses.edx.org the robots.txt appears to be controlled by NGINX_ROBOT_RULES in Ansible vars.

robrap commented 1 week ago

The initial question looks like it is referring to the noindex tag, and not necessarily to robots.txt.

jristau1984 commented 1 week ago

There are different levels of effect for different approaches:

  1. robots.txt will stop crawlers going forward, but will not clean out existing indexed data. This would be a partial solution.
  2. noindex is more comprehensive and gets to the goal faster, as it removes the pages from the search index, which improves SEO ranking.
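
For anyone validating the noindex side, here is a minimal sketch of how a response could be checked for a noindex signal, either in an `X-Robots-Tag` response header or in a `<meta name="robots">` tag. The function and class names are hypothetical, and the check is simplified: it ignores user-agent-scoped header values such as `googlebot: noindex`. It is not how the LMS or the Learning MFE actually implemented the change.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives.extend(
                part.strip().lower() for part in content.split(",")
            )


def has_noindex(headers, html):
    """Return True if the headers or the HTML body carry a noindex directive.

    `headers` is a plain dict of response headers; `html` is the page body.
    Both names are assumptions made for this sketch.
    """
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [p.strip().lower() for p in header_value.split(",")]

    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives
```

The header route suits a CDN or web-server level change, while the meta tag route suits a change inside the MFE's HTML template; the helper accepts either.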

First step:

  1. Update robots.txt to stop crawling these subdomains. "complete disallow" would apply to all robots, which is the request. Replace what is there with:

    User-agent: *
    Disallow: /

    Ping Christopher Sim to perform validations.

  2. In parallel, discuss with Aurora and SRE about adding noindex options, either in the MFE or at the CDN.
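
As a rough aid for the validation step, the proposed rules can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the helper name `is_crawlable` and the `/dashboard` path are hypothetical, and it tests the rule text itself rather than what the live servers return.

```python
from urllib.robotparser import RobotFileParser

# The "complete disallow" rules proposed in step 1.
PROPOSED_RULES = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical path, used only for illustration.
    url = "https://courses.edx.org/dashboard"
    print(is_crawlable(PROPOSED_RULES, url, "Googlebot"))
```

One thing worth keeping in mind when combining the two steps: a crawler that obeys a full `Disallow: /` never fetches the page, so it cannot see a noindex tag or header on it. Google's documentation notes that noindex only takes effect on pages that are not blocked by robots.txt.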

jristau1984 commented 4 days ago

Aurora has taken the Learning MFE portion of this work. courses.edx.org is in flight for today (Monday).

timmc-edx commented 3 days ago

Scoped ticket back down to just noindex; robots.txt changes may happen in the future. LMS side of this is complete.

jristau1984 commented 2 days ago

RV and SEO have approved this, and it was deployed to Prod.