Closed jristau1984 closed 2 days ago
For courses.edx.org the robots.txt appears to be controlled by NGINX_ROBOT_RULES in Ansible vars.
The initial question looks like it is referring to the noindex tag, and not necessarily to robots.txt.
There are different levels of effect for different approaches:
robots.txt will stop crawlers going forward, but will not clean out existing indexed data. This would be a partial solution.
noindex is more comprehensive and faster to goal, as it will remove the pages from the index, which improves SEO ranking.
First step:
Update robots.txt to stop crawling these subdomains. A "complete disallow" would apply to all robots, which is the request. Replace what is there with:
User-agent: *
Disallow: /
Ping Christopher Sim to perform validations.
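As a rough aid for the validation step, the proposed rules can be checked locally with Python's standard `urllib.robotparser`. This is only a sketch: the helper name `is_blocked` and the sample URLs are made up for illustration, and it checks the rule text itself, not what is actually deployed.

```python
from urllib.robotparser import RobotFileParser

# The "complete disallow" rules proposed in this ticket.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical sample paths, used only to exercise the rules.
    for url in ("https://courses.edx.org/", "https://courses.edx.org/courses/"):
        print(url, "blocked:", is_blocked(ROBOTS_TXT, "Googlebot", url))
```

To check the live file instead, `RobotFileParser` also supports `set_url()` followed by `read()`.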
In parallel, discuss with Aurora and SRE about putting in noindex options, either in the MFE or at the CDN.
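For reference, a minimal sketch of how either noindex option could be detected in a response. It assumes the MFE option would appear as a `<meta name="robots">` tag in the page and the CDN option as an `X-Robots-Tag` response header. That mapping is an assumption, not something decided in this ticket, and `has_noindex` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag containing "noindex" was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def has_noindex(headers: dict, html: str) -> bool:
    """Return True if either the response headers or the HTML body signal noindex."""
    # Assumed CDN-level option: an X-Robots-Tag response header.
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    # Assumed MFE-level option: a robots meta tag in the page markup.
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```

The headers and body can come from any HTTP client; the function only inspects values it is given.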
Aurora has taken the Learning MFE portion of this work. courses.edx.org is in flight for today (Monday).
Scoped ticket back down to just noindex; robots.txt changes may happen in the future. LMS side of this is complete.
RV and SEO have approved this, and it was deployed to Prod.
After discussing this with the Open edX maintenance group, the proposal is to handle this with internal edx.org configuration.
From Chris Sim: