Would it be possible to merge this in time for the September crawl?
Hi @GJFR. In a discussion with @VictorLeP on Slack, he pointed me to this PR. I have updated a PR for robots.txt extraction to capture some data for next year's Web Almanac.
I have finalized the PR here and included test cases: https://github.com/HTTPArchive/legacy.httparchive.org/pull/236
The current robots.txt code in the well-known extraction script has two issues:

1. Grouped `user-agent` lines are not handled:

   ```
   # Works for this
   user-agent: Googlebot
   disallow: /login/

   # Does not work for this. Will attribute the 'login' match to only Bingbot.
   user-agent: Googlebot
   user-agent: Bingbot
   disallow: /login/
   ```

2. The keyword matching is substring-based, so it counts paths like `/book-cataloging/` (contains `login`) and `genuine-authentic-nike-shoes` (contains `auth`) as secure routes.

I could update my robots.txt script to have the following output. I have already addressed #1 in my PR. I think I can make #2 better, but probably not perfect.

```json
"by_useragent": {
  "*": {
    "allow": 1,
    "disallow": 45,
    "crawl_delay": 0,
    "noindex": 0,
    "well_known_disallows": 5,
    "other": 0
  },
  "baiduspider": {
    "allow": 0,
    "disallow": 163,
    "crawl_delay": 0,
    "noindex": 0,
    "well_known_disallows": 2,
    "other": 0
  },
  ...
}
```
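For illustration only, here is a rough sketch of how those per-user-agent counts could be produced. The function name `aggregateByUserAgent` and the keyword check are made up; the bucket keys just mirror the JSON above:

```js
// Sketch: aggregate robots.txt directives into counts per user agent.
// `lines` is the robots.txt body split into lines; names and buckets are illustrative.
function aggregateByUserAgent(lines, wellKnownKeywords) {
  const byUserAgent = {};
  let currentAgents = [];
  let lastRule = null;

  const bucketFor = (agent) => {
    if (!byUserAgent[agent]) {
      byUserAgent[agent] = { allow: 0, disallow: 0, crawl_delay: 0, noindex: 0, well_known_disallows: 0, other: 0 };
    }
    return byUserAgent[agent];
  };

  for (const rawLine of lines) {
    const line = rawLine.split('#')[0];  // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const rule = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (rule === 'user-agent') {
      // Consecutive user-agent lines form one group that shares the rules below it.
      currentAgents = lastRule === 'user-agent'
        ? [...currentAgents, value.toLowerCase()]
        : [value.toLowerCase()];
    } else if (['allow', 'disallow', 'crawl-delay', 'noindex'].includes(rule)) {
      for (const agent of currentAgents) {
        const counts = bucketFor(agent);
        counts[rule.replace('-', '_')] += 1;
        if (rule === 'disallow' &&
            wellKnownKeywords.some((kw) => value.toLowerCase().includes(kw))) {
          counts.well_known_disallows += 1;
        }
      }
    } else if (currentAgents.length > 0) {
      for (const agent of currentAgents) bucketFor(agent).other += 1;
    }
    lastRule = rule;
  }
  return { by_useragent: byUserAgent };
}
```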
I wanted to get a POV from the Security Team on whether we should consolidate or edit the well-known custom metric to correct some things.
Hi @jroakes. Thanks for the heads-up. As for the issues you mentioned:
1. This issue should indeed be corrected; I was not aware that the second example is a valid robots.txt configuration. Luckily, in the context of this year's Security chapter we can still use the data, since we're only interested in the endpoints that are covered.
2. This issue was also discussed in the original PR: https://github.com/HTTPArchive/legacy.httparchive.org/pull/219. It was decided to do the filtering at the querying stage, so that the data collected by the custom metric stays consistent over the years. Because, like you mentioned, this kind of filtering is difficult to get perfect.
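To sketch what filtering at the querying stage could look like (illustrative only; in practice the filtering happens in the queries over the crawl tables, and the names below are made up), the metric would keep the raw disallow values per user agent and the keyword filter would be applied afterwards:

```js
// Sketch: apply the keyword filter over already-collected values at analysis time,
// instead of inside the custom metric. All names here are illustrative.
const SECURITY_KEYWORDS = ['login', 'log-in', 'signin', 'sign-in', 'admin', 'auth', 'sso', 'account', 'secure'];

// `rawDisallows` would be the unfiltered { userAgent: [disallowValue, ...] } data from the metric.
function filterSecurityRelated(rawDisallows, keywords = SECURITY_KEYWORDS) {
  const kwRx = new RegExp(`\\W(${keywords.join('|')})`, 'i');
  const filtered = {};
  for (const [agent, values] of Object.entries(rawDisallows)) {
    const hits = values.filter((value) => kwRx.test(value));
    if (hits.length > 0) filtered[agent] = hits;
  }
  return filtered;
}
```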
Thanks @GJFR, I think you make a really good point for pt. 2 that I had not considered. FWIW, this is what I had tested, in case it is helpful. It solves pt. 1, and my solution for pt. 2 was to 1) include oauth in the keywords, and 2) only count keyword matches that are preceded by a non-word character.
This also includes a regex that matches whitespace variants before the colon, e.g.:

```
disallow : useragent
disallow: useragent
```
But it doesn't clean the line first to handle commented values (I did do this in robots_txt.js; see the sketch after the code below):

```
disallow: /path/ #This is a secure route
```
```js
// POC: collect disallow values that match auth-related keywords, grouped by user agent.
let text = $('html').textContent;  // text of the rendered robots.txt page

let data = {'matched_disallows': {}};

let KEYWORDS = [
  'login',
  'log-in',
  'signin',
  'sign-in',
  'admin',
  'auth',
  'oauth',
  'sso',
  'account',
  'secure'
];

let currUserAgents = [];
let last_rule = null;

// Match the rule name, allowing optional whitespace before the colon.
const rule_rx = new RegExp(`(user-agent|disallow)(?=\\s*:)`, 'gi');

// Only count keywords preceded by a non-word character
// (so '/login/' matches, but '/book-cataloging/' does not).
const kws = KEYWORDS.join('|');
const kw_rx = new RegExp(`[^\\w]+(${kws})`, 'gi');

for (let line of text.split('\n')) {
  let matched = line.toLowerCase().match(rule_rx);
  if (matched) {
    let rule = matched[0].trim();
    let value = line.slice(line.indexOf(':') + 1).trim();
    if (rule == 'user-agent') {
      // Consecutive user-agent lines form one group sharing the rules below them.
      if (last_rule == rule) {
        currUserAgents.push(value);
      } else {
        currUserAgents = [value];
      }
    } else if (rule == 'disallow') {
      if (value.match(kw_rx)) {
        // Attribute the matched disallow to every user agent in the current group.
        for (const currAgent of currUserAgents) {
          if (data['matched_disallows'][currAgent] === undefined) {
            data['matched_disallows'][currAgent] = [];
          }
          data['matched_disallows'][currAgent].push(value);
        }
      }
    }
    last_rule = rule;
  }
}

JSON.stringify(data);
```
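Since the POC above doesn't strip comments before matching, a trailing comment like the one shown earlier could still trigger a keyword hit. A minimal pre-cleaning step (just a sketch, not necessarily how robots_txt.js does it) could look like:

```js
// Sketch: drop everything from '#' onward before matching, so keyword hits
// aren't triggered by comment text like "#This is a secure route".
function stripComment(line) {
  const hashIndex = line.indexOf('#');
  return hashIndex === -1 ? line : line.slice(0, hashIndex);
}

// e.g. inside the loop above:
//   let cleaned = stripComment(line);
//   let matched = cleaned.toLowerCase().match(rule_rx);
//   let value = cleaned.slice(cleaned.indexOf(':') + 1).trim();
```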
Here is test output for LinkedIn: https://gist.github.com/jroakes/bb5ac67ee4584f58f5a7714d62605da1
I am not that great with JS, so I'm sharing this as a POC example only that can be improved.
Unfortunately, the `well-known.js` custom metric did not do so well in the previous crawls due to errors. If one `parseResponse` call fails, the whole metric fails. To solve this, I have separated the error handling for each call.

The metric would also fail when the site's policy wouldn't allow another page to be loaded (e.g. CORS). This is now also handled per `parseResponse` call. Since it is not possible to resolve the underlying cause of these errors, I just indicate that an error has occurred while fetching a URL.

Lastly, I also fixed a capitalisation issue for string comparisons in `/robots.txt`.

I also see a lot of 'The user aborted a request.' errors in our crawl data. Does this have to do with `fetchWithTimeout`?

WPT tests (focus on the errors in previous crawls)
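For context, here is a rough sketch of the per-call error handling described above. This is not the actual well-known.js code: `parseResponseSafe` and the signatures are made up, and it assumes `fetchWithTimeout` aborts via an AbortController, whose abort surfaces in Chrome as 'The user aborted a request.', which could explain those errors when the timeout fires:

```js
// Sketch only: fetch each well-known URL with a timeout and isolate errors per call,
// so one failing or blocked request doesn't break the whole metric.
function fetchWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs); // abort -> AbortError ('The user aborted a request.')
  return fetch(url, { signal: controller.signal }).finally(() => clearTimeout(timer));
}

async function parseResponseSafe(url, parser) {
  try {
    const response = await fetchWithTimeout(url);
    return await parser(response);
  } catch (e) {
    // Only record that this URL failed (timeout, CORS, parse error, ...);
    // the other calls still contribute their data.
    return { error: `Error while fetching ${url}` };
  }
}
```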