Open marked opened 5 years ago
$ grep 180775611221 wget.log
2018-12-10 23:30:54 URL:http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour [93135] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:31:59 URL:http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour [93195] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:32:07 URL:https://www.tumblr.com/oembed/1.0?url=http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour [912] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:32:07 URL:http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour/amp [41781] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:32:07 URL:http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour [93371] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:32:08 URL:http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour?route=%2Fpost%2F%3Aid%2F%3Asummary [93666] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
2018-12-10 23:32:14 URL:http://9volt-art.tumblr.com/notes/180775611221/cJDy1hC8G?from_c=1543890788 [13070] -> "/mnt/crawls/tumblr/f28e181/cmp/kiska/OR/tumblr-grab/data/1544506251b631f5e9ee388dfb-1/tumblr-blog_9volt-art/wget.tmp" [1]
Kinda looks like the last one was generated from the /amp page. URLs matching /amp$ should be blocked.
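For illustration, a filter along those lines might look like the sketch below. It is written in Python and is not the grab project's actual filter code; the name `should_skip` and the choice to test only the URL path (ignoring any query string) are assumptions.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern, taken from the /amp$ suggestion above.
AMP_PATH = re.compile(r"/amp$")

def should_skip(url: str) -> bool:
    """Return True when the URL's path ends in /amp (query string ignored)."""
    return AMP_PATH.search(urlsplit(url).path) is not None
```

Matching on the parsed path rather than the raw URL means a query string appended after /amp would not let the URL slip past the filter.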
Edit: Kiska requested the referrer to be examined:
Referer: http://9volt-art.tumblr.com/
Referer: http://9volt-art.tumblr.com/rss
Referer: http://9volt-art.tumblr.com/post/180775611221/jargwellprescott-staff-can-i-get-an-extra-hour
1) looks legitimate. 2) /rss implies only the first few posts are affected, but /rss could be blocked. 3) is a mystery.
This can happen due to redirects: the page a redirect lands on is not checked against what has already been crawled.
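A rough sketch of the missing check, in Python rather than the crawler's own code. Everything here is hypothetical: `crawl_once`, the `seen` set, and the `resolve` callback (assumed to return the final URL after following redirects) are stand-ins, and a real crawler only learns the redirect target once the response arrives.

```python
from typing import Callable, Optional, Set

def crawl_once(url: str, seen: Set[str],
               resolve: Callable[[str], str]) -> Optional[str]:
    """Return the final URL to store, or None if it was already crawled.

    `resolve` is a hypothetical helper that follows redirects and
    returns the URL of the page that was finally served.
    """
    if url in seen:
        return None
    seen.add(url)
    final = resolve(url)
    if final != url:
        # The step described as missing: check the redirect target too.
        if final in seen:
            return None
        seen.add(final)
    return final
```

With this check, two different URLs that redirect to the same post would yield only one stored copy.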
The number of repeats is greater than a single redirect would explain. Still seeing 2-3 results with a 200 status.
I could be reading this wrong, but it looks like the same post was crawled 3x.
https://gist.github.com/marked/1da8a0c95ddf2fb93714d9ff1ca212c4
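To count repeats across the whole log instead of grepping one post ID at a time, something like the following Python sketch could help. It assumes every line follows the `date time URL:<url> [size] -> "file" [n]` shape seen in the grep output above; `repeated_urls` is a made-up helper name.

```python
import re
from collections import Counter

# Assumed line shape: 'DATE TIME URL:<url> [size] -> "file" [n]'
LINE = re.compile(r"^\S+ \S+ URL:(\S+) \[\d+(?:/\d+)?\] -> ")

def repeated_urls(lines):
    """Map each URL logged more than once to the number of times it appears."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[m.group(1)] += 1
    return {url: n for url, n in counts.items() if n > 1}
```

Usage would be along the lines of `repeated_urls(open("wget.log"))`. URLs that differ only in their query string are counted separately, so the `?route=` variant would not be folded into the plain post URL.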