fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application
https://selfoss.aditu.de
GNU General Public License v3.0

Respecting robots.txt files #1496

Closed TechnologyClassroom closed 3 months ago

TechnologyClassroom commented 3 months ago

I looked and I did not see anything about robots.txt files in the issues.

I see web traffic on one of the servers I manage claiming to be a selfoss instance which is requesting /wiki/Special: pages. Our robots.txt file explicitly disallows robots from scraping those pages.

Disallow: /wiki/Special:

Is this an issue with selfoss or is this not a selfoss instance?

I would be happy to supply some redacted logs over email if it would help.

Edit: Replacing scraping with requesting.

jtojnar commented 3 months ago

Hi, as a feed reader, selfoss does not crawl the web – it only periodically fetches the URLs of the feeds that the user provides. As such, I would say following robots.txt makes only slightly more sense than it would for a read-it-later app or a web browser.

So if selfoss is hitting a page, it most likely means a user configured it to do so.

There is also a chance that a user specified your homepage as the source URL and, since it is not a feed, the SimplePie library's smart feed discovery picked a special link from the page for some reason.
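(SimplePie's actual discovery logic is more involved, but the idea is that when the given URL is an HTML page rather than a feed, the library looks for <link rel="alternate"> feed hints in the page's head. A minimal sketch of that idea in Python's stdlib, with a hypothetical MediaWiki-style page; the href value here is made up for illustration:)

```python
from html.parser import HTMLParser

class FeedLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> feed URLs advertised by an HTML page."""
    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel", "").lower() == "alternate" and a.get("type") in self.FEED_TYPES:
            self.feeds.append(a.get("href"))

# Hypothetical page: the advertised feed lives under /wiki/Special:
html = """<html><head>
<link rel="alternate" type="application/rss+xml"
      href="/wiki/Special:RecentChanges?feed=rss">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(html)
print(finder.feeds)  # ['/wiki/Special:RecentChanges?feed=rss']
```

So a reader pointed at the wiki homepage could legitimately end up polling a Special: URL without the user ever typing one.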

Feel free to send me the logs to jtojnar@gmail.com, I can take a look.

TechnologyClassroom commented 3 months ago

Ah, I see the pattern. It is one person making these two requests repeatedly, at 30-minute or hourly intervals.

directory.fsf.org:80 REDACTED - - [08/Aug/2024:19:30:01 -0400] "GET /wiki/Special:Ask/-5B-5BLast-20review-20date::%2B-5D-5D/format%3Drss/sort%3DLast-20review-20date/order%3Ddescending/searchlabel%3DRecent-20updates-20RSS-20feed/title%3DFree-20Software-20Directory/description%3DRecent-20updates-20to-20Free-20Software-20Directory-20(directory.fsf.org)/offset%3D0 HTTP/1.1" 301 1237 "http://directory.fsf.org/wiki/Special:Ask/-5B-5BLast-20review-20date::%2B-5D-5D/format%3Drss/sort%3DLast-20review-20date/order%3Ddescending/searchlabel%3DRecent-20updates-20RSS-20feed/title%3DFree-20Software-20Directory/description%3DRecent-20updates-20to-20Free-20Software-20Directory-20(directory.fsf.org)/offset%3D0" "Selfoss/2.19 (+https://selfoss.aditu.de)"
directory.fsf.org:80 REDACTED - - [08/Aug/2024:19:30:08 -0400] "GET /wiki/Special:Ask/-5B-5BSubmitted-20date::+-5D-5D/format=rss/sort=Submitted-20date/order=descending/searchlabel=New-20packages-20RSS-20feed/title=Free-20Software-20Directory/description=Recent-20updates-20to-20Free-20Software-20Directory-20%28directory.fsf.org%29 HTTP/1.1" 301 1199 "http://directory.fsf.org/wiki/Special:Ask/-5B-5BSubmitted-20date::+-5D-5D/format=rss/sort=Submitted-20date/order=descending/searchlabel=New-20packages-20RSS-20feed/title=Free-20Software-20Directory/description=Recent-20updates-20to-20Free-20Software-20Directory-20%28directory.fsf.org%29" "Selfoss/2.19 (+https://selfoss.aditu.de)"

I mistook it for crawling. That's fine at that scale. If your program gets really popular, I'll come back to request a robots.txt file feature.

Thanks for getting back to me! Closing the issue.