[FEATURE REQUEST] Adding `sitemap.xml` for link extraction

epi052 / feroxbuster

A fast, simple, recursive content discovery tool written in Rust.

https://epi052.github.io/feroxbuster/

MIT License

5.72k stars 475 forks

[FEATURE REQUEST] Adding `sitemap.xml` for link extraction #997

Open n3rada opened 10 months ago

n3rada commented 10 months ago

Hi there! 👋

Thanks for all your work on feroxbuster. I may have missed something, but it seems to me that link extraction takes `robots.txt` into account but not `sitemap.xml`. 🤔

That would be "simple" and interesting to add, wouldn't it?

epi052 commented 10 months ago

Hey there, sorry, I forgot to reply to this one. Parsing the sitemap would absolutely be an interesting feature to add. I see you submitted a PR 🎉, I'll be able to take a look soon. Thank you!

epi052 commented 10 months ago

Ok, adding some thoughts here on how to go about parsing sitemaps.

- there are various sitemap formats:
  - traditional xml
  - rss feed (still xml, different tags)
  - plaintext
- sitemaps can be located inside any directory on the webserver. when located in a non-root directory, they only describe locations within that directory
- urls in a sitemap are entity escaped (i.e. " becomes &quot;)
- sitemaps may be gzipped if they're huge, but have a max uncompressed size of 50MB/50,000 URLs
- a sitemap may be a sitemap index. a sitemap index points to multiple sitemaps

It may seem like a lot, but if we're going to add sitemap parsing, I'd like to get the majority of the cases above handled.
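
To make the cases above more concrete, here is a minimal std-only Rust sketch of how they might be told apart and handled. It is not feroxbuster's code: `SitemapKind`, `classify`, `unescape`, `extract_locs`, and `in_scope` are all hypothetical names. The string-based tag scanning stands in for a real XML parser. Gzip decompression and RSS link extraction are only detected, not implemented.

```rust
// Hypothetical helpers for illustration only; none of this is feroxbuster's API.

/// URL cap quoted in the list above (50,000 URLs per sitemap).
const MAX_SITEMAP_URLS: usize = 50_000;

#[derive(Debug, PartialEq)]
enum SitemapKind {
    Gzipped,   // must be decompressed before any further parsing
    Index,     // <sitemapindex>: each <loc> points at another sitemap
    UrlSet,    // <urlset>: each <loc> is a page URL
    Rss,       // RSS feed: still XML, but different tags (not parsed here)
    PlainText, // assumed fallback: one URL per line
}

/// Guess the sitemap flavor from the raw response body.
fn classify(body: &[u8]) -> SitemapKind {
    // 0x1f 0x8b is the gzip magic number.
    if body.starts_with(&[0x1f, 0x8b]) {
        return SitemapKind::Gzipped;
    }
    let text = String::from_utf8_lossy(body);
    if text.contains("<sitemapindex") {
        SitemapKind::Index
    } else if text.contains("<urlset") {
        SitemapKind::UrlSet
    } else if text.contains("<rss") {
        SitemapKind::Rss
    } else {
        SitemapKind::PlainText
    }
}

/// Undo the five XML entity escapes a sitemap URL may contain.
fn unescape(s: &str) -> String {
    // &amp; goes last so that "&amp;lt;" yields "&lt;" rather than "<".
    s.replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
}

/// Collect the contents of every <loc>...</loc> pair, unescaped and trimmed.
fn extract_locs(xml: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<loc>") {
        let after = &rest[start + "<loc>".len()..];
        let end = match after.find("</loc>") {
            Some(end) => end,
            None => break, // unterminated tag: stop rather than guess
        };
        urls.push(unescape(after[..end].trim()));
        if urls.len() >= MAX_SITEMAP_URLS {
            break;
        }
        rest = &after[end + "</loc>".len()..];
    }
    urls
}

/// A sitemap only describes locations under its own directory, so keep a
/// candidate only if it starts with that directory (naive string prefix).
fn in_scope(sitemap_url: &str, candidate: &str) -> bool {
    match sitemap_url.rfind('/') {
        Some(idx) => candidate.starts_with(&sitemap_url[..=idx]),
        None => false,
    }
}
```

In this sketch, a caller would run `classify` first, recurse into each `<loc>` when the result is `Index`, and for a `UrlSet` pass each URL from `extract_locs` through `in_scope` before queuing it.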

n3rada commented 10 months ago

I clearly underestimated the sitemap.xml then. I guess I'm not going to be very useful in this job after all. But I think it would be amazing to have this feature as well! 😊

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

L1-0 commented 3 months ago

This would be a great feature!