ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Support text file for sitemaps or general link extraction #269

Open chfoo opened 9 years ago

chfoo commented 9 years ago

The Sitemaps protocol allows a sitemap to be a plain text file that lists URLs (http://www.sitemaps.org/protocol.html#otherformats). Wpull should be able to recognize that such a destination is a list of URLs that is part of the sitemap.
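As a rough illustration of what reading such a file could involve, here is a minimal sketch, assuming the text format described on sitemaps.org (one fully qualified URL per line, UTF-8, no header or footer). The function name `parse_text_sitemap` is hypothetical and is not part of wpull's existing API.

```python
from urllib.parse import urlsplit


def parse_text_sitemap(text):
    """Yield absolute http(s) URLs from the body of a plain text sitemap.

    Hypothetical helper for illustration only; not a wpull function.
    """
    # Drop a leading byte order mark if the decoder left one in place.
    text = text.lstrip("\ufeff")
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # tolerate blank lines
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        # Keep only fully qualified http(s) URLs.
        if parts.scheme in ("http", "https") and parts.netloc:
            yield candidate


if __name__ == "__main__":
    sample = "http://www.example.com/catalog?item=1\nhttp://www.example.com/catalog?item=11\n"
    for url in parse_text_sitemap(sample):
        print(url)
```

Lines that are not fully qualified URLs are skipped silently here; a real implementation might prefer to log them.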

Alternatively, in the spirit of #74, it could scan text files in general for URLs and then run link extraction on them.
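The general approach could be sketched roughly as follows, assuming a simple regular expression is good enough to spot http(s) URLs in arbitrary text. The pattern, the trailing punctuation rule, and the name `extract_urls_from_text` are all assumptions made for illustration; none of them come from wpull.

```python
import re

# Assumed heuristic: a URL runs from the scheme up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
# Punctuation that usually belongs to the surrounding sentence, not to the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls_from_text(text):
    """Return unique http(s) URLs found in free-form text, in order of first appearance.

    Hypothetical helper for illustration only; not a wpull function.
    """
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    print(extract_urls_from_text("See http://example.com/a, and (https://example.org/b)."))
```

Stripping trailing punctuation is a trade-off: it avoids picking up sentence punctuation, but it also truncates the rare URL that legitimately ends in one of those characters.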

parulsethi commented 8 years ago

Also, it could provide an option to extract only the URLs of particular categories from a sitemap. For example, from a news website's sitemap, extract only the ones with a sports tag. And can I contribute to this one?