ScottMansfield / widow
Distributed, asynchronous web crawler
License: GNU Lesser General Public License v2.1
26 stars, 4 forks

Issues

#19 Throttle requests to a domain by total bandwidth in a specified period of time (ScottMansfield, opened 9 years ago, 0 comments)

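A sketch of one way to read this: track response bytes per domain in a fixed window and hold further requests once the budget is spent. The listing does not show widow's implementation language, so this and the sketches below use Go for illustration, and all package, type, and function names are hypothetical rather than taken from the project.

```go
package throttle

import (
	"sync"
	"time"
)

// BandwidthThrottle caps the number of response bytes fetched from a single
// domain within a fixed window. Names and structure are illustrative only.
type BandwidthThrottle struct {
	mu          sync.Mutex
	limitBytes  int64
	window      time.Duration
	used        map[string]int64     // bytes consumed per domain in the current window
	windowStart map[string]time.Time // when the current window began, per domain
}

func New(limitBytes int64, window time.Duration) *BandwidthThrottle {
	return &BandwidthThrottle{
		limitBytes:  limitBytes,
		window:      window,
		used:        make(map[string]int64),
		windowStart: make(map[string]time.Time),
	}
}

// Allow reports whether a fetch from domain may proceed right now; the caller
// should back off and retry later when it returns false.
func (b *BandwidthThrottle) Allow(domain string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	if start, ok := b.windowStart[domain]; !ok || now.Sub(start) >= b.window {
		b.windowStart[domain] = now
		b.used[domain] = 0
	}
	return b.used[domain] < b.limitBytes
}

// Record adds the size of a completed response to the domain's running total.
func (b *BandwidthThrottle) Record(domain string, n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used[domain] += n
}
```
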
#18 All outbound requests should have a User-Agent attached to them (ScottMansfield, opened 9 years ago, 0 comments)

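A minimal sketch of attaching a User-Agent to every outbound request by wrapping the HTTP transport, so no code path can forget to set it; the UA string is made up.

```go
package client

import "net/http"

// uaTransport sets a User-Agent header on every outbound request so the
// crawler never sends the library default.
type uaTransport struct {
	base      http.RoundTripper
	userAgent string
}

func (t *uaTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone before mutating: RoundTrippers must not modify the caller's request.
	r := req.Clone(req.Context())
	r.Header.Set("User-Agent", t.userAgent)
	return t.base.RoundTrip(r)
}

// NewClient returns an http.Client whose requests always carry the crawler's UA.
func NewClient() *http.Client {
	return &http.Client{
		Transport: &uaTransport{
			base:      http.DefaultTransport,
			userAgent: "widow-crawler/0.1 (+https://github.com/ScottMansfield/widow)",
		},
	}
}
```
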
#17 Exclude javascript: links (ScottMansfield, closed 9 years ago, 0 comments)

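The fix presumably amounts to a scheme check on extracted hrefs, so javascript: pseudo-links never reach the fetch queue; a small predicate along these lines (name is illustrative):

```go
package links

import "strings"

// IsJavaScriptHref reports whether an extracted href is a javascript: pseudo-URL,
// e.g. href="javascript:void(0)", which should be dropped during extraction.
func IsJavaScriptHref(href string) bool {
	return strings.HasPrefix(strings.ToLower(strings.TrimSpace(href)), "javascript:")
}
```
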
#16 Links with rel="nofollow" should not be followed (ScottMansfield, closed 9 years ago, 0 comments)

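A sketch of honoring rel="nofollow" during link extraction, using golang.org/x/net/html to walk the anchors; widow's actual extraction stage may be structured quite differently.

```go
package links

import (
	"io"
	"strings"

	"golang.org/x/net/html"
)

// FollowableHrefs returns the hrefs of <a> tags that do not carry rel="nofollow".
func FollowableHrefs(r io.Reader) ([]string, error) {
	doc, err := html.Parse(r)
	if err != nil {
		return nil, err
	}
	var out []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			var href string
			nofollow := false
			for _, a := range n.Attr {
				switch strings.ToLower(a.Key) {
				case "href":
					href = a.Val
				case "rel":
					// rel is a space-separated token list, e.g. rel="external nofollow".
					for _, tok := range strings.Fields(strings.ToLower(a.Val)) {
						if tok == "nofollow" {
							nofollow = true
						}
					}
				}
			}
			if href != "" && !nofollow {
				out = append(out, href)
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return out, nil
}
```
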
#15 Check for robots meta tag while parsing a page (ScottMansfield, opened 9 years ago, 1 comment)

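A sketch of reading the robots meta tag during parsing, so a page can opt out of indexing or link-following; the struct and function names are assumptions, not widow's.

```go
package links

import (
	"io"
	"strings"

	"golang.org/x/net/html"
)

// RobotsMeta holds the directives a page can set via <meta name="robots" ...>.
type RobotsMeta struct {
	NoIndex  bool // do not store or index this page's content
	NoFollow bool // do not queue any links found on this page
}

// ParseRobotsMeta scans a page for <meta name="robots"> and reports its directives.
func ParseRobotsMeta(r io.Reader) (RobotsMeta, error) {
	var meta RobotsMeta
	doc, err := html.Parse(r)
	if err != nil {
		return meta, err
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "meta" {
			var name, content string
			for _, a := range n.Attr {
				switch strings.ToLower(a.Key) {
				case "name":
					name = strings.ToLower(a.Val)
				case "content":
					content = strings.ToLower(a.Val)
				}
			}
			if name == "robots" {
				for _, tok := range strings.Split(content, ",") {
					switch strings.TrimSpace(tok) {
					case "noindex":
						meta.NoIndex = true
					case "nofollow":
						meta.NoFollow = true
					case "none":
						meta.NoIndex, meta.NoFollow = true, true
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return meta, nil
}
```
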
#14 Syntax highlighting for original content returned from the site (ScottMansfield, closed 9 years ago, 0 comments)

#13 Add support for @import in CSS files and inline <style> contents (ScottMansfield, opened 9 years ago, 0 comments)

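Stylesheets can pull in further stylesheets, so @import targets need to be queued like any other discovered link. A rough regex-based sketch (a real CSS tokenizer would be sturdier):

```go
package links

import "regexp"

// importRe matches both common forms of the rule:
//   @import "theme.css";
//   @import url(theme.css);
var importRe = regexp.MustCompile(`(?i)@import\s+(?:url\(\s*)?['"]?([^'")\s;]+)`)

// CSSImports returns the URLs referenced by @import rules in a stylesheet or
// inline <style> block.
func CSSImports(css string) []string {
	var urls []string
	for _, m := range importRe.FindAllStringSubmatch(css, -1) {
		urls = append(urls, m[1])
	}
	return urls
}
```
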
#12 Add ability to restrict crawling to a single domain or domain / sub-domains (ScottMansfield, closed 9 years ago, 1 comment)

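A sketch of a scope check that accepts either a single host or a domain plus its subdomains; the function name and flag are hypothetical.

```go
package scope

import (
	"net/url"
	"strings"
)

// InScope reports whether rawURL belongs to the allowed domain. When
// includeSubdomains is true, "blog.example.com" matches "example.com";
// otherwise only the exact host is crawled.
func InScope(rawURL, allowedDomain string, includeSubdomains bool) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	allowed := strings.ToLower(allowedDomain)
	if host == allowed {
		return true
	}
	return includeSubdomains && strings.HasSuffix(host, "."+allowed)
}
```
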
#11 mailto:, phone:, etc break parsing (ScottMansfield, opened 9 years ago, 0 comments)

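One plausible remedy is an allowlist of fetchable schemes applied before hrefs reach the resolver, so mailto:, tel:, and similar links are dropped instead of breaking later stages; a sketch:

```go
package links

import (
	"net/url"
	"strings"
)

// Fetchable reports whether an href should be resolved and queued. Anything
// that is not plain http/https (mailto:, tel:, data:, ...) is discarded here.
func Fetchable(href string) bool {
	u, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return false
	}
	switch strings.ToLower(u.Scheme) {
	case "", "http", "https": // "" covers relative links, which resolve against the page URL
		return true
	default:
		return false
	}
}
```
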
#10 Implement rate-limiting on a per-host basis (ScottMansfield, opened 9 years ago, 1 comment)

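A sketch of per-host rate limiting with one token bucket per host, here built on golang.org/x/time/rate; the rates are placeholders, not values from widow.

```go
package throttle

import (
	"context"
	"sync"

	"golang.org/x/time/rate"
)

// HostLimiter keeps one token-bucket limiter per host so a busy site cannot be
// hammered while others sit idle.
type HostLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func NewHostLimiter(perSec float64, burst int) *HostLimiter {
	return &HostLimiter{
		limiters: make(map[string]*rate.Limiter),
		perSec:   rate.Limit(perSec),
		burst:    burst,
	}
}

// Wait blocks until the next request to host is allowed (or ctx is cancelled).
func (h *HostLimiter) Wait(ctx context.Context, host string) error {
	h.mu.Lock()
	l, ok := h.limiters[host]
	if !ok {
		l = rate.NewLimiter(h.perSec, h.burst)
		h.limiters[host] = l
	}
	h.mu.Unlock()
	return l.Wait(ctx)
}
```
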
#9 Add links by content type to the main page data (ScottMansfield, opened 9 years ago, 0 comments)

#8 Filter anchor links out of the OUT_LINKS field (ScottMansfield, closed 9 years ago, 0 comments)

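OUT_LINKS appears to be the page record's outgoing-link field; fragment-only hrefs point back into the same document and can be filtered before that field is written. A sketch of such a check:

```go
package links

import (
	"net/url"
	"strings"
)

// IsSamePageAnchor reports whether href points back into the page it was found
// on: either a bare fragment ("#top") or the page URL plus a fragment.
func IsSamePageAnchor(href string, page *url.URL) bool {
	href = strings.TrimSpace(href)
	if href == "" || strings.HasPrefix(href, "#") {
		return true
	}
	u, err := page.Parse(href) // resolve relative hrefs against the page URL
	if err != nil {
		return false
	}
	u.Fragment = ""
	p := *page
	p.Fragment = ""
	return u.String() == p.String()
}
```
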
#7 a tags with img tags containing the same image should not be sent back to the fetch stage (ScottMansfield, closed 9 years ago, 1 comment)

#6 Have a better story around local caching independent of the crawling stages (ScottMansfield, opened 9 years ago, 1 comment)

#5 Investigate more accurate timing of website response (ScottMansfield, opened 9 years ago, 1 comment)

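A sketch of more precise timing using the standard library's httptrace hooks, which separate connection setup from the server's time to first byte instead of one wall-clock measurement around the whole request; field names are illustrative.

```go
package fetch

import (
	"net/http"
	"net/http/httptrace"
	"time"
)

// Timings splits a fetch into phases a single stopwatch would blur together.
type Timings struct {
	ConnectDone time.Duration // TCP (and TLS) setup finished
	FirstByte   time.Duration // server sent its first response byte
	Headers     time.Duration // response headers fully received (client.Do returned)
}

// TimedGet issues a GET and records connection setup and time to first byte.
func TimedGet(client *http.Client, url string) (*http.Response, Timings, error) {
	var t Timings
	start := time.Now()
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, t, err
	}
	trace := &httptrace.ClientTrace{
		ConnectDone: func(network, addr string, err error) {
			t.ConnectDone = time.Since(start)
		},
		GotFirstResponseByte: func() {
			t.FirstByte = time.Since(start)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	t.Headers = time.Since(start)
	return resp, t, err
}
```
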
#4 Support for sitemap.xml (ScottMansfield, opened 9 years ago, 2 comments)

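A sketch of parsing a sitemaps.org <urlset> document into seed URLs; sitemap index files (<sitemapindex>) would need a second pass and are not handled here.

```go
package sitemap

import (
	"encoding/xml"
	"io"
)

// urlset mirrors the <urlset> document defined by sitemaps.org; only the
// fields a crawler needs for seeding are included.
type urlset struct {
	URLs []struct {
		Loc     string `xml:"loc"`
		LastMod string `xml:"lastmod"`
	} `xml:"url"`
}

// Parse reads a sitemap.xml body and returns the listed page URLs so they can
// be fed into the crawl queue.
func Parse(r io.Reader) ([]string, error) {
	var s urlset
	if err := xml.NewDecoder(r).Decode(&s); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(s.URLs))
	for _, u := range s.URLs {
		locs = append(locs, u.Loc)
	}
	return locs, nil
}
```
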
#3 Add support for If-Modified-Since and ETag headers (ScottMansfield, opened 9 years ago, 1 comment)

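A sketch of a conditional refetch: keep the validators from the last crawl, send If-None-Match / If-Modified-Since, and treat a 304 as "unchanged". How widow would persist validators between crawls is not addressed here.

```go
package fetch

import "net/http"

// Cached holds the validators returned by a previous fetch of the same URL.
type Cached struct {
	ETag         string
	LastModified string
}

// ConditionalGet refetches url, sending If-None-Match / If-Modified-Since when
// validators are known. The boolean is true when the server answered 304.
func ConditionalGet(client *http.Client, url string, prev *Cached) (*http.Response, bool, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, false, err
	}
	if prev != nil {
		if prev.ETag != "" {
			req.Header.Set("If-None-Match", prev.ETag)
		}
		if prev.LastModified != "" {
			req.Header.Set("If-Modified-Since", prev.LastModified)
		}
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	return resp, resp.StatusCode == http.StatusNotModified, nil
}
```
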
#2 Add support for robots.txt for any website (ScottMansfield, opened 9 years ago, 3 comments)

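A deliberately small sketch of robots.txt handling that parses only User-agent and Disallow lines; a production crawler would also want Allow, Crawl-delay, wildcard handling, and caching of the fetched file.

```go
package robots

import (
	"bufio"
	"io"
	"strings"
)

// Rules holds the Disallow prefixes that apply to our user agent.
type Rules struct {
	disallow []string
}

// Parse reads robots.txt and keeps rules from the "*" group and from any group
// whose User-agent token appears in agent (matched case-insensitively).
func Parse(r io.Reader, agent string) (*Rules, error) {
	rules := &Rules{}
	applies := false
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if i := strings.Index(line, "#"); i >= 0 {
			line = strings.TrimSpace(line[:i])
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		switch key {
		case "user-agent":
			applies = val == "*" || strings.Contains(strings.ToLower(agent), strings.ToLower(val))
		case "disallow":
			if applies && val != "" {
				rules.disallow = append(rules.disallow, val)
			}
		}
	}
	return rules, sc.Err()
}

// Allowed reports whether path may be fetched under the parsed rules.
func (r *Rules) Allowed(path string) bool {
	for _, p := range r.disallow {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}
```
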
#1 Split links into raw and normalized (ScottMansfield, closed 9 years ago, 0 comments)

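A sketch of carrying both forms of a link: the raw href as it appeared in the page and a normalized absolute URL for dedup and queueing. The normalization steps shown are common choices, not necessarily the ones widow settled on.

```go
package links

import (
	"net/url"
	"strings"
)

// Link keeps the href exactly as found in the page plus a normalized absolute
// form; the raw value helps debugging, the normalized value is what dedup and
// fetch stages should compare.
type Link struct {
	Raw        string
	Normalized string
}

// Normalize resolves raw against the page URL, lowercases scheme and host,
// strips the fragment, and drops a default port.
func Normalize(raw string, page *url.URL) (Link, error) {
	u, err := page.Parse(strings.TrimSpace(raw))
	if err != nil {
		return Link{}, err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	if (u.Scheme == "http" && u.Port() == "80") || (u.Scheme == "https" && u.Port() == "443") {
		u.Host = u.Hostname()
	}
	return Link{Raw: raw, Normalized: u.String()}, nil
}
```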