bbcarchdev/anansi
A Linked Open Data Web crawler
https://bbcarchdev.github.io/anansi/
Apache License 2.0
0
stars
0
forks
issues
Handle 307 (Temporary Redirect) and 308 (Permanent Redirect) properly
#70
nevali
opened
7 years ago
1
Clustering fails with divide by 0 error from the database
#69
CygnusAlpha
opened
7 years ago
11
Anansi sometimes fails to follow 303 redirects
#68
townxelliot
opened
7 years ago
19
Fragments should be ignored when populating location and content_location metadata fields
#67
nevali
closed
7 years ago
1
Allow partitioning of the cluster based upon newness
#66
nevali
opened
8 years ago
0
Allow the queue to prioritise new URIs over refreshes
#65
nevali
closed
8 years ago
2
Added code for plugins and an example plugin logging URI being added
#64
cgueret
closed
8 years ago
1
Anansi does not respect the control rate
#63
cgueret
closed
7 years ago
4
URI parsing failures
#62
cgueret
opened
8 years ago
4
Link not parsed in header, causes rejects on licensing
#61
cgueret
closed
8 years ago
1
Add a parameter to control the exploration of the crawler
#60
cgueret
closed
7 years ago
2
Allow a plug-in to perform actions when a new URI is added to the queue
#59
nevali
opened
8 years ago
2
crawler-add should ignore blank lines
#58
nevali
opened
8 years ago
0
When the crawler is terminated, the cluster_leave() is not invoked
#57
nevali
opened
9 years ago
0
Handle HTTP/HTTPS licenses
#56
cgueret
closed
9 years ago
1
Rearrange libraries and executables to make purpose clearer
#55
nevali
opened
9 years ago
0
queue schema does not support URIs longer than 255 chars
#54
simeonvandersteen
closed
9 years ago
0
Add a common memory allocation API which invokes abort() on failure
#53
nevali
opened
9 years ago
0
Add a setting to halt when a schema update is required, rather than migrating automatically
#52
nevali
closed
7 years ago
0
Add message IDs for anything more severe than LOG_DEBUG
#51
nevali
opened
9 years ago
1
Move clustering support to a separate library which can be used by other components (e.g., Twine)
#50
nevali
closed
9 years ago
0
Support storing cached resources in WARC format
#49
nevali
opened
9 years ago
0
Use a different state than REJECTED when a redirect payload is ignored
#48
nevali
closed
8 years ago
0
Harmonise command-line options across Anansi, Twine, Quilt
#47
nevali
opened
9 years ago
0
Add rel="meta" as an equivalent to rel="alternate" in HTML parsing
#46
nevali
opened
9 years ago
0
Triples with a subject based in http://233a..1280;9a39~20150531t140000z--pt0h30m0s
#45
cgueret
closed
9 years ago
5
Some original language tags are being crawled and imported
#44
cgueret
closed
9 years ago
2
Add support for an external license look-up service
#43
nevali
opened
9 years ago
0
Add customisable TTLs per root
#42
nevali
opened
9 years ago
0
Set earliest-fetch date on a root prior to fetching a resource from that root to avoid race condition
#41
nevali
closed
7 years ago
0
Explicitly remove an instance from a cluster when it shuts down, rather than relying on registry refresh timeout
#40
nevali
opened
9 years ago
0
Either terminate or recover after cluster heartbeat failures
#39
nevali
opened
9 years ago
0
Log messages about clustering are formatted inconsistently
#38
nevali
closed
9 years ago
0
De-prioritise fetching from sites which have a low success rate
#37
nevali
opened
9 years ago
0
Make use of the 'rate' queue field, including a threshold
#36
nevali
closed
9 years ago
0
Track the number of successful retrievals versus total per crawl root (i.e., success rate)
#35
nevali
opened
9 years ago
0
Notice-level event logging should be restricted to outcomes
#34
nevali
closed
9 years ago
0
Disk cache module emits log noise even when not in use
#33
nevali
closed
9 years ago
0
Cluster re-balancing is insufficiently robust
#32
nevali
closed
9 years ago
1
Add ability to prioritise new resources
#31
nevali
closed
7 years ago
0
LOD: License validation does not canonicalise URL forms
#30
nevali
opened
9 years ago
1
Link header URIs should be evaluated as being relative to content-location
#29
nevali
closed
9 years ago
0
Consolidate RFC822-style header parsing
#28
nevali
opened
9 years ago
0
Crawler RDF processor Accept list should not be hard-coded
#27
nevali
opened
9 years ago
0
Perform schema migration before forking
#26
nevali
opened
9 years ago
0
Crawler does not terminate when all threads have died
#25
nevali
closed
9 years ago
1
Resource TTLs and back-offs should be configurable
#24
nevali
opened
9 years ago
0
Transaction failures aren't handled consistently between Postgres and MySQL
#23
nevali
opened
9 years ago
0
crawler-add -c parameter has no effect
#22
nevali
opened
9 years ago
0
Cluster threads are started before detach
#21
nevali
closed
9 years ago
0