commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
Apache License 2.0
312 stars
34 forks
issues
Route tuples to the status updater bolt based on URLs
#65
jnioche
closed
7 months ago
0
Have as many WARCBolt instances as there are workers
#64
jnioche
closed
7 months ago
0
news-crawl 2.x Broken when using multiple workers (across multiple hosts)
#63
jnioche
closed
7 months ago
17
Port topology and resources to StormCrawler 2.10
#60
jnioche
opened
8 months ago
2
Nutch-compatible implementation of FastURLFilter + use it in PreFilterBolt
#59
jnioche
closed
8 months ago
0
Avoid following advertisements in news feeds and sitemaps
#58
sebastian-nagel
opened
8 months ago
0
News archive is not available since 2023-10-23 15:36:50
#57
zikolach
closed
9 months ago
1
mvn clean package fails on Mac on Apple M1 Pro chip
#56
raphaelzhou1
closed
11 months ago
5
produce WET files?
#55
chris-ha458
opened
1 year ago
6
Consider archiving of news feeds and sitemaps
#54
sebastian-nagel
opened
1 year ago
0
Explore schema.org annotations for seed completions
#53
sebastian-nagel
opened
1 year ago
0
Bump jackson-databind from 2.11.1 to 2.12.7.1
#52
dependabot[bot]
closed
8 months ago
1
Bump jackson-databind from 2.11.1 to 2.13.4.1
#51
dependabot[bot]
closed
1 year ago
1
Use wikidata to complete seeds
#50
sebastian-nagel
opened
1 year ago
1
Bump jackson-databind from 2.11.1 to 2.12.6.1
#49
dependabot[bot]
closed
1 year ago
1
How large is the dataset
#48
sljlp
closed
1 year ago
2
Run docker in a non-interactive way
#47
scarcenine
closed
1 year ago
1
News archive is not available since 06.06.2021
#46
zikolach
closed
3 years ago
3
How to get a listing of WARC/WAT/WET files using HTTP for News Dataset?
#45
brand17
closed
3 years ago
2
Odd duplicate content behaviour on www.diariodeavila.es domain
#44
staylor-ds
closed
3 years ago
4
Error in build docker
#43
Ashbajawed
closed
3 years ago
3
Do not use "http/2" protocol version in HTTP headers in WARC files
#42
sebastian-nagel
opened
3 years ago
2
Allow to follow news sites not providing RSS/Atom feed or news sitemap
#41
sebastian-nagel
opened
4 years ago
2
Automatic removal of ephemeral sitemaps
#40
sebastian-nagel
opened
4 years ago
0
Bump jackson-databind from 2.9.10.1 to 2.10.0.pr1
#39
dependabot[bot]
closed
3 years ago
1
Unable to fetch data from Elasticsearch, no content is showing
#38
kaal-zathura24
closed
4 years ago
2
Bump jackson-databind from 2.9.10.1 to 2.9.10.4
#37
dependabot[bot]
closed
4 years ago
1
Bump jackson-databind from 2.9.10.1 to 2.9.10.3
#36
dependabot[bot]
closed
4 years ago
1
NewsSiteMapParserBolt: do not detect feeds as sitemaps
#35
sebastian-nagel
opened
4 years ago
0
Add HTTP protocol version to HTTP request message
#34
sebastian-nagel
opened
4 years ago
1
Bump jackson-databind from 2.9.9.2 to 2.9.10.1
#33
dependabot[bot]
closed
4 years ago
1
Check cross-submits for sitemaps
#32
sebastian-nagel
opened
5 years ago
1
WARC file format improvement: add WARC-Truncated header
#31
sebastian-nagel
closed
5 years ago
1
WARC file format fix: mask HTTP header fields Content-Encoding and Transfer-Encoding, adjust Content-Length
#30
sebastian-nagel
closed
5 years ago
1
WARC file format fix: add WARC-IP-Address
#29
sebastian-nagel
closed
5 years ago
1
Endless refetch of URLs due to changing domain names
#28
sebastian-nagel
closed
5 years ago
2
Error: Could not find or load main class com.digitalpebble.stormcrawler.elasticsearch.ESSeedInjector
#27
rishrockstar
closed
5 years ago
2
amazing dataset!
#26
randomgambit
closed
6 years ago
1
Full support for sitemap extensions and namespaces
#25
sebastian-nagel
closed
5 years ago
0
Crawl-delay in robots.txt should not shrink delay configured by fetcher.server.delay
#24
sebastian-nagel
closed
5 years ago
1
NewsSiteMapParserBolt fails to parse valid XML sitemap
#23
sebastian-nagel
closed
5 years ago
1
Should follow subsitemaps in sitemap index
#22
sebastian-nagel
closed
6 years ago
0
URL filter: exclude localhost and private addresses
#21
sebastian-nagel
closed
6 years ago
2
Upgrade to Storm 1.2.1 and ES 6.1.1
#20
sebastian-nagel
closed
6 years ago
3
AdaptiveScheduler not applied to RSS/Atom feeds
#19
sebastian-nagel
closed
6 years ago
2
Extract publishing date
#18
fhamborg
opened
7 years ago
2
Provide indexing outside AWS
#17
john-hewitt
closed
7 years ago
3
URLs with trailing white space continuously re-fetched
#16
sebastian-nagel
closed
7 years ago
4
Adaptive fetch schedule for feeds
#15
sebastian-nagel
closed
7 years ago
2
Ensure that seeds are refetched from time to time even if failed or redirected
#14
sebastian-nagel
closed
7 years ago
1