ScaleUnlimited/flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Apache License 2.0
51 stars, 18 forks
issues
Updated Flink version to 1.7.2.
#171
vmagotra
closed
5 years ago
0
Update Flink dependency to 1.7.2
#170
kkrugler
closed
5 years ago
1
Now support either -warccontentpath or -textcontentfile
#169
Schmed
closed
6 years ago
0
WARC support expects output sink without support from CrawlToolOptions
#168
Schmed
closed
6 years ago
0
Checkpoint SeedUrlSource
#167
kkrugler
closed
6 years ago
0
Switch to simpler approach for moving average
#166
kkrugler
closed
6 years ago
1
Save state in MovingAverageAggregator
#165
kkrugler
closed
6 years ago
0
Save domain score state
#164
kkrugler
closed
6 years ago
0
Include UrlDBFunction's domain scores in saved state
#163
Schmed
opened
6 years ago
0
Sync up the total active urls value with state in UrlDbFunction
#162
vmagotra
opened
6 years ago
1
74 more metrics
#161
vmagotra
closed
6 years ago
1
Switch to real source for DomainScore tuples
#160
kkrugler
opened
6 years ago
0
143 add domain quality input
#159
kkrugler
closed
6 years ago
0
Updated crawler-commons to v 0.10
#158
vmagotra
closed
6 years ago
0
No ticklers
#157
kkrugler
closed
6 years ago
1
133 stream test harness
#156
Schmed
closed
6 years ago
2
All crawls are now “focused”
#155
Schmed
closed
6 years ago
0
Try flink-crawler with crawler-commons 0.10 release candidate
#154
kkrugler
closed
6 years ago
3
Get rid of non-focused crawl support in code
#153
kkrugler
closed
6 years ago
0
Output content to S3 in WARC file format.
#152
vmagotra
closed
6 years ago
1
CrawlTopologyTest.testAsync now fails for me
#151
Schmed
opened
6 years ago
5
148 fetch status counters
#150
Schmed
closed
6 years ago
0
Update use of accumulators in parsing code
#149
kkrugler
closed
6 years ago
0
FetchStatus->Queued is decremented, but never incremented
#148
Schmed
closed
6 years ago
2
Can't resolve dependency com.github.crawler-commons:http-fetcher:0.1-SNAPSHOT
#147
IvanBiv
closed
6 years ago
5
Fix unit test(s) broken by Flink 1.4->1.5 upgrade
#146
Schmed
closed
6 years ago
2
Add java formatting support from Maven, reformat code
#145
kkrugler
closed
6 years ago
0
Refactored CrawlTool to pull out private methods to a new class Crawl…
#144
vmagotra
closed
6 years ago
0
Add optional domain quality input to UrlDBFunction
#143
kkrugler
closed
6 years ago
0
Create a CrawlToolUtils class with helper methods to set up page, sitemap and robots fetchers and the url lengthener.
#142
vmagotra
closed
6 years ago
0
Start using Flink 1.5-SNAPSHOT
#141
Schmed
closed
6 years ago
3
Upgrade to Flink 1.5
#140
Schmed
closed
6 years ago
5
Get rid of crawlDB parallelism, clean up naming/comments
#139
kkrugler
closed
6 years ago
0
Verify metrics using reporter approach
#138
kkrugler
opened
6 years ago
2
Switched the goal for the build-helper-maven-plugin to add-test-sourc…
#137
vmagotra
closed
6 years ago
0
Verify that integration tests can be executed via the command line
#136
vmagotra
closed
6 years ago
1
Get rid of crawldbparallelism CLI parameter
#135
kkrugler
closed
6 years ago
0
Cleaner termination
#134
kkrugler
closed
6 years ago
0
Try using stream harness support for unit testing
#133
Schmed
opened
6 years ago
7
Name the stream returned after assigning the sink to parsedUrls - thi…
#132
vmagotra
closed
6 years ago
0
ParseFunction is not displayed in the execution graph
#131
vmagotra
closed
6 years ago
2
128 queued status
#130
Schmed
closed
6 years ago
0
Make FetchQueue a real priority queue
#129
kkrugler
closed
6 years ago
0
Support separate QUEUED vs. FETCHING status
#128
kkrugler
closed
6 years ago
2
Make lots of .debug() calls into .trace(), use slf4j formatting
#127
kkrugler
closed
6 years ago
0
Make FetchQueue a priority queue
#126
kkrugler
closed
6 years ago
0
Pass through code to cleanup constants
#125
kkrugler
closed
6 years ago
0
Add checkpointing test
#124
kkrugler
closed
6 years ago
1
Make CC fetcher responsive to interrupts
#123
kkrugler
opened
6 years ago
0
Add support for rocksdb CLI option
#122
kkrugler
opened
6 years ago
0