ScaleUnlimited / flink-crawler

Continuous scalable web crawler built on top of Flink and crawler-commons
Apache License 2.0

Fix termination issues when job is cancelled #111

Closed (kkrugler closed this 6 years ago)

kkrugler commented 6 years ago

Currently, when the flink-crawler job is cancelled, we get a number of errors (and other interesting output) in the log (see below). Some comments about this:

  1. I think we need to make CommonCrawlFetcher interruptible.
  2. It seems like we're not able to spend enough time terminating our Executor before Flink times out.
  3. We have activity from our function threads being logged after everything has been shut down.
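
A rough sketch of what points 1 and 2 could look like in plain `java.util.concurrent` terms. This is not the actual CommonCrawlFetcher or ThreadedExecutor code; `BoundedShutdown`, `interruptibleFetch`, `stopWithin`, and `demo` are hypothetical names, and the `Thread.sleep` call only stands in for real fetch work. The idea is that the worker polls its interrupt flag between chunks of work, and the shutdown path interrupts the pool and then waits for a bounded time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; not part of flink-crawler.
final class BoundedShutdown {

    /** A stand-in for a long fetch that checks the interrupt flag between chunks. */
    static Runnable interruptibleFetch(CountDownLatch started, AtomicBoolean stoppedCleanly) {
        return () -> {
            started.countDown();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(10); // stands in for reading one chunk of a response
                } catch (InterruptedException e) {
                    // sleep() clears the flag when it throws; restore it so the loop exits.
                    Thread.currentThread().interrupt();
                }
            }
            stoppedCleanly.set(true);
        };
    }

    /** Interrupts the pool's workers, then waits at most timeoutMs for them to exit. */
    static boolean stopWithin(ExecutorService pool, long timeoutMs) {
        pool.shutdownNow(); // interrupts running workers and drops queued tasks
        try {
            return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // The thread doing the shutdown was itself interrupted: keep the flag
            // set and report that termination was not observed.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Starts one interruptible worker, then checks that it stops within the timeout. */
    static boolean demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean stoppedCleanly = new AtomicBoolean(false);
        pool.execute(interruptibleFetch(started, stoppedCleanly));
        try {
            started.await(); // make sure the worker is running before we stop it
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return stopWithin(pool, 5_000) && stoppedCleanly.get();
    }
}
```

Because the worker reacts to the interrupt within one short chunk, the bounded wait in `stopWithin` can be kept small. A worker that ignores interrupts would instead force the caller to wait out the full timeout.
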
2018-03-16 15:46:04,467 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09).
2018-03-16 15:46:04,467 INFO  org.apache.flink.runtime.taskmanager.Task                     - IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,467 DEBUG com.scaleunlimited.flinkcrawler.functions.CheckUrlWithRobotsFunction  - Queueing for robots check: http://www.gemalto.com/companyinfo/media/press-releases (REGULAR)
2018-03-16 15:46:04,468 DEBUG com.scaleunlimited.flinkcrawler.functions.CheckUrlWithRobotsFunction  - Found cached rule for 'http://www.gemalto.com/companyinfo/media/press-releases (REGULAR)', collecting
2018-03-16 15:46:04,485 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09).
2018-03-16 15:46:04,488 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f).
2018-03-16 15:46:04,491 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,491 INFO  org.apache.flink.streaming.runtime.tasks.StreamIterationHead  - Iteration head IterationSource-11 (1/1) removed feedback queue under 2a180d603971d9d1bfa7f1ac44819dec-broker-11-0
2018-03-16 15:46:04,492 INFO  org.apache.flink.runtime.taskmanager.Task                     - IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,492 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09).
2018-03-16 15:46:04,492 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task IterationSource-11 (1/1) (51ab2be3c0afdf20f0ba748970d15f09) [CANCELED]
2018-03-16 15:46:04,492 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f).
2018-03-16 15:46:04,493 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8).
2018-03-16 15:46:04,494 INFO  org.apache.flink.runtime.taskmanager.Task                     - LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8).
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674).
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674).
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7).
2018-03-16 15:46:04,495 INFO  org.apache.flink.runtime.taskmanager.Task                     - UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,496 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7).
2018-03-16 15:46:04,496 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230).
2018-03-16 15:46:04,498 INFO  org.apache.flink.runtime.taskmanager.Task                     - CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,501 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Could not shut down timer service
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
    at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
    at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:747)
2018-03-16 15:46:04,501 INFO  com.scaleunlimited.flinkcrawler.utils.ThreadedExecutor        - Waiting for pool termination (10 SECONDS)
2018-03-16 15:46:04,502 INFO  org.apache.flink.runtime.taskmanager.Task                     - DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,502 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674).
2018-03-16 15:46:04,502 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task DomainDBFunction (1/1) (a8665cb1055bea62957e695eef696674) [CANCELED]
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8).
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (ebac8c804a0ed5492492de78e246aaf8) [CANCELED]
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f).
2018-03-16 15:46:04,503 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task Source: Seed urls source -> Map (1/1) (2905777a74449d34a406278cbabb943f) [CANCELED]
2018-03-16 15:46:04,504 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230).
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66).
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66).
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18).
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,505 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18).
2018-03-16 15:46:04,506 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307).
2018-03-16 15:46:04,506 INFO  org.apache.flink.runtime.taskmanager.Task                     - LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,506 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307).
2018-03-16 15:46:04,507 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05).
2018-03-16 15:46:04,507 INFO  org.apache.flink.runtime.taskmanager.Task                     - IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05) switched from RUNNING to CANCELING.
2018-03-16 15:46:04,510 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05).
2018-03-16 15:46:04,514 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask           - Could not shut down timer service
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
    at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1465)
    at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService.shutdownAndAwaitPending(SystemProcessingTimeService.java:197)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:317)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:747)
2018-03-16 15:46:04,514 INFO  com.scaleunlimited.flinkcrawler.utils.ThreadedExecutor        - Waiting for pool termination (100 SECONDS)
2018-03-16 15:46:04,515 INFO  org.apache.flink.runtime.taskmanager.Task                     - FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,515 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18).
2018-03-16 15:46:04,515 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (1/1) (49e421d9f0638842b26b908dd32c7a18) [CANCELED]
2018-03-16 15:46:04,515 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task IterationSource-11 (51ab2be3c0afdf20f0ba748970d15f09)
2018-03-16 15:46:04,517 INFO  com.scaleunlimited.flinkcrawler.utils.ThreadedExecutor        - Waiting for pool termination (100 SECONDS)
2018-03-16 15:46:04,518 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task DomainDBFunction (a8665cb1055bea62957e695eef696674)
2018-03-16 15:46:04,518 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (ebac8c804a0ed5492492de78e246aaf8)
2018-03-16 15:46:04,518 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task Source: Seed urls source -> Map (2905777a74449d34a406278cbabb943f)
2018-03-16 15:46:04,518 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task FetchUrlsFunction for sitemap -> (Select fetched URLs -> ParseSiteMapFunction -> OutlinkToStateUrlFunction, Select fetch status -> HandleFailedSiteMapFunction -> Sink: Unnamed) (49e421d9f0638842b26b908dd32c7a18)
2018-03-16 15:46:04,520 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Didn't find 'http://coplacdigital.org/resources/archival-resources/'
2018-03-16 15:46:04,528 INFO  org.apache.flink.runtime.taskmanager.Task                     - CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,528 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230).
2018-03-16 15:46:04,528 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (1/1) (27cb7dfb4bbe78901d79e0f558072230) [CANCELED]
2018-03-16 15:46:04,532 INFO  com.scaleunlimited.flinkcrawler.utils.ThreadedExecutor        - Waiting for pool termination (10 SECONDS)
2018-03-16 15:46:04,551 INFO  org.apache.flink.runtime.taskmanager.Task                     - LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,551 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307).
2018-03-16 15:46:04,551 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (1/1) (eddfd03d08ddae8009a424aa150ee307) [CANCELED]
2018-03-16 15:46:04,553 INFO  org.apache.flink.runtime.taskmanager.Task                     - IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,553 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05).
2018-03-16 15:46:04,554 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task IterationSink-11 (1/1) (241b8bbfdfe871bf84a6627192ddee05) [CANCELED]
2018-03-16 15:46:04,555 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66).
2018-03-16 15:46:04,555 INFO  org.apache.flink.runtime.taskmanager.Task                     - Task FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) is already in state CANCELING
2018-03-16 15:46:04,556 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Read 159,822 byte segment #275518 at 665,905,194 offset within cdx-00066.gz in 602ms (265,485 bytes/sec) for 'https://www.dealsplus.com/coupons/facebook'
2018-03-16 15:46:04,561 INFO  com.scaleunlimited.flinkcrawler.utils.ThreadedExecutor        - Waiting for pool termination (100 SECONDS)
2018-03-16 15:46:04,561 WARN  com.scaleunlimited.flinkcrawler.functions.ParseFunction       - Parsing exception https://www.bkosborne.com/blog/keeping-view-upcoming-events-fresh-drupal-8 (text/html)
java.lang.InterruptedException
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
    at java.util.concurrent.FutureTask.get(FutureTask.java:204)
    at com.scaleunlimited.flinkcrawler.parser.SimplePageParser.parse(SimplePageParser.java:170)
    at com.scaleunlimited.flinkcrawler.functions.ParseFunction.flatMap(ParseFunction.java:47)
    at com.scaleunlimited.flinkcrawler.functions.ParseFunction.flatMap(ParseFunction.java:20)
    at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.pushToOperator(OperatorChain.java:464)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.collect(OperatorChain.java:441)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.collect(OperatorChain.java:415)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
    at org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:41)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.pushToOperator(OperatorChain.java:464)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.collect(OperatorChain.java:441)
    at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.collect(OperatorChain.java:415)
    at org.apache.flink.streaming.api.collector.selector.CopyingDirectedOutput.collect(CopyingDirectedOutput.java:59)
    at org.apache.flink.streaming.api.collector.selector.CopyingDirectedOutput.collect(CopyingDirectedOutput.java:34)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:831)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:809)
    at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
    at org.apache.flink.streaming.api.operators.async.Emitter.output(Emitter.java:133)
    at org.apache.flink.streaming.api.operators.async.Emitter.run(Emitter.java:85)
    at java.lang.Thread.run(Thread.java:747)
2018-03-16 15:46:04,561 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66).
2018-03-16 15:46:04,561 INFO  org.apache.flink.runtime.taskmanager.Task                     - Task FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) is already in state CANCELING
2018-03-16 15:46:04,562 INFO  org.apache.flink.runtime.taskmanager.Task                     - FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,562 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66).
2018-03-16 15:46:04,562 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (1/1) (10de750d7939d7cfb037e8c522432b66) [CANCELED]
2018-03-16 15:46:04,566 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task CheckUrlWithRobotsFunction -> (Select blocked URLs, Select passed URLs, Select sitemap URLs) (27cb7dfb4bbe78901d79e0f558072230)
2018-03-16 15:46:04,566 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task LengthenUrlsFunction -> NormalizeUrlsFunction -> ValidUrlsFilter (eddfd03d08ddae8009a424aa150ee307)
2018-03-16 15:46:04,566 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task IterationSink-11 (241b8bbfdfe871bf84a6627192ddee05)
2018-03-16 15:46:04,567 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task FetchUrlsFunction -> (Select fetch status, Select fetched URLs -> ParseFunction -> (OutlinkToStateUrlFunction, Select fetched content -> Sink: ContentSink, ContentTextSink -> Sink: Unnamed)) (10de750d7939d7cfb037e8c522432b66)
2018-03-16 15:46:04,569 DEBUG com.scaleunlimited.flinkcrawler.functions.UrlDBFunction       - 1139  http://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf
2018-03-16 15:46:04,582 INFO  org.apache.flink.runtime.taskmanager.Task                     - UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7) switched from CANCELING to CANCELED.
2018-03-16 15:46:04,582 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7).
2018-03-16 15:46:04,583 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task UrlDBFunction (1/1) (6284dbf11c5042f2ae98afa82f4e5ac7) [CANCELED]
2018-03-16 15:46:04,583 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Un-registering task and sending final execution state CANCELED to JobManager for task UrlDBFunction (6284dbf11c5042f2ae98afa82f4e5ac7)
2018-03-16 15:46:04,588 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Read 824 bytes from page at 680,418 offset from 'crawl-data/CC-MAIN-2017-22/segments/1495463607848.9/crawldiagnostics/CC-MAIN-20170524152539-20170524172539-00395.warc.gz' in 1ms (824,000 bytes/sec) for 'http://bestpractical.com/'
2018-03-16 15:46:04,588 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Fetching 'https://bestpractical.com/' with 1 redirects
2018-03-16 15:46:04,598 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Read 1,121 bytes from page at 539,817,578 offset from 'crawl-data/CC-MAIN-2017-22/segments/1495463612036.99/warc/CC-MAIN-20170529072631-20170529092631-00137.warc.gz' in 1ms (1,121,000 bytes/sec) for 'http://www.batterypoweronline.com/'
2018-03-16 15:46:04,598 DEBUG com.scaleunlimited.flinkcrawler.fetcher.commoncrawl.CommonCrawlFetcher  - Fetched 'http://www.batterypoweronline.com/' (200)
2018-03-16 15:46:04,598 DEBUG com.scaleunlimited.flinkcrawler.functions.FetchUrlsFunction   - Fetched 50201 bytes from 'http://www.batterypoweronline.com/'

kkrugler commented 6 years ago

I believe this will be resolved when the "cleaner termination" PR (#134) is merged. We have a separate issue to make the CC fetcher interruptible. It also turns out that the InterruptedException is just Flink responding to the cancel request.
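
For reference, a minimal sketch of treating an `InterruptedException` during a blocking `Future.get()` as a cancellation signal rather than a failure. The parsing exception in the log above comes from such a `get()` call. This is an illustration under assumptions, not the SimplePageParser code or the change in the PR; `CancelAwareWait`, `Outcome`, `waitFor`, and `demo` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not the flink-crawler implementation.
final class CancelAwareWait {

    enum Outcome { COMPLETED, CANCELLED, FAILED }

    /** Waits for the task; an interrupt is reported as CANCELLED, not as a failure. */
    static Outcome waitFor(Future<?> task) {
        try {
            task.get();
            return Outcome.COMPLETED;
        } catch (InterruptedException e) {
            task.cancel(true);                  // stop the work we were waiting on
            Thread.currentThread().interrupt(); // keep the flag set for the caller
            return Outcome.CANCELLED;
        } catch (ExecutionException e) {
            return Outcome.FAILED;              // the task itself threw
        }
    }

    /** Simulates a cancel request arriving while the caller waits on a task. */
    static Outcome demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // A task that never finishes on its own (blocks until interrupted).
            Future<?> neverDone = pool.submit(() -> {
                new CountDownLatch(1).await();
                return null;
            });
            Thread.currentThread().interrupt(); // pretend the cancel interrupted us
            Outcome outcome = waitFor(neverDone);
            Thread.interrupted();               // clear the flag so the demo returns cleanly
            return outcome;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With this shape, the caller can log a `CANCELLED` outcome at a lower level than a real parse failure, so a cancelled job does not produce warning-level stack traces.
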