apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

NPE in WARCHdfsBolt on cleanup() #720

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

When a (local) topology is killed and no tuples have been passed to the WARCHdfsBolt, cleanup() raises an NPE:

68227 [Thread-91-warc-executor[36 36]] INFO  o.a.s.util - Async loop interrupted!
68227 [Thread-89-disruptor-executor[36 36]-send-queue] INFO  o.a.s.util - Async loop interrupted!
68227 [SLOT_1024] INFO  c.d.s.w.GzipHdfsBolt - Cleanup called on bolt
68227 [SLOT_1024] ERROR o.a.s.d.s.Slot - Error when processing event
java.lang.NullPointerException: null
        at com.digitalpebble.stormcrawler.warc.GzipHdfsBolt.cleanup(GzipHdfsBolt.java:187) ~[storm-crawler-fight-2.0-SNAPSHOT.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_181]
        ...
        at org.apache.storm.ProcessSimulator.killProcess(ProcessSimulator.java:67) ~[storm-core-1.2.2.jar:1.2.2]
        at org.apache.storm.daemon.supervisor.LocalContainer.kill(LocalContainer.java:69) ~[storm-core-1.2.2.jar:1.2.2]
        ...

Of course, it's a minor issue (the topology is shut down anyway), but we should either always initialize the output stream or check whether it is initialized.

sebastian-nagel commented 5 years ago

Clarification: this also happens in local mode with a successfully written WARC file. Possibly related: we should check why (in local mode?) a checksum filesystem was chosen despite the configuration:

# grep -A1 warc crawler-conf.yaml 
  warc:
    fs.file.impl: "org.apache.hadoop.fs.RawLocalFileSystem"

# ls -a /tmp/warc/
.crawl-20190510132330-01-00000.warc.gz.crc
crawl-20190510132330-01-00000.warc.gz

jnioche commented 5 years ago

Did you specify the config key for the bolt?

      - name: "withConfigKey"
        args:
          - "warc"

jnioche commented 5 years ago

Re the error in cleanup(): we could simply check that this.out is not null.
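
A minimal sketch of that null check, for illustration only. This is not the actual GzipHdfsBolt code or the merged patch: the class `NullSafeCleanupSketch`, its `execute(byte[])` method and the in-memory stream are hypothetical stand-ins, and the only assumption carried over from the discussion is that the output stream field stays null until the first tuple arrives.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the bolt; not the actual GzipHdfsBolt source.
public class NullSafeCleanupSketch {

    // Assumption: the stream is only created once the first record is written.
    private OutputStream out;

    // Simulates receiving a tuple: lazily opens the stream and writes to it.
    public void execute(byte[] record) {
        try {
            if (out == null) {
                out = new ByteArrayOutputStream();
            }
            out.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if a stream was closed, false if there was nothing to clean up.
    public boolean cleanup() {
        if (out == null) {
            // Guard: no tuple was ever processed, so there is no stream to close.
            System.out.println("Nothing to cleanup: output stream not initialized");
            return false;
        }
        try {
            out.flush();
            out.close();
        } catch (IOException e) {
            System.err.println("Error while closing output stream: " + e);
        } finally {
            // Reset the field so a repeated cleanup() call takes the guarded path.
            out = null;
        }
        return true;
    }

    public static void main(String[] args) {
        // Bolt that never received a tuple: cleanup() must not throw.
        NullSafeCleanupSketch idle = new NullSafeCleanupSketch();
        System.out.println("idle cleanup closed a stream: " + idle.cleanup());

        // Bolt that wrote one record: first cleanup() closes, second is a no-op.
        NullSafeCleanupSketch busy = new NullSafeCleanupSketch();
        busy.execute("WARC/1.0".getBytes(StandardCharsets.UTF_8));
        System.out.println("busy cleanup closed a stream: " + busy.cleanup());
        System.out.println("repeated cleanup closed a stream: " + busy.cleanup());
    }
}
```

The guard only avoids the NullPointerException. It does not make cleanup() thread-safe, which would need separate handling if the same instance were cleaned up concurrently.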

sebastian-nagel commented 5 years ago

Actually, the cleanup method is called twice from different threads when running in local mode. One does not have the output stream initialized, cf. log output with the null-check applied:

67419 [SLOT_1024] INFO  o.a.s.d.worker - Shutting down executors
67419 [SLOT_1024] INFO  o.a.s.d.executor - Shutting down executor warc:[41 41]
67422 [Thread-19-warc-executor[41 41]] INFO  o.a.s.util - Async loop interrupted!
67427 [Thread-18-disruptor-executor[41 41]-send-queue] INFO  o.a.s.util - Async loop interrupted!
67430 [SLOT_1024] INFO  c.d.s.w.GzipHdfsBolt - Cleanup called on bolt
67430 [SLOT_1024] WARN  c.d.s.w.GzipHdfsBolt - Nothing to cleanup: output stream not initialized
67431 [SLOT_1024] INFO  o.a.s.d.executor - Shut down executor warc:[41 41]
...
72837 [SLOT_1027] INFO  o.a.s.d.executor - Shutting down executor warc:[42 42]
72838 [Thread-329-warc-executor[42 42]] INFO  o.a.s.util - Async loop interrupted!
72840 [Thread-328-disruptor-executor[42 42]-send-queue] INFO  o.a.s.util - Async loop interrupted!
72841 [SLOT_1027] INFO  c.d.s.w.GzipHdfsBolt - Cleanup called on bolt
72846 [SLOT_1027] INFO  o.a.s.d.executor - Shut down executor warc:[42 42]

jnioche commented 5 years ago

Merged, thanks @sebastian-nagel