sebastian-nagel closed 5 years ago
Clarification: this also happens in local mode when the WARC file has been written successfully. Possibly related: check why (in local mode?) a checksum filesystem was chosen despite the configuration:
# grep -A1 warc crawler-conf.yaml
warc:
  fs.file.impl: "org.apache.hadoop.fs.RawLocalFileSystem"
# ls -a /tmp/warc/
.crawl-20190510132330-01-00000.warc.gz.crc
crawl-20190510132330-01-00000.warc.gz
Did you specify the config key for the bolt?
- name: "withConfigKey"
  args:
    - "warc"
Re: the error in cleanup(): we could simply check that this.out is not null.
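A minimal sketch of that null check, not the actual bolt code: `LazyOutputSketch` is a hypothetical stand-in class, and a `ByteArrayOutputStream` takes the place of the real file stream. It assumes the output stream is opened lazily on the first write, which is what the NPE suggests.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a bolt that opens its output lazily. */
class LazyOutputSketch {
    // Stays null until the first record is written.
    private OutputStream out;

    void write(byte[] record) {
        try {
            if (out == null) {
                // Placeholder for opening the real output file.
                out = new ByteArrayOutputStream();
            }
            out.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the stream if one was opened; returns true if it closed anything. */
    boolean cleanup() {
        if (out == null) {
            // No record was ever written (or cleanup already ran): nothing to close.
            return false;
        }
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Reset so that a second cleanup() call is a no-op.
            out = null;
        }
        return true;
    }
}
```

Resetting the field after closing also makes a repeated cleanup() call harmless.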
Actually, the cleanup method is called twice from different threads when running in local mode. In one of the calls the output stream is not initialized, cf. the log output with the null check applied:
67419 [SLOT_1024] INFO o.a.s.d.worker - Shutting down executors
67419 [SLOT_1024] INFO o.a.s.d.executor - Shutting down executor warc:[41 41]
67422 [Thread-19-warc-executor[41 41]] INFO o.a.s.util - Async loop interrupted!
67427 [Thread-18-disruptor-executor[41 41]-send-queue] INFO o.a.s.util - Async loop interrupted!
67430 [SLOT_1024] INFO c.d.s.w.GzipHdfsBolt - Cleanup called on bolt
67430 [SLOT_1024] WARN c.d.s.w.GzipHdfsBolt - Nothing to cleanup: output stream not initialized
67431 [SLOT_1024] INFO o.a.s.d.executor - Shut down executor warc:[41 41]
...
72837 [SLOT_1027] INFO o.a.s.d.executor - Shutting down executor warc:[42 42]
72838 [Thread-329-warc-executor[42 42]] INFO o.a.s.util - Async loop interrupted!
72840 [Thread-328-disruptor-executor[42 42]-send-queue] INFO o.a.s.util - Async loop interrupted!
72841 [SLOT_1027] INFO c.d.s.w.GzipHdfsBolt - Cleanup called on bolt
72846 [SLOT_1027] INFO o.a.s.d.executor - Shut down executor warc:[42 42]
Merged, thanks @sebastian-nagel
When a (local) topology is killed and no tuples have been passed to the WARCHdfsBolt, cleanup() raises an NPE:
Of course, it's a minor issue (the topology is shut down anyway), but the bolt should either always initialize the output stream or check whether it is initialized.
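For comparison, the first option (always initializing the stream) could look roughly like the sketch below. `EagerOutputSketch` is a hypothetical class, and an in-memory stream stands in for the real output file. With a real file this would presumably create an empty output even when no tuple ever arrives, which is one argument for the null check instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a bolt that opens its output up front. */
class EagerOutputSketch {
    // Opened at construction time, so it is never null in cleanup().
    private final OutputStream out = new ByteArrayOutputStream();
    private boolean closed = false;

    void write(byte[] record) {
        try {
            out.write(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Safe to call even if write() was never called. */
    void cleanup() {
        try {
            out.close();
            closed = true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```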