apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0
887 stars 262 forks source link

HybridSpout cannot use arbitrary metadata field as PartitionField #885

Closed GerbenKD closed 3 years ago

GerbenKD commented 3 years ago

When using the HybridSpout I get cannot cast ArrayList to String exceptions when using a field from metadata as the key.

The error is in the following line:

https://github.com/DigitalPebble/storm-crawler/blob/eb735b4f704e86ccdf004ca4dee107f331fad2a4/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/persistence/HybridSpout.java#L178

Since metadata field values are always lists, this error seems to make sense.

jnioche commented 3 years ago

Fields can be multivalued but unless I am mistaken, if a single value is present then a String is returned. Any reason why you'd have more than one value for that field?

GerbenKD commented 3 years ago

Hi Julien. I posted an issue too soon, likely this is because of how we serialize the Metadata to elasticsearch. I see this is different in your implementation of the StatusUpdaterBolt. My apologies.

cheers,

Gerben

GerbenKD commented 3 years ago

Hmm actually, also when using the SDK StatusUpdaterBolt I see single values as arrays in ES.

cheers,

Gerben

jnioche commented 3 years ago

ok thanks, will have a look

jnioche commented 3 years ago

I've just added a test to try to reproduce the problem and it's not happening. I really don't think single values are returned by ES as lists. @GerbenKD could you please double check and try to reproduce the problem? Thanks

GerbenKD commented 3 years ago

Hi Julien,

Attached a screenshot of the debugger for the HybridSpout. You see that the hostname field is an ArrayList of size 1. In ES this is also stored as a List with a single element. Could this be an ES difference? We are using 7.5.2.

hybridspout

jnioche commented 3 years ago

Tested with 7.5.2, no difference. It would have been too big a change for a minor release. What's your mapping for the status index? How did you create it?

GerbenKD commented 3 years ago

Status index is created by the SDK. I wonder whether the issue is this line though:

https://github.com/DigitalPebble/storm-crawler/blob/0f327d347514ae7fe1fb0ce16693d90765d4dea6/external/elasticsearch/src/main/java/com/digitalpebble/stormcrawler/elasticsearch/persistence/StatusUpdaterBolt.java#L218

Only for metadata.key a non-array value is added to the metadata builder object. All other metadata values are String[]. I tried to use another metadata value in HybridSpout (as in the other issue I created). Hence I get the exception.

Gerben

jnioche commented 3 years ago

Managed to reproduce it in the test, thanks. Can you please try the fix in ea6d3fb?

GerbenKD commented 3 years ago

Works! Thanks!

Gerben

jnioche commented 3 years ago

great. thanks for reporting it @GerbenKD