apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

AggregationSpout error due to SimpleDateFormat not being thread-safe #809

Closed jcruzmartini closed 4 years ago

jcruzmartini commented 4 years ago

Sometimes, when the crawl is finishing and only a few URLs are pending, nextTuple() in the aggregation spout is called steadily (totally expected). If the property es.status.concurrentRequests is set to a value greater than 1 and spout.min.delay.queries is too low, you may get this error:

java.io.IOException: Unable to parse response body for Response{requestLine=POST /status*/_search?typed_keys=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&preference=_shards%3A8&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true HTTP/1.1, host=https://elasticsearch-coordinating:9200, response=HTTP/1.1 200 OK}
    at org.elasticsearch.client.RestHighLevelClient$1.onSuccess(RestHighLevelClient.java:1665) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$FailureTrackingResponseListener.onSuccess(RestClient.java:590) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:333) [stormjar.jar:?]
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:327) [stormjar.jar:?]
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:122) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:181) [stormjar.jar:?]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:448) [stormjar.jar:?]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:338) [stormjar.jar:?]
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81) [stormjar.jar:?]
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:121) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [stormjar.jar:?]
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [stormjar.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
 Caused by: java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) ~[?:1.8.0_252]
    at java.lang.Long.parseLong(Long.java:601) ~[?:1.8.0_252]
    at java.lang.Long.parseLong(Long.java:631) ~[?:1.8.0_252]
    at java.text.DigitList.getLong(DigitList.java:195) ~[?:1.8.0_252]
    at java.text.DecimalFormat.parse(DecimalFormat.java:2084) ~[?:1.8.0_252]
    at java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:1869) ~[?:1.8.0_252]
    at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1514) ~[?:1.8.0_252]
    at java.text.DateFormat.parse(DateFormat.java:364) ~[?:1.8.0_252]
    at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:258) ~[stormjar.jar:?]
    at com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout.onResponse(AggregationSpout.java:71)

After some research we realized that this error happens because 2 responses try to use the same SimpleDateFormat (SDF) instance at the same time. We tried reducing es.status.concurrentRequests to 1 and increasing spout.min.delay.queries, and the error went away. If you want, we can contribute a fix for this. We have 2 options:

  1. synchronize the use of the SDF
         synchronized (formatter) {
             mostRecentDateFound = formatter.parse(strDate);
         }
  2. use DateTimeFormatter (thread-safe) instead of SDF
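
For option 2, here is a minimal sketch of what the replacement could look like. It assumes the date strings coming back from Elasticsearch are ISO 8601 instants such as 2020-06-10T12:34:56.000Z; the actual pattern used in AggregationSpout may differ. The class and method names below are hypothetical and only for illustration:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.stream.IntStream;

// Hypothetical helper, not the actual AggregationSpout code.
final class DateParsingSketch {

    // DateTimeFormatter is immutable and thread-safe, so a single shared
    // instance can be used by concurrent response callbacks without locking.
    // Assumption: the input is an ISO 8601 instant (UTC, trailing 'Z').
    private static final DateTimeFormatter FORMATTER = DateTimeFormatter.ISO_INSTANT;

    // Parses the date string and converts it to java.util.Date, so callers
    // that still expect a Date (as SimpleDateFormat.parse returns) keep working.
    static Date parseDate(String strDate) {
        Instant instant = FORMATTER.parse(strDate, Instant::from);
        return Date.from(instant);
    }

    public static void main(String[] args) {
        // Parse the same string from many threads at once; with a shared
        // SimpleDateFormat this kind of load is what can trigger the error.
        long distinct = IntStream.range(0, 10_000)
                .parallel()
                .mapToObj(i -> parseDate("2020-06-10T12:34:56.000Z"))
                .distinct()
                .count();
        System.out.println("distinct parsed values: " + distinct);
    }
}
```

Compared with option 1, this avoids holding a lock on every response, at the cost of touching the code that consumes the parsed date.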

Extra information

jnioche commented 4 years ago

Good catch, thanks! I'd go for DateTimeFormatter - it's the modern way of doing it ;-) Do you think you could provide a PR for it?

jcruzmartini commented 4 years ago

@jnioche sure, we will open a PR soon. Thanks for your quick response.