TWCable / grabbit

Grabbit - Fast Content Sync tool for AEM/CQ
Apache License 2.0

Help with TimeoutException #195

Closed himmelmrm closed 6 years ago

himmelmrm commented 6 years ago

I frequently experience job failures due to TimeoutException.

Here's the server error:

31.10.2017 15:25:37.957 *ERROR* [192.168.5.1 [1509477849521] GET /grabbit/content HTTP/1.1] com.twcable.grabbit.server.batch.steps.jcrnodes.JcrNodesWriter Exception occurred while writing the current chunk org.apache.sling.engine.impl.helper.ClientAbortException: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 60000/60000 ms

Sometimes it takes 3 or 4 attempts to get a complete transfer.

This is happening between two systems on the same network.

Thanks!

jbornemann commented 6 years ago

Have you narrowed it down to a certain set of suspect nodes? Typically this happens when syncing batches that contain nodes of considerable size (e.g. DAM nodes). One way to manage this is to use the batchSize parameter to lower the batch size for paths containing large nodes that take time to stream.
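For anyone finding this later, a sketch of what that looks like in a Grabbit job config. This is a minimal YAML example; the host, credentials, and /content/dam path are placeholders, and the exact set of supported keys (including whether batchSize goes per-path) should be checked against the Grabbit README for your version:

```yaml
# Hypothetical Grabbit job configuration - values are illustrative
serverUsername : admin
serverPassword : admin
serverHost : source.example.com
serverPort : 4503
deltaContent : false
pathConfigurations :
  - path : /content/dam/large-assets   # placeholder path with big binaries
    batchSize : 50                     # smaller batches -> less time per chunk
```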

himmelmrm commented 6 years ago

So ... does the 60 second timeout apply to the batch, or to each individual node being transferred?

jbornemann commented 6 years ago

It all happens in a pipeline; things get processed batch by batch. One step reads data from the server until a certain number of "nodes" reaches the batch size; then it hands the batch off to the writer step (which is where you are experiencing issues).

If the reader spends too much time gathering enough nodes to satisfy a batch while the writer finishes up the previous one, the writer ends up twiddling its thumbs waiting for the next batch from the reader. If that wait exceeds 60 seconds, you will see this timeout.
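The reader/writer handoff described above can be sketched as a producer/consumer with an idle timeout on the consuming side. This is not Grabbit's actual code (the class and queue are hypothetical); it just illustrates why a slow reader makes the writer's 60-second idle timeout fire:

```java
import java.util.List;
import java.util.concurrent.*;

public class PipelineTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical hand-off queue between the reader and writer steps
        BlockingQueue<List<String>> batches = new ArrayBlockingQueue<>(1);
        long idleTimeoutMs = 60_000; // the 60 s idle timeout from the error log

        // Reader step: fills a batch and hands it off. A slow JCR read here
        // would delay the put() past the writer's timeout.
        ExecutorService reader = Executors.newSingleThreadExecutor();
        reader.submit(() -> {
            batches.put(List.of("node1", "node2"));
            return null;
        });

        // Writer step: waits for the next batch, but gives up after the idle timeout
        List<String> batch = batches.poll(idleTimeoutMs, TimeUnit.MILLISECONDS);
        if (batch == null) {
            throw new TimeoutException("Idle timeout expired: " + idleTimeoutMs + " ms");
        }
        System.out.println("Wrote batch of " + batch.size() + " nodes");
        reader.shutdown();
    }
}
```

A smaller batch size shortens the reader's fill time per batch, which keeps the writer fed and the connection busy.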

The IO timeout can't be configured currently, but you may be able to achieve the same effect by lowering the number of nodes that satisfies a batch - optimizing for latency rather than throughput.

That said, I'm sure the timeout configuration would be an appreciated improvement, PRs are welcome!

himmelmrm commented 6 years ago

Does deleteBeforeWrite factor in? With deleteBeforeWrite=true, could a large tree cause a Timeout?

sagarsane commented 6 years ago

Actually, @himmelmrm -- I think you are running into Jetty's configured timeout. I remember seeing this before when deleteBeforeWrite=true, as you alluded to.

(screenshot: Jetty timeout configuration, 2017-11-01)

It is configurable under org.apache.felix.http. It can time out if you are deleting a large path using the deleteBeforeWrite feature.
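For reference, in AEM/Sling this can be set through an OSGi configuration for the Felix Jetty HTTP service. A sketch only: the install path is illustrative, the property name and the Sling typed-value syntax (I"..." for Integer, milliseconds) should be verified against the Felix HTTP Jetty documentation for your version:

```
# crx-quickstart/install/org.apache.felix.http.config  (illustrative path)
# Raise the connection/idle timeout from the 60 s default to 5 minutes
org.apache.felix.http.timeout=I"300000"
```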

jbornemann commented 6 years ago

@sagarsane really? Doesn't deleteBeforeWrite happen before a connection is made? Either way, if you can configure the timeout, that should help.

<batch:step id="deleteBeforeWrite" next="startHttpConnection">
    <batch:tasklet ref="deleteBeforeWriteTasklet" transaction-manager="clientTransactionManager"/>
</batch:step>

himmelmrm commented 6 years ago

thanks @jbornemann and @sagarsane --- I've been able to mostly eliminate this issue by increasing the Jetty Connection Timeout value on the server side.

sagarsane commented 6 years ago

Ok great. Thanks @himmelmrm !