Clarification On Maximum Visibility Reached

bradfordlittooysonos commented 5 years ago

Hello,

First wanted to say thank you very much for creating this module. It has served our purpose extremely well.

I had a question specifically on the poller piece of code here: https://github.com/cherweg/logstash-input-s3-sns-sqs/blob/2512dbb65fe5483e4b07f927fda99dedab6cbe39/lib/logstash/inputs/sqs/poller.rb#L102

I'm noticing our logstash will periodically hit a situation where we run into the following error:

2019-11-19T08:06:07.935050356Z [2019-11-19T08:06:07,934][WARN ][logstash.inputs.s3snssqs ] [Worker b0da94f6510440794a1d9ee3a937f01d83f608028ae8bbe292173e933857720b/0/extender/um/19/11/19/U-1574143845000-task_201911190610_2741_m_000000-part-00000.gz] Extended visibility for a long running message {:visibility=>7878.0} 2019-11-19T08:08:02.00981826Z [2019-11-19T08:08:02,009][WARN ][logstash.inputs.s3snssqs ] [Worker b0da94f6510440794a1d9ee3a937f01d83f608028ae8bbe292173e933857720b/0/extender/um/19/11/19/U-1574143845000-task_201911190610_2741_m_000000-part-00000.gz] Extended visibility for a long running message {:visibility=>7992.0} 2019-11-19T08:08:02.010975323Z [2019-11-19T08:08:02,010][ERROR][logstash.inputs.s3snssqs ] [Worker b0da94f6510440794a1d9ee3a937f01d83f608028ae8bbe292173e933857720b/0/extender/um/19/11/19/U-1574143845000-task_201911190610_2741_m_000000-part-00000.gz] Maximum visibility reached! We will delete this message from queue!

After we hit the following line Maximum visibility reached! We will delete this message from queue!, logstash will essentially remain alive but stop processing any new data/messages.

I think this is driven by the following exception: https://github.com/cherweg/logstash-input-s3-sns-sqs/blob/2512dbb65fe5483e4b07f927fda99dedab6cbe39/lib/logstash/inputs/sqs/poller.rb#L102 which would kill the thread? (Apologies if I'm misunderstanding this as I'm not extremely proficient in ruby)

Is this the intended behavior?

Our current settings are set to the default: config :visibility_timeout, :validate => :number, :default => 120 config :max_processing_time, :validate => :number, :default => 8000 Outside of raising our max_processing_time to a higher number, Is there a way we could simply put the message back into the queue and continue processing a new message? Do the symptoms presented seem to align with the behavior I am seeing?

Thank you very much

christianherweg0807 commented 5 years ago

ok, when implementing these feature i thought about the reason why files don´t come back from the download->unzip->yield line by line->event.set loop. Mostly there is something wrong with encoding or line endings. The reason why i drop them is, that i don´t beleave that these files would be processed better next time.

Am I right?

christianherweg0807 commented 5 years ago

-> logstash will essentially remain alive but stop processing any new data/messages. Such was not my intention. I´ve to look in the thread machine.

christianherweg0807 commented 4 years ago

Here is a first idea of restarting "dead" threads. If we raise an exception the thread.status is nil. So if there is no stop? we could replace this thread: https://github.com/cherweg/logstash-input-s3-sns-sqs/commit/a257995e743813ed430aad5885b18abc2904a729

christianherweg0807 commented 4 years ago

Any comments to : https://github.com/cherweg/logstash-input-s3-sns-sqs/issues/32#issuecomment-559223132 ?

bradfordlittooysonos commented 4 years ago

Thanks for the replies, and apologies for the delay in response.

I response to https://github.com/cherweg/logstash-input-s3-sns-sqs/issues/32#issuecomment-559223132 , I think the first half assumption is valid (The file is corrupt and can't be processed), but for our case, we were eventually able to process the file on retry when we rebooted logstash. From my perspective, this corrupt file issue seemed to be a bug on AWS S3 side and was related to eventual consistency with objs in s3. (I think this would explain why we couldn't read the obj the first go-around but were able to the second go-around)

Interestingly enough this issue has totally subsided since November 22, making me think that AWS patched the issue S3. Before, we were seeing roughly 2-3 pods fail (out of the 10 pods we run) per day. A-lot of changes on AWS seem to have gone out in the last 3 weeks with re-invent going on so I wouldn't be totally surprised if an issue

So assuming aws fixed this eventual consistency issue, I do think that https://github.com/cherweg/logstash-input-s3-sns-sqs/issues/32#issuecomment-559223132 is now totally valid.

Given that though, I still think there is a good reason to add logic to reboot the thread in the event it dies and the solution in https://github.com/cherweg/logstash-input-s3-sns-sqs/commit/a257995e743813ed430aad5885b18abc2904a729 looks good to me to start.

christianherweg0807 commented 4 years ago

Hey. Just release 2.1.0. https://github.com/cherweg/logstash-input-s3-sns-sqs/blob/7c23c2d607fc8f1bdf0f88f594985deb409fb71d/lib/logstash/inputs/s3snssqs.rb#L280-L293

This is a little bit more complex, but nessasary.

enjoy Christian

bradfordlittooysonos commented 4 years ago

Thanks @christianherweg0807 I'll upgrade the connector on our end.

bradfordlittooysonos commented 4 years ago

I just wanted to follow up and let you know that I've implemented this change on our end in production and it was working great. Thanks again.

zeph commented 4 years ago

I'm running on 2.1.1 and stumbled on this... https://github.com/cherweg/logstash-input-s3-sns-sqs/issues/42

are they related? seems so

zeph commented 4 years ago

same happening on 2.1.0

cherweg / logstash-input-s3-sns-sqs

Clarification On Maximum Visibility Reached #32