Closed Aenima4six2 closed 7 years ago
Reproing this issue seems like low hanging fruit. Is there a reason its on the back burner?
@Aenima4six2 The reason is lack of time and resources.
Respectfully, I've seen a number of UI fixes and other small issues get addressed in the time this was posted. Might want to reassess the priority of developer resources, as this might be a defect in one of the core areas of the app, and is pretty easy to test considering whats been given to repro.
@Aenima4six2 Thank you for the detailed information to test the issue you have!
I noticed that your test script can produce wrong results because it is waiting for a fixed amount of time (10 seconds) before running the validation search. To fix this, I created a PR for your test script in https://github.com/Aenima4six2/graylog-delivery-ratio/pull/1 to actually wait until all messages have been processed in Graylog.
With that PR I cannot reproduce dropped messages when sending messages via TCP. Can you please run your tests again with the improved test script? Thank you!
Test result:
I'll run it with the updated code and report back.
Sent from my iPhone
On Sep 14, 2017, at 12:07 PM, Bernd Ahlers notifications@github.com wrote:
@aenima4six2 Thank you for the detailed information to test the issue you have!
I noticed that your test script can produce wrong results because it is waiting for a fixed amount of time (10 seconds) before running the validation search. To fix this, I created a PR for your test script in Aenima4six2/graylog-delivery-ratio#1 to actually wait until all messages have been processed in Graylog.
With that PR I cannot reproduce dropped messages when sending messages via TCP. Can you please run your tests again with the improved test script? Thank you!
python3 test_delivery_ratio.py -t 1000000 -m TCP -T 50 -v -H 172.17.0.1
[2017-09-14 15:58:33.963640] Execution arguments {'api_port': 9000, 'es_refresh_interval': 15, 'host': '172.17.0.1', 'log_send_port': 12201, 'mode': ['TCP'], 'password': 'redacted', 'protocol': 'http', 'threads': 50, 'throttle': 0, 'total_requests': 1000000, 'username': 'admin', 'verbosity': 1} [2017-09-14 15:58:33.964141] Sending 1000000 requests with TCP for group a95cd2e8-45ac-4dff-bc6d-a60bf176157a [2017-09-14 15:58:33.964170] Sending in multi threaded mode with 50 threads [2017-09-14 15:58:34.211098] TCP Send progress: 1% (40493 msg/s) - [10000] sent - [0.22120356559753418] sec [2017-09-14 15:58:34.393312] TCP Send progress: 2% (46602 msg/s) - [20000] sent - [0.40442728996276855] sec [...] [2017-09-14 15:58:51.773664] TCP Send progress: 97% (54465 msg/s) - [970000] sent - [17.787083387374878] sec [2017-09-14 15:58:51.957894] TCP Send progress: 98% (54463 msg/s) - [980000] sent - [17.966859102249146] sec [2017-09-14 15:58:52.143800] TCP Send progress: 99% (54456 msg/s) - [990000] sent - [18.156416416168213] sec [2017-09-14 15:58:52.317824] TCP Send progress: 100% (54484 msg/s) - [1000000] sent - [18.324571132659912] sec [2017-09-14 15:58:52.317965] Sent [1000000] requests with TCP in [18.353821277618408] sec (54484 msg/s) [2017-09-14 15:58:52.325125] Progress: journal-size=876589 output-throughput=6500.0 [2017-09-14 15:58:54.333208] Progress: journal-size=903500 output-throughput=11000.0 [2017-09-14 15:58:56.341206] Progress: journal-size=882500 output-throughput=11000.0 [2017-09-14 15:58:58.349311] Progress: journal-size=861000 output-throughput=10239.0 [2017-09-14 15:59:00.359778] Progress: journal-size=840000 output-throughput=10500.0 [2017-09-14 15:59:02.367867] Progress: journal-size=821000 output-throughput=8500.0 [2017-09-14 15[2017-09-14 15:58:51.773664] TCP Send progress: 97% (54465 msg/s) - [970000] sent - [17.787083387374878] sec [2017-09-14 15:58:51.957894] TCP Send progress: 98% (54463 msg/s) - [980000] sent - [17.966859102249146] sec [2017-09-14 15:58:52.143800] TCP Send progress: 99% (54456 msg/s) - [990000] sent - [18.156416416168213] sec [2017-09-14 15:58:52.317824] TCP Send progress: 100% (54484 msg/s) - [1000000] sent - [18.324571132659912] sec [2017-09-14 15:58:52.317965] Sent [1000000] requests with TCP in [18.353821277618408] sec (54484 msg/s) [2017-09-14 15:58:52.325125] Progress: journal-size=876589 output-throughput=6500.0 [2017-09-14 15:58:54.333208] Progress: journal-size=903500 output-throughput=11000.0 [2017-09-14 15:58:56.341206] Progress: journal-size=882500 output-throughput=11000.0 [2017-09-14 15:58:58.349311] Progress: journal-size=861000 output-throughput=10239.0 [2017-09-14 15:59:00.359778] Progress: journal-size=840000 output-throughput=10500.0 [2017-09-14 15:59:02.367867] Progress: journal-size=821000 output-throughput=8500.0 [2017-09-14 15:59:04.376609] Progress: journal-size=803500 output-throughput=8265.0 [2017-09-14 15:59:06.386106] Progress: journal-size=785000 output-throughput=10000.0 [2017-09-14 15:59:08.394732] Progress: journal-size=764500 output-throughput=10500.0 [2017-09-14 15:59:10.403828] Progress: journal-size=745000 output-throughput=9500.0 [2017-09-14 15:59:12.412626] Progress: journal-size=725000 output-throughput=11000.0 [2017-09-14 15:59:14.421021] Progress: journal-size=704500 output-throughput=9500.0 [2017-09-14 15:59:16.429375] Progress: journal-size=686500 output-throughput=8500.0 [2017-09-14 15:59:18.439488] Progress: journal-size=668500 output-throughput=11500.0 [2017-09-14 15:59:20.448289] Progress: journal-size=649000 output-throughput=9500.0 [2017-09-14 15:59:22.458166] Progress: journal-size=638000 output-throughput=2500.0 [2017-09-14 15:59:24.466258] Progress: journal-size=617000 output-throughput=11000.0 [2017-09-14 15:59:26.475315] Progress: journal-size=598500 output-throughput=8000.0 [2017-09-14 15:59:28.493576] Progress: journal-size=580500 output-throughput=11000.0 [2017-09-14 15:59:30.502702] Progress: journal-size=559000 output-throughput=10500.0 [2017-09-14 15:59:32.511697] Progress: journal-size=538000 output-throughput=10000.0 [2017-09-14 15:59:34.520938] Progress: journal-size=518000 output-throughput=9407.0 [2017-09-14 15:59:36.529293] Progress: journal-size=499500 output-throughput=7000.0 [2017-09-14 15:59:38.556989] Progress: journal-size=472000 output-throughput=10000.0 [2017-09-14 15:59:40.567236] Progress: journal-size=453500 output-throughput=9000.0 [2017-09-14 15:59:42.574702] Progress: journal-size=434500 output-throughput=10000.0 [2017-09-14 15:59:44.581214] Progress: journal-size=416500 output-throughput=9500.0 [2017-09-14 15:59:46.591113] Progress: journal-size=399500 output-throughput=7532.0 [2017-09-14 15:59:48.599561] Progress: journal-size=384500 output-throughput=7000.0 [2017-09-14 15:59:50.611036] Progress: journal-size=366000 output-throughput=9409.0 [2017-09-14 15:59:52.619467] Progress: journal-size=348500 output-throughput=8500.0 [2017-09-14 15:59:54.626403] Progress: journal-size=330500 output-throughput=9500.0 [2017-09-14 15:59:56.633153] Progress: journal-size=313500 output-throughput=8637.0 [2017-09-14 15:59:58.641964] Progress: journal-size=295000 output-throughput=9497.0 [2017-09-14 16:00:00.650777] Progress: journal-size=276500 output-throughput=10000.0 [2017-09-14 16:00:02.661954] Progress: journal-size=258000 output-throughput=9500.0 [2017-09-14 16:00:04.668485] Progress: journal-size=239500 output-throughput=8500.0 [2017-09-14 16:00:06.676681] Progress: journal-size=221000 output-throughput=9000.0 [2017-09-14 16:00:08.684790] Progress: journal-size=201500 output-throughput=9500.0 [2017-09-14 16:00:10.692096] Progress: journal-size=182000 output-throughput=9500.0 [2017-09-14 16:00:12.699939] Progress: journal-size=173000 output-throughput=5000.0 [2017-09-14 16:00:14.708850] Progress: journal-size=151500 output-throughput=10500.0 [2017-09-14 16:00:16.715184] Progress: journal-size=136500 output-throughput=5000.0 [2017-09-14 16:00:18.720927] Progress: journal-size=122000 output-throughput=11000.0 [2017-09-14 16:00:20.726616] Progress: journal-size=102500 output-throughput=9500.0 [2017-09-14 16:00:22.736515] Progress: journal-size=82000 output-throughput=10500.0 [2017-09-14 16:00:24.742558] Progress: journal-size=60500 output-throughput=11000.0 [2017-09-14 16:00:26.749938] Progress: journal-size=45500 output-throughput=5000.0 [2017-09-14 16:00:28.759238] Progress: journal-size=23500 output-throughput=11423.0 [2017-09-14 16:00:30.780972] Progress: journal-size=8000 output-throughput=5500.0 [2017-09-14 16:00:32.800092] Waiting for another 20s for the index refresh... [2017-09-14 16:00:52.820542] Validating sent data -> http://172.17.0.1:9000/api/search/universal/relative?query=%22a95cd2e8-45ac-4dff-bc6d-a60bf176157a%22&range=0&limit=1 [2017-09-14 16:00:52.934723] http://172.17.0.1:9000/api/search/universal/relative?query=%22a95cd2e8-45ac-4dff-bc6d-a60bf176157a%22&range=0&limit=1 request execution time 0.111564 [2017-09-14 16:00:52.934772] [1000000] requests generated - [1000000] sent - [1000000] (100%) validated [2017-09-14 16:00:52.934927] Test completed in [138.9706678390503] sec :59:04.376609] Progress: journal-size=803500 output-throughput=8265.0 [2017-09-14 15:59:06.386106] Progress: journal-size=785000 output-throughput=10000.0 [2017-09-14 15:59:08.394732] Progress: journal-size=764500 output-throughput=10500.0 [2017-09-14 15:59:10.403828] Progress: journal-size=745000 output-throughput=9500.0 [2017-09-14 15:59:12.412626] Progress: journal-size=725000 output-throughput=11000.0 [2017-09-14 15:59:14.421021] Progress: journal-size=704500 output-throughput=9500.0 [2017-09-14 15:59:16.429375] Progress: journal-size=686500 output-throughput=8500.0 [2017-09-14 15:59:18.439488] Progress: journal-size=668500 output-throughput=11500.0 [2017-09-14 15:59:20.448289] Progress: journal-size=649000 output-throughput=9500.0 [2017-09-14 15:59:22.458166] Progress: journal-size=638000 output-throughput=2500.0 [2017-09-14 15:59:24.466258] Progress: journal-size=617000 output-throughput=11000.0 [2017-09-14 15:59:26.475315] Progress: journal-size=598500 output-throughput=8000.0 [2017-09-14 15:59:28.493576] Progress: journal-size=580500 output-throughput=11000.0 [2017-09-14 15:59:30.502702] Progress: journal-size=559000 output-throughput=10500.0 [2017-09-14 15:59:32.511697] Progress: journal-size=538000 output-throughput=10000.0 [2017-09-14 15:59:34.520938] Progress: journal-size=518000 output-throughput=9407.0 [2017-09-14 15:59:36.529293] Progress: journal-size=499500 output-throughput=7000.0 [2017-09-14 15:59:38.556989] Progress: journal-size=472000 output-throughput=10000.0 [2017-09-14 15:59:40.567236] Progress: journal-size=453500 output-throughput=9000.0 [2017-09-14 15:59:42.574702] Progress: journal-size=434500 output-throughput=10000.0 [2017-09-14 15:59:44.581214] Progress: journal-size=416500 output-throughput=9500.0 [2017-09-14 15:59:46.591113] Progress: journal-size=399500 output-throughput=7532.0 [2017-09-14 15:59:48.599561] Progress: journal-size=384500 output-throughput=7000.0 [2017-09-14 15:59:50.611036] Progress: journal-size=366000 output-throughput=9409.0 [2017-09-14 15:59:52.619467] Progress: journal-size=348500 output-throughput=8500.0 [2017-09-14 15:59:54.626403] Progress: journal-size=330500 output-throughput=9500.0 [2017-09-14 15:59:56.633153] Progress: journal-size=313500 output-throughput=8637.0 [2017-09-14 15:59:58.641964] Progress: journal-size=295000 output-throughput=9497.0 [2017-09-14 16:00:00.650777] Progress: journal-size=276500 output-throughput=10000.0 [2017-09-14 16:00:02.661954] Progress: journal-size=258000 output-throughput=9500.0 [2017-09-14 16:00:04.668485] Progress: journal-size=239500 output-throughput=8500.0 [2017-09-14 16:00:06.676681] Progress: journal-size=221000 output-throughput=9000.0 [2017-09-14 16:00:08.684790] Progress: journal-size=201500 output-throughput=9500.0 [2017-09-14 16:00:10.692096] Progress: journal-size=182000 output-throughput=9500.0 [2017-09-14 16:00:12.699939] Progress: journal-size=173000 output-throughput=5000.0 [2017-09-14 16:00:14.708850] Progress: journal-size=151500 output-throughput=10500.0 [2017-09-14 16:00:16.715184] Progress: journal-size=136500 output-throughput=5000.0 [2017-09-14 16:00:18.720927] Progress: journal-size=122000 output-throughput=11000.0 [2017-09-14 16:00:20.726616] Progress: journal-size=102500 output-throughput=9500.0 [2017-09-14 16:00:22.736515] Progress: journal-size=82000 output-throughput=10500.0 [2017-09-14 16:00:24.742558] Progress: journal-size=60500 output-throughput=11000.0 [2017-09-14 16:00:26.749938] Progress: journal-size=45500 output-throughput=5000.0 [2017-09-14 16:00:28.759238] Progress: journal-size=23500 output-throughput=11423.0 [2017-09-14 16:00:30.780972] Progress: journal-size=8000 output-throughput=5500.0 [2017-09-14 16:00:32.800092] Waiting for another 20s for the index refresh... [2017-09-14 16:00:52.820542] Validating sent data -> http://172.17.0.1:9000/api/search/universal/relative?query=%22a95cd2e8-45ac-4dff-bc6d-a60bf176157a%22&range=0&limit=1 [2017-09-14 16:00:52.934723] http://172.17.0.1:9000/api/search/universal/relative?query=%22a95cd2e8-45ac-4dff-bc6d-a60bf176157a%22&range=0&limit=1 request execution time 0.111564 [2017-09-14 16:00:52.934772] [1000000] requests generated - [1000000] sent - [1000000] (100%) validated [2017-09-14 16:00:52.934927] Test completed in [138.9706678390503] sec — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@bernd, that was the issue. Thanks so much for looking at this.
@Aenima4six2 Thank you for the feedback, I am glad it works for you now!
When messages are sent at a rate (presumably too high for graylog to process), messages are dropped without any kind of client side errors occurring. Tests cases were performed on multiple dockerized instances of graylog server using the following test script: https://github.com/Aenima4six2/graylog-delivery-ratio
Expected Behavior
Using a TCP input should guarantee any messages sent should be successfully delivered and processed or an error should occur at the client end.
Current Behavior
When high rate of messages are sent to Graylog server, the client is informed that all packets (thus messages) were received, however, the Graylog journal and Elasticsearch index show otherwise (IE: 10000 messages sent and 9800 in the journal).
Possible Solution
Possibly rate limit or throttle incoming TCP/UDP connections such that all can be processed in time without timeouts or dropped messages.
Steps to Reproduce (for bugs)
python3 test_delivery_ratio.py -t 10000 -m TCP -T 1 -v
Context
This issue requires us to use GELF over HTTP, which is inherently slower than TCP. Note: Also tried raising the TCP receive buffer to 50 megabytes but problem still persists. Problem seems to be at its worse when graylog is under heavy load and concurrent TCP sends are in progress.
Your Environment
Using the following docker compose file