RFC: All new logging/tracing system

AndrewDryga commented 7 years ago

Logging got few issues and that starts to worry me:

We need a good way to sanitize logs from sensitive data;
To log request and response we need to have them as a binary data, which works well for small json that can be stored in a request context, but really bad for all other use cases;
PostgreSQL is a bad place to store logs;
The real usage demand is more about "I want to trace everything that happened in my system" rather than "I want to know what the response and request were like".

I guess it is time to reconsider this part of Annon to be a global adapter-based tracing module, which can keep working as is (which is okay for small deployments) and get much more features for production systems: send data to SaaS services, other data stores - Elasticsearch or Cassandra, tracing systems - DataDog APM, NewRelic APM, AppSignal or Scout.

Adapter API would be large and demanding, so not every service would be able to be integrated, but that is okay since most of the issues above would have an approach to solve them.

AndrewDryga commented 7 years ago

Might benefit from using: https://github.com/zachdaniel/spandex

AndrewDryga commented 7 years ago

Related to: #209

AndrewDryga commented 7 years ago

https://github.com/derailed/ex_ray

AndrewDryga commented 6 years ago

Writing logs to ProstreSQL under load is obviously inefficient, by default logs are written from an async process, but it's message queue gets backed up pretty fast.

Even worse, because the queue is so long, processes that make calls to that queue are punished by BEAM backpressure mechanism (number of reductions taken for a message passing is proportional to queue length). So deviation for incoming connection grows because processes that handle them are suspended. I believe, in next OTP versions, it's even planned to get rid of this backpressure.

Some data (with async log writer):

ab -p ./post_loc.txt -l -n 100000 -c 100 http://127.0.0.1:4000/httpbin/post 
This is ApacheBench, Version 2.3 <$Revision: 1796539 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        gunicorn/19.6.0
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /httpbin/post
Document Length:        Variable

Concurrency Level:      100
Time taken for tests:   475.071 seconds
Complete requests:      100000
Failed requests:        0
Non-2xx responses:      621
Total transferred:      1104451167 bytes
Total body sent:        406500000
HTML transferred:       1069272902 bytes
Requests per second:    210.49 [#/sec] (mean)
Time per request:       475.071 [ms] (mean)
Time per request:       4.751 [ms] (mean, across all concurrent requests)
Transfer rate:          2270.32 [Kbytes/sec] received
                        835.61 kb/s sent
                        3105.93 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   36 205.5      0    7569
Processing:     7  438 204.1    433    4249
Waiting:        6  437 204.0    432    4248
Total:          7  475 266.8    448    7604

Percentage of the requests served within a certain time (ms)
  50%    448
  66%    501
  75%    540
  80%    569
  90%    679
  95%    881
  98%   1186
  99%   1505
 100%   7604 (longest request)

With sync log writer:

config :annon_api, :plugin_pipeline,
  default_features: [:log_consistency]

Message queues are empty (almost always 0) and memory consumption is under 100 Mb.

ab -p ./post_loc.txt -l -n 100000 -c 100 http://127.0.0.1:4000/httpbin/post 
This is ApacheBench, Version 2.3 <$Revision: 1796539 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        gunicorn/19.6.0
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /httpbin/post
Document Length:        Variable

Concurrency Level:      100
Time taken for tests:   487.047 seconds
Complete requests:      100000
Failed requests:        0
Non-2xx responses:      443
Total transferred:      1106356543 bytes
Total body sent:        406500000
HTML transferred:       1071172048 bytes
Requests per second:    205.32 [#/sec] (mean)
Time per request:       487.047 [ms] (mean)
Time per request:       4.870 [ms] (mean, across all concurrent requests)
Transfer rate:          2218.32 [Kbytes/sec] received
                        815.06 kb/s sent
                        3033.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   24 151.0      0    2265
Processing:    15  463 188.1    444    2339
Waiting:       15  460 187.4    442    2321
Total:         16  487 224.0    450    3230

Percentage of the requests served within a certain time (ms)
  50%    450
  66%    502
  75%    548
  80%    584
  90%    699
  95%    870
  98%   1158
  99%   1361
 100%   3230 (longest request)

Even tail-case latency is better now.

AndrewDryga commented 6 years ago

https://opencensus.io/

Nebo15 / annon.api

RFC: All new logging/tracing system #259