Qabel / qabel-drop

(B2C) :love_letter: Qabel messaging server

FCM support #60

Closed · enkore closed this 8 years ago

enkore commented 8 years ago

Fixes #48

enkore commented 8 years ago

This branch has conflicts that must be resolved. Sigh.

enkore commented 8 years ago

FCM works, API-wise. It uses topic messages, so message routing is handled by FCM and no explicit routing or subscription bookkeeping is required on our side. The app just subscribes to the drop IDs it wants and that should be it. Note that FCM has no limit on the number of topic subscriptions.

The data message has two key-value pairs: drop-id carries the drop ID, message the drop message. Drop messages are base64-encoded, which makes them 3862 bytes long; with some minor overhead they still fit just inside the 4 KB payload limit. Testing showed that the API accepts this.
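For reference, the sending side boils down to roughly this. A minimal sketch assuming pyfcm as the client (it provides notify_topic_subscribers, which is what the metric below is named after); the API key and the helper name are placeholders:

```python
import base64

from pyfcm import FCMNotification  # assumption: pyfcm is the FCM client in use

push_service = FCMNotification(api_key="<server key>")  # placeholder

def notify_drop(drop_id, message_bytes):
    # One FCM topic per drop ID; clients subscribe to the topics they care about.
    data = {
        "drop-id": drop_id,
        # base64 so the binary drop message survives as a plain string value
        "message": base64.b64encode(message_bytes).decode("ascii"),
    }
    return push_service.notify_topic_subscribers(
        topic_name=drop_id,
        data_message=data,
    )
```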

The FCM API looks kinda slow from where I'm standing. Observe this histogram:

notify_fcm_duration_bucket{le="0.25"} 0.0
notify_fcm_duration_bucket{le="0.5"} 3.0
notify_fcm_duration_bucket{le="0.75"} 4.0
notify_fcm_duration_bucket{le="1.0"} 5.0

The pure API calls block the worker process for ~0.5 s.
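The numbers above come from timing the call itself, roughly like this; a sketch assuming prometheus_client, with illustrative bucket boundaries and notify_drop being the helper from the sketch above:

```python
from prometheus_client import Histogram

# assumption: this mirrors how notify_fcm_duration is collected
NOTIFY_FCM_DURATION = Histogram(
    "notify_fcm_duration",
    "Duration of FCM topic notifications",
    buckets=(0.25, 0.5, 0.75, 1.0, 2.5, 5.0, float("inf")),
)

def notify_drop_timed(drop_id, message_bytes):
    # .time() measures the wall-clock duration of the block,
    # i.e. how long the blocking FCM API call holds up the worker.
    with NOTIFY_FCM_DURATION.time():
        return notify_drop(drop_id, message_bytes)
```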

WebSocket support via dwr didn't work reliably at any point: it either sporadically lost messages (I was unable to debug why), deadlocked the worker (connection state doesn't seem to be handled correctly in dwr), or just didn't work at all. At this point I'm almost considering moving the entire server to an asynchronous base that supports WS out of the box (say, tornado, which we're using to great success in -block). That would also have the advantage that an FCM call would only block a single coroutine, not the entire process (assuming the client is monkey-patched to use an AIO HTTP client). Because at this rate we are talking about 4-5 drops posted per second per worker, and that already assumes the FCM API is twice as fast from a real server as it is here. [It is also perfectly possible that Google heavily throttles this API for residential internet connections to limit API abuse.]
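To make the "only block a single coroutine" point concrete, a sketch of the general idea (not tied to tornado; the pool size is arbitrary and notify_drop is the placeholder helper from above): the blocking FCM call is pushed onto a thread pool, so only the coroutine that awaits it is held up while the event loop keeps serving other requests.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# assumption: pool sized to how many concurrent FCM calls we want to allow
fcm_pool = ThreadPoolExecutor(max_workers=8)

async def notify_drop_async(drop_id, message_bytes):
    loop = asyncio.get_event_loop()
    # The blocking FCM client call runs on a pool thread;
    # only this coroutine waits for it, not the whole process.
    return await loop.run_in_executor(
        fcm_pool, notify_drop, drop_id, message_bytes
    )
```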

enkore commented 8 years ago
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.005"} 4.0
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.01"} 862.0
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.025"} 1488.0
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.05"} 1581.0
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.075"} 1588.0
django_http_requests_latency_including_middlewares_seconds_bucket{le="0.1"} 1590.0

process_cpu_seconds_total 23.91

I can live with that for now: roughly 70 posts/second/worker (≈1590 requests over ~24 process CPU seconds). Not great, but it might look a bit better if the load were distributed over more than one drop.

btw. Summary doesn't seem to generate the quantiles/buckets here:

# HELP notify_fcm_api Number and duration of notify_topic_subscribers FCM API calls
# TYPE notify_fcm_api summary
notify_fcm_api_count{exception="None"} 1587.0
notify_fcm_api_sum{exception="None"} 1439.153717796944

13:01:06 up 17:08, 9 users, load average: 100.74, 63.47, 25.41 :)

enkore commented 8 years ago

btw. Summary doesn't seem to generate the quantiles/buckets here:

Prometheus docs:

First of all, check the library support for histograms and summaries. Full support for both currently only exists in the Go client library. Many libraries support only one of the two types, or they support summaries only in a limited fashion (lacking quantile calculation).

I guess we go back to a Counter/Histogram combination, or maybe just implement quantile calculation ourselves. It would be useful.
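A rough sketch of what the Counter/Histogram combination could look like with prometheus_client (metric and label names just mirror the existing ones; notify_drop is the placeholder helper from above):

```python
from prometheus_client import Counter, Histogram

NOTIFY_FCM_API_CALLS = Counter(
    "notify_fcm_api_calls",
    "Number of notify_topic_subscribers FCM API calls",
    ["exception"],
)
NOTIFY_FCM_API_DURATION = Histogram(
    "notify_fcm_api_duration",
    "Duration of notify_topic_subscribers FCM API calls",
)

def notify_with_metrics(drop_id, message_bytes):
    exception = "None"
    try:
        # Histogram gives us buckets server-side, which the Python client's
        # Summary does not (no quantile calculation).
        with NOTIFY_FCM_API_DURATION.time():
            return notify_drop(drop_id, message_bytes)
    except Exception as exc:
        exception = type(exc).__name__
        raise
    finally:
        NOTIFY_FCM_API_CALLS.labels(exception=exception).inc()
```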