PaulSonOfLars / gotgbot

Autogenerated Go wrapper for the telegram API. Inspired by the python-telegram-bot library.
MIT License
511 stars 119 forks source link

Use io.Pipe for file uploads #194

Closed HirbodBehnam closed 2 months ago

HirbodBehnam commented 2 months ago

What

TL;DR: Use io.Pipe instead of buffering files in bytes.Buffer to reduce the memory usage of file uploads.

Before this commit, if there was a file(s) to be uploaded to Telegram, it would have been buffered in memory in a bytes.Buffer and then send to Telegram servers. This would have posed problems if you were uploading media groups because those can have up to ten files and thus, increase the RAM usage of this library.

The idea here is to asynchronously read from files and write them to the HTTP request. This can be achieved by using a goroutine and io.Pipe. As you can probably tell from the name, io.Pipe is basically a pipe which whatever is written to the writer part can be read from the reader part. However, if a data is written to pipe, it must be read before the next data (or data chunk) can be written again. Because io.Copy is used to copy the content of file to the multipart.Writer, the buffer of file content is 32KB and thus the memory usage of file buffering is around 32KB.

To execute this idea, we should create a io.Pipe, give the reader part to the http.NewRequestWithContext and the writer part to multipart.NewWriter. This means that another goroutine can write to the pipe (which is tied to multipart.Writer) and the request can read the content. This is exactly what I've done. As you can see, I've just ran fillBuffer in another goroutine. Also, FormDataContentType can be ran as soon as the multipart.Writer is created (unless the boundary is manually changed which is not the case here) which I've also done. The last thing which is left is error handling of the multipart.Writer. This can be done with PipeWriter.CloseWithError. If the writer part is closed with an error, the error will be sent to the reader part and the bot.Client.Do(req) will catch the error and print it with failed to execute POST request to ... error message. You also need to also defer the reader side closure to avoid the writer goroutine leak.

One other very small change I did was to swap the JSON marshaling in bytes.Buffer with fusing json.Marshal and bytes.Reader. Internally, json.Marshal uses a pool of bytes.Buffers and tries to reuse them as much as it can. Creating a new bytes.Buffer every time instead of using the buffer pool I think is a little bit slower. But if you want, I can simply use the old bytes.Buffer and json.NewEncoder.

Let me know what you think and I would like to thank you for the very niche idea you had to automate the code generation based on Telegram API docs!

Impact

HirbodBehnam commented 2 months ago

Thank you very much for the kind words! Also that's a good idea. Let me try this with my RedditDownloaderBot to see how it goes...

Edit: Now the bot is live with this pull request! I just added this line to go.mod and compiled it again:

replace github.com/PaulSonOfLars/gotgbot/v2 => github.com/HirbodBehnam/gotgbot/v2 v2.0.0-20240906045508-c060fa7cd0e3
PaulSonOfLars commented 2 months ago

Thank you very much for the kind words! Also that's a good idea. Let me try this with my RedditDownloaderBot to see how it goes...

Edit: Now the bot is live with this pull request! I just added this line to go.mod and compiled it again:

replace github.com/PaulSonOfLars/gotgbot/v2 => github.com/HirbodBehnam/gotgbot/v2 v2.0.0-20240906045508-c060fa7cd0e3

Exciting! Would it be possible for you to check your logs tomorrow and report back if there are any issues?

This may be me being overly cautious, so I apologise for that - I usually run most PRs on Rose for a few days before merging, but she doesn't do much file handling, so you seem to be better placed here!

HirbodBehnam commented 2 months ago

So I've got this running for about a day and half and I think it's good. So first things first, I used journalctl --since "2 days ago" -u redditdownloaderbot.service | grep "failed to execute POST request to" to check if there is any errors regarding file upload. Apparently, there are errors regarding file upload but the reason for that is the I've set the timeout of file uploads too low and thus, the logs are filled with context deadline exceeded. But no errors regarding the pipe stuff. Also, I had a look at Grafana charts to see the memory usage and there are clearly less spikes in memory usage compared to before this patch. (Also bear in mind that there are other services running on this server and time zone is in Asia/Tehran) image As you can see, after I deployed this version, the number of spikes has reduced.

So what do you think? Shall I do other experiments as well?

PaulSonOfLars commented 2 months ago

No, I'm all convinced - looks like a merge to me!

Really appreciate in-depth debugging detail, and the grafana dashboard you've provided - definitely shows an impressive improvement.

Thanks again for the excellent PR, and don't hesitate to open more PRs/issues if you've got any other suggestions in the future!