Streaming data - Githubissues

wazhar commented 7 years ago

Is there a way to stream the "DATA command" payload? Looking at the code, it seems the data is kept in an in-memory buffer and then passed on to the backend processor.

Keeping data in memory gives you better performance, but it could become a problem if you have too many connections or large payloads.

Any suggestions?

Thanks.

flashmob commented 7 years ago

Yes. That could be the next step in the evolution of the software.

At present, on Guerrilla Mail, there is no memory problem yet. One limitation, however, is that it can't support very large attachments, and streaming would solve that. Although I'm not sure we want to support very large attachments...

The main mail server has 128 GB of RAM and this server software only uses a small fraction of that. It can also be upgraded to more RAM in the future. Note that the server does make sure to recycle allocated RAM, so at least it doesn't abuse the garbage collector.

Another problem is that SMTP is like a transaction, and you can never assume that the transaction will succeed. For example, you might stream and save each chunk, but suddenly the connection gets terminated. This means you might need to roll back everything you've saved. You will also need to implement this logic in the processors somehow. Also, the first few chunks of the stream would need to be processed to parse the headers.

Another pitfall is that streaming might cause excessive I/O operations. Perhaps the chunk size should be configurable.

Streaming could either be implemented via a buffered pipe or a buffered channel; not sure which one to choose yet.

What kind of use case do you have? Are you finding the server will use too much RAM for your use case? How many connections do you expect?

Edit: Currently it should comfortably cruise with around 150 simultaneous connections, 100k emails per hour, with about 500 MB allocated, after running for a week. Although our max message size is 1 MB...

wazhar commented 7 years ago

Yes. In my use case the server is not dedicated to SMTP transactions only. It's a compliance appliance, which would be running a whole bunch of stuff (packet capture, HTTP/HTTPS proxy, content decoders, etc.).

My goal is to accept emails and write them to disk (at a very high rate), and do the rest with the emails later (at a certain rate/bandwidth).

I already have a similar system written in Java. I want to port it to golang to leverage openssl for better throughput and low latency, better GC (ideally fewer GC pauses, which I believe guerrilla and golang are good for) and a low memory footprint. From what I understand, goroutines are pretty lightweight and scale better; in my case I want to spawn hundreds of them.

Our appliance leverages PCIe SSDs, which offer fast reads/writes, so I/O is not a big concern. Apart from having limited available RAM, large memory allocation and deallocation could also become a performance bottleneck unless you are using memory pools (I believe guerrilla does that, but I'm not sure).

I think it is OK if you can somehow hand over the Envelope to the backend with a bufio reader instead of a bytes.Buffer (after the DATA command is issued and after parsing the headers). In this case the backend should be responsible for checking reader/writer errors and rolling back.

I see the complication: you must always parse the headers, which are always variable length. If you are working on the reader, this will move the reader pointer to the next readable byte, which means it will either require rewinding the reader (which requires keeping sufficiently large buffers) or maybe the parsed headers can be serialized back and written before writing the tail of the MIME message. It is just a thought :). I am sure there are other complications, some of which you have already highlighted.

Thanks.

flashmob commented 7 years ago

Yes, buffers are pooled and recycled to avoid memory allocation / garbage collection. So the bytes.Buffer that is passed to the backend gets recycled for the next email.

Are you going to store the email using MailDir? How about deduplication?

Yes, the headers can be parsed while reading, and then be added back when writing out the email. A similar technique is used in the NewReader() function in envelope.go: it returns an io.MultiReader composed of two readers, one that writes out the delivery header while the other supplies the body.

Currently the DATA is read using the DotReader found in Go's textproto package. For streaming, instead of a bytes buffer, we may need to use io.Pipe(), so that we can pass the reader end of the pipe to the backend, and then the backend would read from the pipe in a new goroutine. A quick PoC example here: https://play.golang.org/p/pX6XO9AdVw (note, example uses a new goroutine for writing)

Sorry for brevity, still would need to flesh out other details...

flashmob commented 5 years ago

Update: Just posted a PR for the ability to stream the DATA command. See details https://github.com/flashmob/go-guerrilla/pull/135

flashmob / go-guerrilla

Streaming data #84