Keats / tera

A template engine for Rust based on Jinja2/Django
http://keats.github.io/tera/
MIT License
3.49k stars 282 forks source link

Streamable templates #211

Closed ghost closed 4 years ago

ghost commented 7 years ago

Currently, Tera::render returns a string, which comes from Renderer::render, which builds the string by rendering each component and then using String::push_str.

Unfortunately for extremely large templates, this uses a lot of memory and might take a lot of time, and the biggest problem is that rendering is a blocking operation. It might be smarter to have a different render method that returns a struct that implements std::io::Read so it can be streamed and not store the entire response in the heap before sending it over to the server.

(I'm obviously talking about using tera in conjunction with a web application framework as opposed to tera alone, but let's be honest, that's 99% of tera's use cases)

ghost commented 7 years ago

I think a potential solution would be to map |node| { &self.render_node(node).chain_err(|| self.get_error_location())? } over the ast and then have some other struct that implements the std::io::Read trait for an arbitrary iterator of strings or characters.

Keats commented 7 years ago

(I'm obviously talking about using tera in conjunction with a web application framework as opposed to tera alone, but let's be honest, that's 99% of tera's use cases)

Ah I'm not actually using Tera in a webserver :p

How big are the templates using lots of memory? I would consider that a bug if it's too memory hungry/slow. Can you add such a template to the benchmarks folder?

ghost commented 7 years ago

It's not so much the templates themselves being too big, but the data fed into the template's context. My specific use case just looks like {% for x in stuff %}<div>{{x}} example idk</div>{% endfor %}, where stuff is just a huge iterator. Memory spikes as high as gigabytes when using the current rendering algorithm, even though the final output is about half the size of the spike. And it's a blocking operation too; the majority (>90%) of the time that the request takes is spent on rendering, not processing the stuff. It's pretty scary.

I haven't investigated this closely, but I'm unsure why there would be any problem rendering the node itself after the AST is built (i.e. my template is syntactically well-formed) and the values from the context are semantically compatible with my template (e.g. if I said {% for a, b in X %}, then tera has already knows that X is a hashmap or whatever). Or in other words I'm not sure why there's a ? in &self.render_node(node).chain_err(|| self.get_error_location())?. I think there may be a way to lazily evaluate each node, almost as if we were in-order-traversing a tree, returning one rendered portion at a time in the style of an iterator. Does this sound possible?

mexus commented 7 years ago

I beg your pardon, Riamse, but what are you trying to achieve with "streaming" the stuff? I mean, in the end you will have to post everything to the webpage and wait until it's rendered. Everything's done withing a single thread, so how do you think to gain any performance with this move? How lazy evaluation will make things run faster? Maybe I'm missing something.

Also I would suggest to look into dynamic loading, i.e. you render a page with no data then use some asynchronous mechanism to fetch the data on demand.

Another thing I can tell you that serialization is not an instant process as well (and you want your data to be serialized in order to be rendered), so that load-on-demand will help here as well. It's a more complex design to implement, but a more scalable one.

ghost commented 7 years ago

With all due respect, @mexus , I think you may have misunderstood the issue. I'm not concerned about speed of rendering, per se, and dynamic loading will not solve the problem.

My problem is that the tera templating engine stores the entire evaluated template as a string on the heap, and then returns the string. This can potentially use a lot of RAM. I propose, in effect, that when you call Renderer::render, instead of the method evaluating every single node and putting it all into a string, the method returns a T: Iterator<Item=String>, and doesn't bother to render a single node. Rather, each item in the iterator is one node, rendered only when needed.

Now, I don't know if you have any experience with Python, but here's an analogy to illustrate. What Tera does right now is:

def render(self, *args_idk):
    output = []
    for node in self.ast:
        output.append(node.render())
    return output

The feature I propose is roughly equivalent to this:

def render(self, *more_args):
    for node in self.ast:
        yield node.render()

The second one only renders nodes as needed, while the former gives me all of the nodes at once. If the web server transmits data to the client like so:

server = the_server
client = my_laptop
for piece in tera.render():
    server.send(client, piece)  # send piece to the specified client

with the current implementation, at any given moment, every single node will be stored in RAM at the same time. With my proposed implementation, at any given moment, only one node will be stored in RAM.

Here is a concrete example to illustrate what I'm saying, because what I propose is a similar strategy used when serving static files. Ubuntu hosts a 1.5 gigabyte .iso of the operating system on their own server. I am 100% sure that they don't transmit it to my computer by loading every 1.5GB of data into RAM at once and then iterating over it. What they do, I am sure, is:

  1. x=1
  2. read byte x and store it in RAM
  3. send byte x to the client
  4. delete byte x from RAM
  5. x++
  6. repeat steps 2-5 as needed

Thus, at any given time there is only one small portion of the file in RAM throughout the lifetime of the program. In fact, there's even a HTTP header just for this purpose to make it easier.

Forgive me if this sounds patronising, as I want to ensure my communications to be 100% understood. Is my explanation to your satisfaction?

Keats commented 7 years ago

@Riamse I would be open to adding that in Tera but as another method than render, something like render_chunks or whatever.

ghost commented 7 years ago

That's fair.

Looking at the code, it seems difficult to implement it using the current structure due to how render_node, render_for, render_if, etc. all recursively call each other. I think what we need is a way to basically flatten the AST into an iterator, which lazily evaluates each node in the AST to give us only leaf Nodes (Text, Int, Float, Bool, Raw, and the like; these are the easy to render ones), which should solve the problem at its core. All that's left is mapping |node| -> String { node.render() } and then optionally wrap iter-read around it, for the std::io::Read trait.

Keats commented 7 years ago

I'm planning to rewrite all the parsing with pest 1.0 for the next version soon but I'm not entirely sure how a flat AST would look like for the rendering. I'll go through https://github.com/Keats/tera/issues/206 first + rewrite the lexing/parsing and then we can see about flattening it

ghost commented 7 years ago

Great, sounds like a plan

ghost commented 7 years ago

https://github.com/rust-lang/rfcs/blob/master/text/2033-experimental-coroutines.md relevant

Keats commented 7 years ago

@Riamse if you want to have a look, the next version is in #218 It's not 100% done (still missing error reporting and docs update mostly) but the parser/renderer are done

ghost commented 7 years ago

I took a look but I'm not sure how it works. Could you give me a quick rundown of what happened, how to use it, differences, etc.?

Keats commented 7 years ago

Sorry I should have explained more:

Other than that the public API of Tera doesn't change, you can use it as before. I might do a few more changes but nothing dramatic. The changelog has a bit more details: https://github.com/Keats/tera/pull/218/files#diff-4ac32a78649ca5bdd8e0ba38b7006a1eR3

ghost commented 7 years ago

Okay I looked at it and I think I have a better idea of what to do. I think what this issue is really about is about how for loops are rendered. Currently render_for just evaluates the whole loop (which involves recursion - this is the biggest issue) and returns a giant string.

This is a problem and I think I might know how to fix it, but I need to understand how render_for works, how it relates with self.for_loops and self.macro_context, etc.

Keats commented 7 years ago

So the idea of render_for is to create a new context based on the context of what is currently being rendered + the local variable + the forloop specific variables. Each time it encounters a forloop, it creates an object (ForLoop) that will store all the data for the loop + any extra values only present in the forloop body such as a {% set = %} in it. To know which context to use when using variables, we need to keep track on whether the forloop is in a macro (self.macro_context) or the template being rendered (self.for_loops). This is used in lookup_ident when needed. I don't think there is anything more to the render_for but I'm not sure where you see the recursion in there?

ghost commented 6 years ago

Got it.

I need to make the AST mutable. Specifically in a line like this.

        let ast = match self.template.parents.last() {
            // this unwrap is safe; Tera will have errored already if the template doesn't exist
            Some(parent_tpl_name) => &self.tera.get_template(parent_tpl_name).unwrap().ast,
            None => &self.template.ast,
        };
        ast.push(Node::Text("hello".to_owned()));  // what I want to be able to do

I believe I have a solution to this problem but it requires manipulating the AST while rendering occurs.

ghost commented 6 years ago

I mean to say that I can't make the ast mutable right now - trying to make the match block return &mut self.tera.get_template(parent_tpl_name).unwrap().ast doesn't work, and leads me down a rabbit hole of making lots of things mutable which is painful.

ghost commented 6 years ago

I'm doing a proof of concept

ghost commented 6 years ago

Also, I think the way that the for loops works is kind of bad. I think instead of keeping track of the variables in the for loop variables themselves, it'd make more sense to just have a Context be able to store variables in scopes: Something like struct Context { data: VecDeque<BTreeMap<String, Value>> } where data[0] is the global scope and data[n] is the variables declared at n levels of nesting. Entering a for loop appends a new BTreeMap, and exiting a for loop pops a BTreeMap off the end of the list. and if you want to lookup variable hello, you check data[n], then data[n-1], then data[n-2], etc.

Am I making sense? (https://github.com/Riamse/chom-c/blob/master/grammar.py#L83 and https://github.com/Riamse/chom-c/blob/master/c.py are examples of what I'm talking about)

ghost commented 6 years ago

To know which context to use when using variables, we need to keep track on whether the forloop is in a macro (self.macro_context) or the template being rendered (self.for_loops).

Oh. Forget I said anything then.

ghost commented 6 years ago

but I'm not sure where you see the recursion in there?

basically, render_for calls render_body, which might call render_for, etc etc. that's what I meant by recursion; but fear not, I have an idea in the works. Watch this space.

ghost commented 6 years ago

@Keats please look at the commit I just committed because it passes almost all the tests and fixes the thing and I am proud

ghost commented 6 years ago

renderer::tests::inheritance::render_super_in_top_block_errors does in fact fail, but that's because of my decision to use Option<T> instead of Result<T, E> resulting in ambiguity: when it fails, it returns None, but None is also used for no-ops, so step_once() can't distinguish between the two. This is totally fixable though.

What do you think?

ghost commented 6 years ago

https://github.com/Riamse/tera/commit/4be7b1a12665a1435ebf29a36fe7f9def791a1ff this one too

Keats commented 6 years ago

I'll have a look at the code tomorrow - any benchmarks for memory usage between the render and iter_render?

Keats commented 6 years ago

And for renderer::tests::inheritance::render_super_in_top_block_errors I would still want it to fail otherwise it's hard to provide a good error message there

I'm also curious if you can run the current benchmark to see if there's any speed difference too.

ghost commented 6 years ago

I ran the benchmarks. https://pastebin.com/W6E2NApQ for the current method of rendering things and https://pastebin.com/sEvEQxRH for iter_render. I am using Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz.

I'm not sure how to measure memory usage. But basically, if your goal is to print a rendered template and not store it anywhere, iter_render should win by a landslide because you can do a series of prints on small chunks of the output string, as opposed to moving the whole thing into memory and printing it at once.

Keats commented 6 years ago

I asked on Twitter (https://twitter.com/20100Prouillet/status/934705940758253568) for the memory usage. I have to say I'm not entirely convinced though: it is significantly slower, duplicates a good chunk of the renderer and adds some not-really existing nodes in the AST. I'll wait to see if someone replies on how to test the benchmark memory so we have all the numbers to make decisions.

ghost commented 6 years ago

I'm not sure why it's slower. The only overhead should be the cost of allocating a VecDeque and skipping over no-ops.

Seems that there's also a small improvement by using a Vec instead.

ghost commented 6 years ago

Ah, one more bit of overhead - cloning body nodes (e.g. ForloopPrime). I'm not sure if there's a way around that other than reference-counting.

I have a minor issue with "not-really existing nodes" - they do, in fact, exist already, represented by the nested relation that comes with nodes having bodies. However by flattening the nested bodies piece-by-piece, we need to preserve that relation. Think of X' nodes as various types of closing brackets.

seanlinsley commented 6 years ago

Even if template streaming is marginally slower, from my perspective it's worth the cost because web application response times are most commonly slowed down by expensive / poorly optimized database queries, not by the web application itself mis-using CPU or memory.

While template streaming doesn't lower the amount of time it takes for the server to render its response, it does allow the browser to eagerly load render-critical resources, and possibly render the navigation UI. Especially on mobile, having to wait for render-critical assets to load after the HTML can easily add 5 seconds to the total render time (especially if the page requires JS).

Keats commented 6 years ago

That's not the same kind of streaming referenced in this issue though. The PoC in that issue is basically to make the loops non-recursive to save some memory but it's not streaming: you still end up with a String. What I believe you want @seanlinsley will need generators so you can implement something similar to https://github.com/pallets/jinja/blob/master/jinja2/environment.py#L1191

ghost commented 6 years ago

No, it is the same kind of streaming. The PoC does in fact return a String because that's what the tests expect as output. If you really wanted, you could duct-tape together something that returns an Iterator<Item=String> out of the Renderer's AST and step_once() (step_once is where the real magic happens).

Keats commented 6 years ago

That is not specific to your branch though, you could apply this kind of streaming to the current render. The difference would be that an entire forloop would result into a single value instead of one per iteration

ghost commented 6 years ago

The current render wouldn't be able to do this for more than one level. Suppose that my template is following block:

{% if 1==1 %}
    {% if 1 == 1 %}
        hello
    {% endif %}
    {% if 1 == 1 %}
        world
    {% endif %}
{% endif %}

The AST looks like this: [If(condition: 1==1, body: [If(condition: 1==1, body: ["hello"]), If(condition: 1==1, body: ["world"])])]

I can't figure out a way to stream ["hello", "world"] without introducing generators into the current implementation.

Keats commented 6 years ago

The latest release of Tera is much more memory efficient: it's still not streaming but it shouldn't allocate the memory for the string now, no more cloning around

ghost commented 6 years ago

I took a quick look at the code - the rendering work has been moved to https://github.com/Keats/tera/blob/master/src/renderer/processor.rs it seems? But I'm not entirely sure exactly what happened, because it seems to me that it's still allocating memory for strings at each nested level of rendering, but now we copy them into the output = String::with_capacity(10000) and then destroy them? Could you explain a bit what happened?

Keats commented 6 years ago

My last message was confusing even for me now, sorry about that.

So it doesn't do anything to help with the string allocations, the main change was using references to the original context instead of clone() lots of things. For example, in the previous versions a forloop would clone all the values used and potentially clone some part again later while now it is using Cow. In practice that means allocations only happen on filter/set calls now so it is quite a bit faster and memory usage is lower.

ghost commented 6 years ago

Can we look into adding template streaming to the new renderer architecture?

Keats commented 6 years ago

That would depend on https://github.com/Keats/tera/issues/340 and whether it requires duplicating the codebase or not

Keats commented 4 years ago

Unlikely to happen anytime soon, maybe with async/await in the future.