crossterm-rs / crossterm

Cross platform terminal library rust
MIT License
3.24k stars 278 forks source link

String formatting is slow. Commands like Print should write strings directly and not use write!. #628

Open markus-bauer opened 2 years ago

markus-bauer commented 2 years ago

Background

I was using Print to display the characters of a string one by one, instead of printing the entire string at once. This made me realize that using a lot of Print commands in a cycle is much slower than expected.

Problem

The flamegraph shows that most of the time is spent on core::fmt::write, which is called here (and in other commands): https://github.com/crossterm-rs/crossterm/blob/9a50fd2ce2701bb15f58722495024937143ad34d/src/style.rs#L401

It's known that the write macro (and all string formatting in rust) is considerably slower than writing directly: https://github.com/rust-lang/rust/issues/10761 https://github.com/rust-lang/rust/issues/76490

The issue is that style::Print doesn't actually need the macro because it doesn't format the string. I suspect that a lot of the commands could avoid the macro and just write the string directly.

Note, that there was a proposed change to the format macro that addresses this case (but it seems to be dead): https://github.com/rust-lang/rust/pull/76531

Apparently even creating a string with consecutive push_str() is faster than formatting: https://github.com/rust-lang/rust-clippy/issues/6926

I want to make it clear that it's just an assumption that avoiding the macro will improve performance. Unfortunately the crossterm codebase is designed so much around fmt that I found it difficult to test. The only quick test I could come up with is to just replicate the core issue. The code below writes a character n times to stdout, using either Print, write!, or write_all. It looks like it's up to 2x as fast without the macro, but the results are inconsistent between runs.

#![feature(test)]

use crossterm::{style::Print, terminal, QueueableCommand};
use std::io::{stdout, BufWriter, Write};

extern crate test;
use test::Bencher;

enum WriteCharacters {
    Format,
    Direct,
    Crossterm,
}
impl WriteCharacters {
    /// Writes a character n times.
    fn run(&self) {
        // NOTE: make the buffer big enough to avoid flushing.
        let mut bufwriter = BufWriter::with_capacity(100 * 1024, stdout());
        bufwriter.queue(terminal::EnterAlternateScreen);
        let text = "x";
        let n = 1600 * 900;
        for _ in 0..n {
            match self {
                Self::Crossterm => {
                    // using Print:
                    bufwriter.queue(Print(text));
                }
                Self::Format => {
                    // this is like Print and should give the same results:
                    write!(bufwriter, "{}", text);
                }
                Self::Direct => {
                    // this writes the string directly:
                    bufwriter.write_all(text.as_bytes());
                }
            }
        }
        bufwriter.queue(terminal::LeaveAlternateScreen);
        bufwriter.flush();
    }
}

#[bench]
fn with_crossterm(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Crossterm.run();
    });
}

#[bench]
fn with_format(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Format.run();
    });
}

#[bench]
fn without_format(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Direct.run();
    });
}
TimonPost commented 2 years ago

If it is faster, seems like an easy optimization. Although the point of write! is that it does not flush, but depending on the use of execute or queue it will be flushed manually.

Note that BufWriter does some internal optimizations to get the best write speed. Maybe this is not desirable for a benchmark.

Unfortunately the crossterm codebase is designed so much around fmt that I found it difficult to test.

It's a feature rather than a bug, crossterm has to deal with ANSI codes that are written to the terminal or any fmt::Writer, it is allowing users to pass in any kind of writer like your BufWriter.

markus-bauer commented 2 years ago

For clarification:

it is allowing users to pass in any kind of writer like your BufWriter.

Doesn't QueableCommand require writers that implement io::Write? And what do you mean by

Although the point of write! is that it does not flush

Isn't the flushing controlled by the io::Writer? It looks like the only difference between execute and queue is that flush is called after queue: https://github.com/crossterm-rs/crossterm/blob/9a50fd2ce2701bb15f58722495024937143ad34d/src/command.rs#L180-L184

You obviously know this better than me, but queueing seems to roughly work like this:

io_writer.queue() -> write_command_ansi(io_writer, command) -> command.write_ansi(&mut adapter) https://github.com/crossterm-rs/crossterm/blob/9a50fd2ce2701bb15f58722495024937143ad34d/src/command.rs#L188-L220 Adapter is a wrapper around the io writer with a impl fmt::Write. And the reason why fmt::Write is used, I assume, is to use string formatting to build the strings with ANSI codes (which apparently is slow). The write! macro then does the formatting and calls the Adapters' write_str(), which writes to the io:writer. So it seems that:

Perhaps I haven't understood it. I'll have a look at the crossterm code again next week, if I have time.

TimonPost commented 2 years ago

It looks like the only difference between execute and queue is that flush is called after queue:

Yes it is

Doesn't QueableCommand require writers that implement io::Write

Yes, QueueableCommand is implemented for all T where T implements io::Write and is not Sized. Same applies to ExecutableCommand.

Is to use string formatting to build the strings with ANSI codes

Yes, some codes require input others don't.

What about something like this:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PrintIo(pub &'static str);

impl Command for PrintIo {
    fn write_ansi(&self, f: &mut impl fmt::Write) -> fmt::Result {
        f.write_str(self.0)
    }

    #[cfg(windows)]
    fn execute_winapi(&self) -> Result<()> {
        panic!("tried to execute Print command using WinAPI, use ANSI instead");
    }
}

This uses write_str, which is implemented by the Adapter, which calls io::Write::write_all(). The do the following:

iowriter.queue(PrintIo("test"));

(not tested)

markus-bauer commented 2 years ago

Thanks for the quick and good responses.

As you yourself realized (I can see your edits); this doesn't work, because it still calls self.0.fmt:

impl<T: Display> Command for Print<T> {
    fn write_ansi(&self, f: &mut impl fmt::Write) -> fmt::Result {
        f.write_str(&self.0.to_string())
    }
    ...
}

impl<T: Display> Display for Print<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.0.fmt(f)
    }
}

But the newest version does work. I added it verbatim and ran the same benchmark as above, with the addition of PrintIo. Like before, Print was about as fast/slow as write!. Now PrintIo is faster than those, and about as fast as write_all. So it seems to work. Although it's difficult to tell without a good benchmark.

By the way, the benchmark was meant to replicate a realistic use case, where I would use BuffWriter. Are there suggestions for a better benchmark?

Note also that a new command might not be the best solution. If the results are correct, then it might be desirable to eliminate formatting in as much places as possible.

These are the results (on my old and slow machine). They were comparable in for 4 consecutive runs and different n:

test with_crossterm     ... bench: 203,041,081 ns/iter (+/- 124,728,327)
test with_crossterm_new ... bench: 110,660,099 ns/iter (+/- 37,514,261)
test with_format        ... bench: 202,446,099 ns/iter (+/- 96,569,381)
test without_format     ... bench:  96,960,696 ns/iter (+/- 31,351,189)
markus-bauer commented 2 years ago

For completeness sake, here's the updated benchmark:

#![feature(test)]

use crossterm::{
    style::{Print, PrintIo},
    terminal, QueueableCommand,
};
use std::io::{stdout, BufWriter, Write};

extern crate test;
use test::Bencher;

enum WriteCharacters {
    Format,
    Direct,
    Crossterm,
    CrosstermIo,
}
impl WriteCharacters {
    /// Writes a character n times.
    fn run(&self) {
        // NOTE: make the buffer big enough to avoid flushing.
        let mut bufwriter = BufWriter::with_capacity(100 * 1024, stdout());
        bufwriter.queue(terminal::EnterAlternateScreen);
        let text = "x";
        let n = 1600 * 900;
        for _ in 0..n {
            match self {
                Self::Crossterm => {
                    // using Print:
                    bufwriter.queue(Print(text));
                }
                Self::CrosstermIo => {
                    // using PrintIo:
                    bufwriter.queue(PrintIo(text));
                }
                Self::Format => {
                    // this is like Print and should give the same results:
                    write!(bufwriter, "{}", text);
                }
                Self::Direct => {
                    // this writes the string directly:
                    bufwriter.write_all(text.as_bytes());
                }
            }
        }
        bufwriter.queue(terminal::LeaveAlternateScreen);
        bufwriter.flush();
    }
}

#[bench]
fn with_crossterm(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Crossterm.run();
    });
}

#[bench]
fn with_crossterm_new(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::CrosstermIo.run();
    });
}
#[bench]
fn with_format(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Format.run();
    });
}

#[bench]
fn without_format(bh: &mut Bencher) {
    bh.iter(|| {
        WriteCharacters::Direct.run();
    });
}
TimonPost commented 2 years ago

So having Print with a generic T that can be any Display is valuable in my perspective. Print is just a convenient helper. The problem with PrintIo is that it only supports 'static str.

I just went and look through the code, it seems all types that can use write_str already do this. Perhaps all we can do is to add the PrintIo.

Types that already have this:

Types that cannot be changed:

Types that can be changed:

Type that require format:

markus-bauer commented 2 years ago

To summarize: We have commands that use static strings (like ResetColor), Commands that take impl Display (like Print and PrintStyledContent) and commands that convert integers to a string (like MoveTo or SetAttribute).

By Commands that cannot be changed, you mean Commands that should take impl Display? I can agree with that. This means it would be better to keep the more generic Print and introduce a less generic command like PrintStr.

But 'static str is not useful in practice. I would guess that most users want to print a &str/&String. Or at least they can do the formatting or conversion themselves and produce a &str.

I'm not very good at rust, but what about AsRef<str>, or something like that. AsRef<str> is implemented for str and String (https://doc.rust-lang.org/std/convert/trait.AsRef.html). (Disclaimer: We're reaching topcis where I don't know enough; so take this with caution.)

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PrintStr<T: AsRef<str>>(pub T);

impl<T: AsRef<str>> Command for PrintStr<T> {
    fn write_ansi(&self, f: &mut impl fmt::Write) -> fmt::Result {
        f.write_str(self.0.as_ref())
    }

I tested the benchmark using a String, whereas I was using a static str before, with PrintStr and the resuls are the good. I also tried T: ToString but this is for some reason even slower.

About the other commands: I would say that this is a bigger issue, let's stick to Print for now. But these almost all do integer to string conversion. One would have to test if there are better ways to do this. A decision for the future is if you want to use rusts' formatting or not. It would be nice not to have this issue, this is really a rust problem (or at least a quirk) and not a crossterm problem. Perhaps this will be magically fixed a year from now. As a future topic: I haven't looked into it, but there are other formatting crates.

Finally: Are we rushing this? I still don't know for certain that this does improve performance for users. Have you done any benchmarking yourself to check if this is actually worth the time?