GREsau / schemars

Generate JSON Schema documents from Rust code
https://graham.cool/schemars/
MIT License

Huge LLVM line count, quadratic compile time for derive(JsonSchema) #246

Open adamchalmers opened 1 year ago

adamchalmers commented 1 year ago

Hi there!

Firstly, thanks for this library. It's really helped a lot of the Rust web ecosystem.

I've been using derive(schemars::JsonSchema) on large enums for a while. By "large" I mean 200 or 300 enum variants, e.g. an enum CountryCode with variants for each ISO-3166 country code like US, AU, CN etc.

See here for an example enum (250 variants) which derives JsonSchema.

When I do this, I've noticed that the generated impl JsonSchema takes up 99% of my codebase's LLVM lines, roughly three orders of magnitude more than the next-largest function -- it really outputs a lot of LLVM.

  Lines                 Copies            Function name
  -----                 ------            -------------
  117114                18                (TOTAL)
  115921 (99.0%, 99.0%)  1 (5.6%,  5.6%)  playground::_::<impl schemars::JsonSchema for playground::CountryCode>::json_schema
     318 (0.3%, 99.3%)   1 (5.6%, 11.1%)  alloc::alloc::Global::alloc_impl
     171 (0.1%, 99.4%)   1 (5.6%, 16.7%)  <schemars::schema::SchemaObject as core::default::Default>::default
     164 (0.1%, 99.5%)   2 (11.1%, 27.8%) <alloc::boxed::Box<T,A> as core::ops::drop::Drop>::drop
     155 (0.1%, 99.7%)   1 (5.6%, 33.3%)  alloc::slice::hack::into_vec
      99 (0.1%, 99.8%)   1 (5.6%, 38.9%)  <schemars::schema::SubschemaValidation as core::default::Default>::default
      80 (0.1%, 99.8%)   1 (5.6%, 44.4%)  <schemars::schema::Metadata as core::default::Default>::default
      65 (0.1%, 99.9%)   1 (5.6%, 50.0%)  alloc::alloc::exchange_malloc
      56 (0.0%, 99.9%)   1 (5.6%, 55.6%)  <alloc::alloc::Global as core::alloc::Allocator>::deallocate
      25 (0.0%, 99.9%)   1 (5.6%, 61.1%)  alloc::boxed::Box<T>::new
      25 (0.0%,100.0%)   1 (5.6%, 66.7%)  alloc::str::<impl alloc::borrow::ToOwned for str>::to_owned
       8 (0.0%,100.0%)   1 (5.6%, 72.2%)  <T as core::convert::Into<U>>::into
       8 (0.0%,100.0%)   1 (5.6%, 77.8%)  <playground::_::<impl serde::de::Deserialize for playground::CountryCode>::deserialize::__FieldVisitor as serde::de::Visitor>::expecting
       8 (0.0%,100.0%)   1 (5.6%, 83.3%)  <playground::_::<impl serde::de::Deserialize for playground::CountryCode>::deserialize::__Visitor as serde::de::Visitor>::expecting
       8 (0.0%,100.0%)   1 (5.6%, 88.9%)  alloc::slice::<impl [T]>::into_vec
       2 (0.0%,100.0%)   1 (5.6%, 94.4%)  playground::_::<impl schemars::JsonSchema for playground::CountryCode>::schema_name
       1 (0.0%,100.0%)   1 (5.6%,100.0%)  <bool as core::default::Default>::default

The good news is that the number of LLVM lines the derive(JsonSchema) macro outputs grows linearly with the number of enum variants, so there's no hidden exponential or quadratic behaviour in the macro itself. Unfortunately, according to @jyn514, LLVM optimization is quadratic in the number of lines in a single function. So derive(JsonSchema) emits one huge function, and LLVM takes quadratic time to compile it: N variants means roughly N lines in json_schema, and roughly N² time to optimize them. This means compiling the example I linked above takes an insane amount of time!

[Screenshot from the original issue, dated 2023-09-12]

Luckily this behaviour only manifests in release mode. Debug builds are very quick!

I'm not familiar with LLVM and so I'm not really sure why the JsonSchema derive expands to such a huge number of LLVM lines. By comparison, the serde derives output 5 orders of magnitude fewer LLVM lines.

I guess you could view this as a problem with derive(JsonSchema) (that it outputs so much LLVM) or with LLVM (that it should not take quadratic time to compile in release builds). But we can probably fix schemars more easily than LLVM.

Suggestions to fix:

jyn514 commented 1 year ago

> I'm not familiar with LLVM and so I'm not really sure why the JsonSchema derive expands to such a huge number of LLVM lines. By comparison, the serde derives output 5 orders of magnitude fewer LLVM lines.

It would be interesting to see the amount of generated MIR for each - the expansion you showed me had a lot of calls to ..SchemaObject::default, and I wonder if it's generating a new assignment for every field in SchemaObject.

You can use -Z unpretty=mir to see what the MIR is before LLVM lowering.
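
A minimal sketch of the struct-update concern above, using a made-up Obj type instead of schemars's SchemaObject: ..Default::default() builds a whole default value and then moves every remaining field out of it, so each use costs one assignment per field.

#[derive(Default, Debug)]
struct Obj {
    a: Option<String>,
    b: Option<String>,
    c: Option<String>,
}

fn with_update() -> Obj {
    Obj {
        a: Some("x".to_owned()),
        ..Default::default()
    }
}

// Roughly what the struct update desugars to: one move per remaining field.
fn desugared() -> Obj {
    let base = Obj::default();
    Obj {
        a: Some("x".to_owned()),
        b: base.b,
        c: base.c,
    }
}

fn main() {
    println!("{:?}", with_update());
    println!("{:?}", desugared());
}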

adamchalmers commented 1 year ago

@jyn514 How do I use that flag? I've tried

cargo +nightly -Z unpretty=mir build
cargo +nightly build -Z unpretty=mir

and other combinations but it always just says "unknown -Z flag specified: unpretty"

jyn514 commented 1 year ago

@adamchalmers it's a rustc flag - try something like cargo +nightly rustc -- -Z unpretty=mir

adamchalmers commented 1 year ago

Thanks, here's the expansion.

The majority of MIR is made up of code like this (this pattern occurs 64744 times):

    bb64701 (cleanup): {
        drop(_300) -> [return: bb64702, unwind terminate(cleanup)];
    }

    bb64702 (cleanup): {
        drop(_278) -> [return: bb64703, unwind terminate(cleanup)];
    }

    bb64703 (cleanup): {
        drop(_256) -> [return: bb64704, unwind terminate(cleanup)];
    }

This takes up the vast majority of lines.
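
For intuition, here is a minimal standalone sketch (not the actual schemars output): every owned temporary that is still live when a later call can unwind needs its own drop-cleanup block, so an expression that builds many owned values inline produces a long chain of blocks like the ones above.

struct Schema {
    a: Option<String>,
    b: Option<String>,
    c: Option<String>,
}

fn build() -> Schema {
    // If "b".to_owned() or "c".to_owned() unwinds, the values already built
    // for the earlier fields must be dropped, so each extra field adds
    // another (cleanup) block to the MIR. Inspect it with:
    // cargo +nightly rustc -- -Z unpretty=mir
    Schema {
        a: Some("a".to_owned()),
        b: Some("b".to_owned()),
        c: Some("c".to_owned()),
    }
}

fn main() {
    let s = build();
    assert!(s.a.is_some());
}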

GREsau commented 11 months ago

Could you try schemars 0.8.16? It contains a small change that puts a temporary value in a variable instead of passing it directly as an argument to a function - when I tested it locally with your CountryCode example, this change reduced MIR output size by ~30%

I'm sure many further improvements could be made, but it seemed worth getting a quick minimal improvement out for now
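
For illustration, a rough standalone sketch of that kind of change, with made-up helper names rather than the actual derive output:

fn wrap(values: Vec<String>) -> Option<Vec<String>> {
    Some(values)
}

// Before: the Vec is built inline inside the call expression.
fn inline_temporary() -> Option<Vec<String>> {
    wrap(vec!["AF".to_owned(), "AX".to_owned()])
}

// After: the temporary is bound to a local first, as described above, which
// changes how its scope and cleanup paths are generated and trims the MIR.
fn bound_temporary() -> Option<Vec<String>> {
    let values = vec!["AF".to_owned(), "AX".to_owned()];
    wrap(values)
}

fn main() {
    assert_eq!(inline_temporary(), bound_temporary());
}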

adamchalmers commented 11 months ago

Thanks very much for that improvement -- compiling kittycad now takes 57 seconds, down from 90 seconds, a big improvement!

If it's OK with you I'm going to keep this issue open so we can discuss further improvements -- I really appreciate the dramatic improvement so far!

saethlin commented 11 months ago

I don't know if this is already well-known, but I was reading Adam's great blog post about this situation (https://blog.adamchalmers.com/crazy-compile-time/), and I'm pretty sure that the compile time here would be effectively linear if the derive macro used a loop to build the array of variants.

Currently this code:

#[derive(schemars::JsonSchema, serde::Deserialize, serde::Serialize)]
pub enum CountryCode {
    #[serde(rename = "AF")]
    Af,
    #[serde(rename = "AX")]
    Ax
}

Expands to the following (the into_vec/Box construct is just what vec! expands to):

fn json_schema(
    gen: &mut schemars::gen::SchemaGenerator,
) -> schemars::schema::Schema {
    schemars::schema::Schema::Object(schemars::schema::SchemaObject {
        instance_type: Some(schemars::schema::InstanceType::String.into()),
        enum_values: Some(
            <[_]>::into_vec(
                #[rustc_box]
                ::alloc::boxed::Box::new(["AF".into(), "AX".into()]),
            ),  
        ),  
        ..Default::default()
    })  
}   

But I'm suggesting that it expand to something like this:

fn json_schema(
    gen: &mut schemars::gen::SchemaGenerator,
) -> schemars::schema::Schema {
    schemars::schema::Schema::Object(schemars::schema::SchemaObject {
        instance_type: Some(schemars::schema::InstanceType::String.into()),
        enum_values: Some(
            ["AF", "AX"].into_iter().map(|v| v.into()).collect()
        ),  
        ..Default::default()
    })  
}

I know that's much easier to write in surface Rust than to make happen in a macro.
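
For concreteness, here is a sketch of how a derive macro could emit that iterator-based form. It assumes the quote and proc-macro2 crates and uses made-up function names; it is not the actual schemars implementation.

use quote::quote;

// variant_names would come from parsing the enum's variants in the derive.
// The emitted code builds the values in a loop at runtime instead of
// spelling out one .into() call per variant.
fn emit_enum_values(variant_names: &[&str]) -> proc_macro2::TokenStream {
    quote! {
        enum_values: Some(
            [#(#variant_names),*].into_iter().map(|v| v.into()).collect()
        ),
    }
}

fn main() {
    // Prints the field initializer that would be spliced into json_schema().
    println!("{}", emit_enum_values(&["AF", "AX"]));
}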

adamchalmers commented 5 months ago

On my real-world project (i.e. https://github.com/KittyCAD/kittycad.rs/), compile time is VASTLY improved!

0.8.19
real    36.21s
user    205.51s
sys 10.14s
maxmem  1,134,992k

0.8.17
real    68.42s
user    239.05s
sys     9.93s
maxmem  1,246,608k

Thank you so much @icewind1991 and @GREsau.