carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
http://docs.carbon-lang.dev/

Extremely large integers have slow parsing performance in toolchain #980

Open jonmeow opened 2 years ago

jonmeow commented 2 years ago

Large integers (10k+ digits) have sufficiently poor parsing performance that we're seeing it cause timeouts for the toolchain fuzzers.

The slow code seems to be in getAsInteger, called from numeric_literal.cpp (permalink)

From some digging, I think the issue is that APInt has visibly slow performance when it's dealing with these large integers.

Examples can be found at https://github.com/carbon-language/carbon-lang/pull/979/files

One solution I'm considering is just having fuzzers examine input for these large numbers, and discard them. e.g., exit quickly if fuzz input contains 1k digits. But I may take a quick look to see if I can improve parsing performance first.

jonmeow commented 2 years ago

Note, if others have suggestions here, I'm all ears -- I don't intend to spend very long looking at the performance of this, just enough to see if I can quickly get something better.

chandlerc commented 2 years ago

I think we should just limit decimal number size completely.

As far as I can tell, forming a decimal number is somewhat fundamentally quadratic in the number of digits: for every extra digit we have to scale all prior parsed digits by 10, and each such scaling step is intrinsically linear because of the base conversion. Maybe there is a super clever algorithm that finds cut points where the scale can be pre-computed in some way without doing a high-precision multiply (with a linear number of carries), but quick searching doesn't show anything. Mostly ways of extracting a constant factor by doing some number of digits in parallel.

So I think we should just pick an upper limit on the number of decimal digits that are valid in an immediate. I would suggest the next power of 10 beyond 2^128 personally. It's arbitrary, but I don't think we can do much beyond that. My inclination is to limit this in the language itself rather than fuzzer because these can legitimately make compile times slow.

For hex (and octal and binary) the situation is much better -- at least in theory. It should be easy to linearly convert these to a binary bigint representation.

That said, LLVM's code in getAsInteger doesn't do this. It does a quadratic operation even for power-of-two bases. It is a much faster quadratic operation though and so may not matter in the short term. Eventually we should fix that code to use a more fundamentally efficient algorithm. It doesn't seem hard at all though so if we end up seeing this in practice that seems like a workable fix rather than introducing a limit on those literals.

tkoeppe commented 2 years ago

Would it help to only parse integer constants lazily? That is, just validate them lexically and store the string data for later, and only actually parse it when needed -- at which point you can quickly check whether the constant is too large for its intended use?

jonmeow commented 2 years ago

I would prefer to avoid lazy parsing; I worry about the complexity it can add, and I think it also leaves edge cases where large values still end up being parsed.

I don't even think performance is very bad at the 1k digit range; it's really the larger values that become an issue.

Small note on potential getAsInteger improvements: I was thinking that even for decimal it might be more efficient to parse in 64-bit groups (e.g., 19 digits at a time for base-10), then multiply the 64-bit groups together. While still quadratic, I suspect it'd end up being more efficient because fewer APInt multiplications occur.

jonmeow commented 2 years ago

I'm going to leave this open because it's still an issue, and we could improve on it (improving LLVM as well), but for now the limit should work well enough.

github-actions[bot] commented 2 years ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time. This issue is labeled inactive because the last activity was over 90 days ago.

minhdv82 commented 1 year ago

Has this issue been resolved? If not, I'd like to give it a try.

chandlerc commented 1 year ago

We added a limit, but as the last real update indicates -- there is still work that we could do to improve this.

For example, from my original comment:

For hex (and octal and binary) the situation is much better -- at least in theory. It should be easy to linearly convert these to a binary bigint representation.

That said, LLVM's code in getAsInteger doesn't do this. It does a quadratic operation even for power-of-two bases. It is a much faster quadratic operation though and so may not matter in the short term. Eventually we should fix that code to use a more fundamentally efficient algorithm. It doesn't seem hard at all though so if we end up seeing this in practice that seems like a workable fix rather than introducing a limit on those literals.

And from one of Jon's comments:

Small note on potential getAsInteger improvements: I was thinking that even for decimal it might be more efficient to parse in 64-bit groups (e.g., 19 digits at a time for base-10), then multiply the 64-bit groups together. While still quadratic, I suspect it'd end up being more efficient because fewer APInt multiplications occur.

Both of these would involve working with the LLVM community to make getAsInteger more efficient. Once that lands, we should automatically benefit when we update LLVM.

niekbouman commented 10 months ago

Maybe there is a super clever algorithm that finds cut points where the scale can be pre-computed in some way without doing a high-precision multiply (with a linear number of carries), but quick searching doesn't show anything. Mostly ways of extracting a constant factor by doing some number of digits in parallel.

Hi @chandlerc & others,

One idea could be to compute the binary representations of (large) powers of ten by means of a convolution in the Fourier domain using an FFT (which would give $O(n \log n)$ complexity). Would that be useful?

Example: The binary decomposition of 10 is 8+2. From this we can compute that the bit-decomposition of 100 is $(2 + 8)^2 = 4 + 32 + 64$, i.e. a convolution between the arrays [8,2] and [8,2]. Or, using Python/Numpy:

import numpy as np
x = np.fft.fft([2, 8, 0])  # zero-padded to 2k - 1 = 3, where k = 2 is the input length
print(np.fft.ifft(x * x).real)  # ~ [4., 32., 64.], i.e. 100 = 4 + 32 + 64

niekbouman commented 10 months ago

Hmm, an FFT is probably overkill, but I was thinking along the following lines: Would something like the code below improve things if you replace the integer additions and shifts by their corresponding ops on APInts?

https://godbolt.org/z/YdbrfG1b3

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

uint64_t str_to_num(std::string x) {

    uint64_t accum{};
    uint64_t pow_of_ten{1};

    std::for_each(x.rbegin(), x.rend(),[&](char c){ 

        switch(c - 48){
            case 1:
              accum += pow_of_ten;
              break;
            case 2:
              accum += pow_of_ten << 1;
              break;
            case 3:
              accum += pow_of_ten + (pow_of_ten << 1);
              break;
            case 4:
              accum += pow_of_ten << 2;
              break;
            case 5:
              accum += pow_of_ten + (pow_of_ten << 2);
              break;
            case 6:
              accum += (pow_of_ten << 1) + (pow_of_ten << 2);
              break;
            case 7:
              accum += pow_of_ten + (pow_of_ten << 1) + (pow_of_ten << 2);
              break;
            case 8:
              accum += pow_of_ten << 3;
              break;
            case 9:
              accum += pow_of_ten + (pow_of_ten << 3);
              break;
        }
        pow_of_ten = (pow_of_ten << 1) + (pow_of_ten << 3);
    });

    return accum;
}

int main()
{
        std::string x = "9876543210";
        assert(str_to_num(x) == 9876543210U);
}

jonmeow commented 10 months ago

@niekbouman A good benchmark to compare on would probably be a base-10 number with 10k digits. We have a 1k digit limit right now, and I'm not sure how many cases would exceed that. Here are a couple of examples around 100 digits (I couldn't find larger): https://github.com/boostorg/multiprecision/blob/develop/test/sincos.ipp https://github.com/abseil/abseil-cpp/blob/master/absl/strings/internal/str_format/convert_test.cc#L831

For reference, see LLVM's consumeInteger implementation to compare performance. I think you'd see an improvement, although benchmarking would be the best way to compare.

Note, since consumeInteger does APInt operations per digit, there are other scaling approaches that break out of per-digit work. For example, parsing a cluster of digits at a time into a uint64_t, then merging each group into the APInt. Or similarly, taking each uint64_t and combining pairs into APInts, then combining pairs of APInts until only the result remains (though this may add more complexity than it's worth).

niekbouman commented 10 months ago

My approach does not help much, but it turns out that FLINT (https://flintlib.org/) has a faster implementation. (It's LGPL, so probably not directly usable for LLVM?) I don't know exactly what they do; from a quick glance at their source code it seems to be multi-threaded and recursive.

Another idea would be to use polynomial evaluation, viewing the number $\ldots d_3d_2d_1d_0$ as the polynomial $\ldots + d_3 x^3 + d_2 x^2 + d_1 x + d_0$ at the point $x = 10$, using Horner or divide-and-conquer (see FLINT's polynomial evaluation, https://flintlib.org/doc/fmpz_poly.html#evaluation ). Edit: I benchmarked this and it is 4x slower than the fmpz_set_str function.

2023-11-14T16:46:06+01:00
Running ./a.out
Run on (8 X 800.829 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.54, 1.41, 2.02
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
BM_FromStringRef   44526519 ns     44126787 ns           15
BM_NewApproach     42482881 ns     42106288 ns           16
BM_Flint             276391 ns       272561 ns         2508

my benchmark:

//compile with:
// clang++ -lLLVM-16 -O3  -march=x86-64-v3 -lgmp -lbenchmark -lflint parse.cpp
//
#include <benchmark/benchmark.h>
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <flint/fmpz.h>
#include <functional>
#include <gmp.h>
#include <iostream>
#include <llvm/ADT/APInt.h>
#include <llvm/ADT/ArrayRef.h>
#include <llvm/ADT/SmallString.h>
#include <llvm/ADT/StringRef.h>
#include <random>
#include <string>
#include <string_view>
#include <vector>

llvm::APInt str_to_num_flint(unsigned bits, std::string const &x) {
    fmpz_t n;
    mp_ptr p;
    fmpz_init(n);
    fmpz_set_str(n, x.c_str(), 10);
    size_t limbs = fmpz_get_mpn(&p, n);
    llvm::APInt acc{bits, llvm::ArrayRef<uint64_t>(p, limbs)};
    fmpz_clear(n);
    return acc;
}

void add_inplace(std::vector<uint64_t>& a, std::vector<uint64_t> const& b, size_t size)
{
    //from gmp: https://gmplib.org/manual/Low_002dlevel-Functions
    mpn_add_n(a.data(), a.data(), b.data(), size);
}

void shl(std::vector<uint64_t>& r, std::vector<uint64_t> const& b, size_t limbs, int pos)
{
    //from gmp: https://gmplib.org/manual/Low_002dlevel-Functions
    mpn_lshift(r.data(), b.data(), limbs, pos);
}

llvm::APInt str_to_num(unsigned bits, std::string_view x) {

    size_t limbs = (bits + 63) / 64;
    std::vector<uint64_t> accum(limbs);
    std::vector<uint64_t> tmp(limbs);
    std::vector<uint64_t> pow_of_ten(limbs);
    pow_of_ten[0] = 1;
    using llvm::APInt;

    size_t curr_bits = 0;
    std::for_each(x.rbegin(), x.rend(), [&](char c) {
      curr_bits += 8;
      size_t curr_limbs = std::min(limbs, (curr_bits + 63) / 64);

      uint8_t d = c - 48;

      if (d & 1) {
        add_inplace(accum, pow_of_ten, curr_limbs);
      }
      shl(tmp, pow_of_ten, curr_limbs, 1);
      pow_of_ten = tmp;

      if (d & 2) {
        add_inplace(accum, tmp, curr_limbs);
      }
      shl(tmp, tmp, curr_limbs, 1);
      if (d & 4) {
        add_inplace(accum, tmp, curr_limbs);
      }
      shl(tmp, tmp, curr_limbs, 1);
      add_inplace(pow_of_ten, tmp, curr_limbs);
      if (d & 8) {
        add_inplace(accum, tmp, curr_limbs);
      }
    });
    APInt acc{bits, llvm::ArrayRef<uint64_t>(accum.data(), accum.size())};
    return acc;
}

std::string generate_rnd_radix10_string(size_t digits) {
  std::default_random_engine e;
  // Note: uniform_int_distribution over char types is undefined behavior
  // per the standard, so use int and narrow on push_back.
  std::uniform_int_distribution<int> leading_digit(1, 9);
  std::uniform_int_distribution<int> other_digits(0, 9);
  std::string x;
  x.reserve(digits);
  x.push_back(leading_digit(e) + 48);
  --digits;
  while (digits > 0) {
    x.push_back(other_digits(e) + 48);
    --digits;
  }
  return x;
}

static void BM_FromStringRef(benchmark::State &state) {
  // Perform setup here
  std::string x = generate_rnd_radix10_string(10000);
  unsigned bits = std::ceil(x.size() * std::log2(10));

  // do a small unit test during setup
  llvm::APInt t(bits, llvm::StringRef(x), 10);
  llvm::APInt t2 = str_to_num(bits, x);
  llvm::APInt t3 = str_to_num_flint(bits, x);
  assert(t == t2);
  assert(t == t3);

  for (auto _ : state) {
    // This code gets timed
    llvm::APInt y(bits, llvm::StringRef(x), 10);
    benchmark::DoNotOptimize(y);
  }
}

static void BM_NewApproach(benchmark::State &state) {
  // Perform setup here
  std::string x = generate_rnd_radix10_string(10000);
  unsigned bits = std::ceil(x.size() * std::log2(10));

  for (auto _ : state) {
    // This code gets timed
    llvm::APInt y = str_to_num(bits, x);
    benchmark::DoNotOptimize(y);
  }
}

static void BM_Flint(benchmark::State &state) {
  // Perform setup here
  std::string x = generate_rnd_radix10_string(10000);
  unsigned bits = std::ceil(x.size() * std::log2(10));

  for (auto _ : state) {
    // This code gets timed
    llvm::APInt y = str_to_num_flint(bits, x);
    benchmark::DoNotOptimize(y);
  }
}

// Register the functions as benchmarks
BENCHMARK(BM_FromStringRef);
BENCHMARK(BM_NewApproach);
BENCHMARK(BM_Flint);

BENCHMARK_MAIN();

jonmeow commented 10 months ago

Thanks for the benchmark. :)

(Its LGPL, so probably not directly usable for LLVM?)

For better or worse, I would expect that to be an issue. LGPL is a nice license in general, but LLVM sticks to its own "Apache 2.0 with LLVM Exceptions" license (for non-test code at least), and Carbon matches LLVM. It usually means LLVM implements its own libraries.

niekbouman commented 10 months ago

@jonmeow See https://groups.google.com/g/flint-devel/c/ZwDnuMCsnFU

niekbouman commented 9 months ago

BTW, I realised that in my benchmark I test APInt's constructor that takes a StringRef, while you refer to getAsInteger, which actually goes through another code path (one that seems, by inspection, equally slow).

Anyway, my higher-level question is: why doesn't llvm::APInt (unlike GNU MP) implement fast algorithms for basic operations like multiplication, besides lacking a faster algorithm for parsing integer literal strings? E.g., if I understand correctly, APInt only implements "schoolbook" multiplication, not the algorithms that are faster for longer bit lengths, like Toom-Cook (or the Schönhage-Strassen algorithm for very big numbers).

In case you can use MIT-licensed code in your project, hacl-star provides an MIT-licensed and formally verified implementation of Karatsuba multiplication (Toom-2).