dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.2k stars 1.57k forks source link

building long strings through concatenation is quadratic in complexity #50304

Open bksubhuti opened 1 year ago

bksubhuti commented 1 year ago

I have a dictionary with 200,000 records. Normally I process these records in js and I can read and write an 18MB file within a few seconds. (Note:.. older file about 168,000).

However, when I tried doing this in Dart for our re-written product, I ran into problems with .fromJson parsing. There seemed to be a limit of less than 90,000.

Further investigation had me try processing a csv manually. Further investigation had me write a csv manually in case my data was bad. Further investigation had me write a dart cli ...

The code is shown below and takes roughly 1 second per 1000 merely to do string processing in memory. The flutter code of the same thing takes about 2.5 seconds per 1000. So in flutter (for my real life dictionary import feature), we are talking about 600 seconds or 10 minutes. I have not ran any of these scripts to completion. They appear as "hung" even though they are just taking a long time.

Meanwhile.. I can read an sql file and pass db.Update(sql); in a few seconds. For now, I will need to store my data online for downloading dictionary updates as Raw SQL transaction inserts.

I never thought I would report speed as a bug. But when we are talking a few seconds in js versus 10 minutes, this appears as "hung" and counts as a bug. I love dart flutter, but had I known about this limitation, I might have opted for something else. on the other hand the framework and using db's does not require us to feel this pain until now.

Hardware: Lenovo IdeaPad 3 14ALC6 Ubuntu 22.04.1 [✓] Flutter (Channel stable, 3.3.5, on Ubuntu 22.04.1 LTS 5.15.0-52-generic, locale en_US.UTF-8)

$ flutter --version
Flutter 3.3.5 • channel stable • https://github.com/flutter/flutter.git
Framework • revision d9111f6402 (6 days ago) • 2022-10-19 12:27:13 -0700
Engine • revision 3ad69d7be3
Tools • Dart 2.18.2 • DevTools 2.15.0
import 'package:cli/cli.dart' as cli;
import 'dart:core';
void main(List<String> arguments) {
  //print('Hello world: ${cli.calculate()}!');

  String myString = "";
  // make 200k records
  for (int x = 0; x < 200000; x++) {
    myString += "$x, another variable$x, 8\n";
    if (x % 1000 == 1) {
      print("$x");
    }
  }
}
blopker commented 1 year ago

In Dart, string concatenation is a bit slow. I'd suggest using StringBuffer to build larger strings:

import 'dart:core';

void main(List<String> arguments) {
  var myString = StringBuffer();
  // make 200k records
  for (int x = 0; x < 200000; x++) {
    myString.writeln("$x, another variable$x, 8");
    if (x % 1000 == 1) {
      print("$x");
    }
  }
  print(myString.length);
}

This takes less than a second on my machine.

bksubhuti commented 1 year ago

Yes.. correct. My friend also had another fix with some similar but different code. These numbers are more reasonable, but the question is .. is my original code out of the question for what can be practiced? I'm somewhat of a resurrected msvc dinosaur from the 90's, so I don't know much about modern programming.

Also.. I came about this problem while trying to parse a JSON file of 200k records. This is another problem. I wonder if the "+=" operator can be overloaded withe a write instead of writeln?

I think something should change.. 10 minutes versus a second .. yes.. that is a bug to me. At a minimum.. lint should pick it up and suggest.

blopker commented 1 year ago

I'm just a Dart user, not on the team or anything, but it's pretty common to use a StringBuffer in other languages when building large strings. I guess I personally don't have an issue using it in these cases, which I've only ran into once. It is interesting why JS is so much faster using the + operator though.

bksubhuti commented 1 year ago

a string and a buffer should be the same, but i guess it is not. My friend said that 200k objects are being made with this method. It also gets cumulatively slower and slower as more records (concatenations) are made. so each copy has more "string"

Levi-Lesches commented 1 year ago

My friend said that 200k objects are being made with this method.

Yep, your friend's right:

These numbers are more reasonable, but the question is .. is my original code out of the question for what can be practiced?

There is a lint that warns you not to use += for strings: use_string_buffers. It's not enabled by default though. I just opened an issue to ask to move it to the "on by default" lints: https://github.com/dart-lang/lints/issues/90. Funnily enough, this lint has code similar to yours as an example of what not to do:

String foo() {
  final buffer = '';
  for (int i = 0; i < 10; i++) {
    buffer += 'a'; // LINT
  }
  return buffer;
}

vs

String foo() {
  final buffer = StringBuffer();
  for (int i = 0; i < 10; i++) {
    buffer.write('a');
  }
  return buffer.toString();
}

I wonder if the "+=" operator can be overloaded with a write instead of writeln?

Currently, +=, ++, and all the self-mutating operators cannot be overridden. See dart-lang/language#2580.

Hopefully that helps

mraleph commented 1 year ago

JavaScript is actually a relative outlier that decided to optimise string concatenation because people don't understand that building long strings through repetitive concatenation of immutable strings has quadratic time complexity. Most languages don't do that.

Most JavaScript engines use complicated zoo of different string representations to optimise things like concatenation and slicing. This improves performance of the code like the one you have written but comes with a lot of complexity and comes with its own performance pitfalls. Dart decided that it is not worth going this route.

The solution, as explained, is to use StringBuffer.

@lrhn do we want to update String.operator + documentation to maybe suggest use of StringBuffer?

lrhn commented 1 year ago

The reason JavaScript is "faster" at string + is that they optimize specifically for using + to build a larger string incrementally. That also has a cost.

I worked on V8 years ago, and back then its internal string model was something like:

This means you can do string concatenation in constant time. Yey.

It also means that you cannot do most simple string operations efficiently on a concatenated string. For example doing indexOf cannot do a simple memory sweep over the string contents.

Because of that, a lot of string operations start by flattening their receiver, if it's a concat-string. Basically, a concat-string is a "lazy concatenation", which gets forced when you try to use the string. In that way, it works like an implicit StringBuffer like the one in Dart (or Java), just without the explicit control.

When the string has been flattened, the concat string object becomes (header, full string, empty string), so that existing references to the object keeps working, but there is only a small constant overhead when using the string. (If the string gets moved by garbage collection, it just becomes the flattened string without the trivial concat wrapper).

You can write badly performing code in JavaScript too, like:

// Fast code:
var t0 = Date.now();
var s = "";
for (var i = 0; i < 50000; i++) {
  s = s + "xyzzy";
  var c = "xyzzy".charCodeAt("xyzzy".length >> 1); // Simple constant time operation on a string.
}
var t1 = Date.now();
// Slow code:
var s2 = "";
for (var i = 0; i < 50000; i++) {
  s2 = s2 + "xyzzy";
  var c = s2.charCodeAt(s2.length >> 2); // Still simple, right?
}
var t2 = Date.now();
console.log(t1-t0, t2-t1); // For me: 18, 5342

By forcing the implicit flattening at every step, you get the same quadratic behavior that Dart's s += ... gives you in a loop.

Dart (native) chose to not do what JavaScript does here, because it wanted a more predictable performance (and a smaller overhead) on string operations, which then doesn't have to first check if the string is a concatenation or not.

Instead Dart followed Java's strategy and introduced a StringBuffer that you should use for non-trivial concatenations, because then you can control when the flattening happens.

When compiled to JavaScript, there is no way around the JavaScript behavior, but you are still recommended to use a StrignBuffer (which just does string concatenation inside, but on a string which doesn't leak until you ask for it.)

bksubhuti commented 1 year ago

Dart (native) chose to not do what JavaScript does here, because it wanted a more predictable performance (and a smaller overhead) on string operations, which then doesn't have to first check if the string is a concatenation or not.

Well, when we consider 10 minutes versus 1 second, then there is something to consider besides "pure" coding ethics. I do think += is quite readable and I use it often in my code. I was going to write.. "If the issue of lint is accepted, you can close this", but I evaluated how annoying it would be even for my own code. I don't like lint warning messages and I assume many others will feel the same. += will be lost, but it seems that the "experts" say it is better not to use it. I'm a monk and a C++ dinosaur who started programming after 20 years of doing almost nothing. I'm willing to accept what the experts say and even clean up my += code where such operations are insignificant. I'm not sure other people will like it.. similar to null safety, but it is better to use null safety. I appreciate it even though it rendered many tutorials (when I was learning) obsolete.

So you can close it.. I guess the experts will decide about lint string buffer. I appreciate this prompt detailed discussion on this issue and all of the hard work done on dart and flutter. It is a great language and framework and no wonder why it is #1.

The other (original) issue I had was reading and parsing a very large json using the fromJson method of the model. It too was very slow and seem to stop filling json arrays at near to 87,000 . 50k was fine. This whole exercise of creating a 200k csv was to make my own data was free from corruption problems due to font conversion (myanmar to roman, etc). I guess I'll create a json programatically and then see if size matters. However, maybe there is a reverse issue of buffers? I'm using the factory in the model by usual dart documentation way and it seems that factories do not really create objects and that is the point.

blopker commented 1 year ago

Yes! Thank you for the awesome explanations. I like that Dart has predictable performance, unlike JavaScript where there are a bunch of arcane performance tricks that change between runtimes and versions. It would be cool if += was lazy like JS, but I think the lint and more explicit docs would go a long way.

The docs actually do hint at when to use StringBuffer, but it's never said explicitly. Maybe this would be a good step forward. StringBuffer does say it's "A class for concatenating strings efficiently." However, it doesn't really say when those advantages show. Also, String does not show examples of when StringBuffer would be a better option. I could see putting the example from the use_string_buffers lint into one or both of these places.

I think that doc change with the lint would reduce the confusion a lot.