faker-js / faker

Generate massive amounts of fake data in the browser and node.js
https://fakerjs.dev

Improve DX for methods that use all memory #2695

Closed vgjenks closed 12 hours ago

vgjenks commented 8 months ago

Pre-Checks

Describe the bug

Using a very large integer as min or max in the length options of faker.string.numeric is not handled gracefully and causes a JavaScript heap out-of-memory crash in V8. It should at least validate the inputs and throw a graceful exception. Even better: provide BigInt support in the options.

Minimal reproduction code

const { faker } = require("@faker-js/faker");

const fake = {
    bigRange: faker.string.numeric({ length: { min: 1000000000, max: 2000000000 }})
}

console.log("bigRange:", fake.bigRange);

Additional Context

No response

Environment Info

System:
  OS: macOS (Apple silicon)
Binaries:
  Node: 20.9.0
  npm: 10.1.10
npmPackages:
  @faker-js/faker: 8.4.1

Which module system do you use?

Used Package Manager

npm

ST-DDT commented 8 months ago

> Using a very large integer as min or max in the length options of faker.string.numeric is not handled gracefully and causes a JavaScript heap out-of-memory crash in V8. It should at least validate the inputs and throw a graceful exception.

The inputs are valid for faker, but they hit constraints of the underlying engine.

Any suggestion how to handle that?

What if the memory limit is increased? What if the memory is already spent on different strings/other objects? What if you use a different engine?


Previously, we limited the size arbitrarily (2 ** 20), but this also results in freeze-like delays (https://github.com/faker-js/faker/commit/6f977f6b0122dc1761152d4eb7922fed63dc87c0).

I have ideas for how to make it 4-7 times faster without losing much randomness, but generating these long strings (or arrays) still takes a long time. If we use repeated sub-strings, we could go even faster (~100-1000 times?), but we would still hit the limit of the runtime eventually. Unfortunately these limits vary between systems:

- JS spec: 2^53 - 1
- V8 (used by Chrome and Node): 2^29 - 24 (~1 GiB); on 32-bit systems, 2^28 - 16 (~512 MiB)
- Firefox: 2^30 - 2 (~2 GiB); before Firefox 65, 2^28 - 1 (~512 MiB)
- Safari: 2^31 - 1 (~4 GiB)

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length
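A guard along these lines (names and the cap are hypothetical, not faker's actual API) could turn the opaque OOM crash into an immediate, readable error:

```javascript
// Hypothetical guard: reject absurd lengths up front instead of letting the
// engine crash with an opaque OOM. 2 ** 25 chars (~64 MiB as UTF-16) is a
// conservative cap well below every engine's hard string-length limit.
const MAX_SAFE_LENGTH = 2 ** 25;

function assertSafeLength(length) {
  if (!Number.isSafeInteger(length) || length < 0) {
    throw new Error(`Invalid length: ${length}`);
  }
  if (length > MAX_SAFE_LENGTH) {
    throw new Error(
      `Requested length ${length} exceeds the safe limit of ${MAX_SAFE_LENGTH}. ` +
        'Generating strings this large is slow and may exhaust memory.'
    );
  }
  return length;
}
```

As the discussion notes, any fixed cap like this is arbitrary relative to what a given engine could actually handle.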

We could limit the size again to e.g. 2^25 to avoid the runtime limits.

In both scenarios we would limit the user arbitrarily, even if their engine supports bigger values.

Just catching the out-of-memory error isn't an option either, because we would then have to do that for every method.

We could add a hint like "If you generate more than 2^20 elements, you will likely run out of memory", but that feels very arbitrary for documentation.


> Even better: provide BigInt support in the options.

Could you explain why that would be better, or for what purpose you need BigInt support?

vgjenks commented 8 months ago

> Could you explain why that would be better, or for what purpose you need BigInt support?

Sorry, admittedly... I misunderstood the usage of these options and assumed that length with min and max would generate a string whose value is between those numbers. It's a little misleading at first glance, to be fair.
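For clarity, the two readings of min/max differ like this (a plain-JS sketch, not faker's implementation; in faker itself, `faker.number.int({ min, max })` is the method that covers a value range):

```javascript
// A string whose LENGTH falls in [min, max] -- what length: { min, max } means.
function numericStringWithLength(min, max) {
  const length = min + Math.floor(Math.random() * (max - min + 1));
  let out = '';
  for (let i = 0; i < length; i++) out += Math.floor(Math.random() * 10);
  return out;
}

// An integer whose VALUE falls in [min, max] -- what the report actually wanted.
function intInRange(min, max) {
  return min + Math.floor(Math.random() * (max - min + 1));
}
```

With `{ min: 1000000000, max: 2000000000 }`, the first reading asks for a string of one to two billion digits, which is what exhausts the heap.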

I actually came here to delete this report and submit another one about the fact that it causes you to run out of memory. I think that's the crux of the issue: it should handle this better and have some sort of safe limit, or even a reasonable, safe default with a configurable limit that lets you blow your foot off, should you choose.

My bad for not reading the docs closely enough and making an assumption; bad form. However, I was testing this within the context of the aws-sdk, and it was frustrating to have it fail with zero feedback. I had to eliminate fields almost one by one to spot where it was happening, then test it in isolation to see it run out of memory and blow V8 up.

ST-DDT commented 8 months ago

> Sorry, admittedly... I misunderstood the usage of these options and assumed that length with min and max would generate a string whose value is between those numbers. It's a little misleading at first glance, to be fair.

We had hoped that a link to the method generating a number would solve that issue.

Do you have a suggestion on how to improve this?


> I actually came here to delete this report and submit another one about the fact that it causes you to run out of memory. I think that's the crux of the issue: it should handle this better and have some sort of safe limit, or even a reasonable, safe default with a configurable limit that lets you blow your foot off, should you choose.

Even if we limit it to 2^20, you are still limited to ~1k of those strings on the heap; fewer if we limit it to 2^25, etc. Having a configurable length limit might be a solution against accidental misuse, but it would affect all methods that return strings, arrays, and objects. We will discuss this in the next team meeting. It might take a while to get implemented, probably after #2667.
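A configurable limit shared by all generating methods might look roughly like this (a hypothetical sketch with invented names, not faker's real API):

```javascript
// Hypothetical sketch: a single user-configurable limit that every
// string/array-producing method consults before allocating.
class Generator {
  constructor({ maxElements = 2 ** 20 } = {}) {
    this.maxElements = maxElements; // raise this to opt in to huge output
  }

  checkSize(requested) {
    if (requested > this.maxElements) {
      throw new Error(
        `Requested ${requested} elements, limit is ${this.maxElements}. ` +
          'Raise maxElements if you really want this.'
      );
    }
  }

  // Example string-producing method that respects the shared limit.
  numeric(length) {
    this.checkSize(length);
    let out = '';
    for (let i = 0; i < length; i++) out += Math.floor(Math.random() * 10);
    return out;
  }
}
```

Under this design, `new Generator({ maxElements: 10 }).numeric(5)` succeeds while `.numeric(100)` fails fast with a readable error instead of an OOM.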


> My bad for not reading the docs closely enough and making an assumption; bad form. However, I was testing this within the context of the aws-sdk, and it was frustrating to have it fail with zero feedback. I had to eliminate fields almost one by one to spot where it was happening, then test it in isolation to see it run out of memory and blow V8 up.

I understand how frustrating this can be. Sadly, Node does a very poor job of logging where the OOM error occurred. Even a short stack trace would make this easier.

For the future, I would recommend a binary search for the error location, or some kind of debug logging, so it is easier to detect where the code halts.


The issue was originally titled "Crashes app suddenly without error". Was there actually no error message, or did you run into a timeout first? I would like to understand whether we have to improve the implementation (for performance) or why there was no error message.

matthewmayer commented 8 months ago

"The number or range of digits to generate." is perhaps a little hard to read

Maybe

"The length of the generated string. Should either be a fixed length, or a min and max length."

ST-DDT commented 8 months ago

FFR: This needs a decision regarding introducing a configurable limit on data (string,array,...) length.

matthewmayer commented 8 months ago

I think the issue is that a user passes faker.string.numeric({ length: { min: 1000000000, max: 2000000000 }}) or faker.string.numeric({ length: 1000000000 }).

Do we think it's more likely that they actually want a string that long, or that they misunderstood the length parameter?

How about, if a very large length is provided, we log a warning like:

> `faker.string.numeric({ length: { min: 1000000000, max: 2000000000 }})`
>
> Warning: This will create a numeric string with a length over 1 million, which may be slow or cause out-of-memory errors. The `length` parameter controls the number of digits in the string, not the maximum value. To suppress this warning, pass `{ allowLargeLength: true }`.

ST-DDT commented 8 months ago

Team Decision

ST-DDT commented 3 weeks ago

I plan to tackle this soon.

1. What should be the limit for the number of elements? 10k? 100k?

2. Suggestions for the parameter name:

3. Should this be a global config or a per-method option?

   function methodForBigData(options: { length: NumberOrRange, bypassSafety?: true })

   faker.module.methodForBigData({ length: 1_000_000, bypassSafety: true });

   // vs

   faker.bypassSafety = true;
   faker.module.methodForBigData({ length: 1_000_000 });

4. Suggestions for the error message:

You are generating huge datasets; this is slow and might cause OutOfMemory issues. Requested: {}, Limit: {}. If you want to bypass this limit, please specify the ___ option.

or

Do you really want to generate datasets this huge? This is slow and might cause OutOfMemory issues. Requested: {}, Limit: {}. If you want to bypass this limit, please specify the ___ option.


Partially blocked by #3216

ST-DDT commented 12 hours ago

Team Decision

We will not add an allowHuge or similar flag. JS's String.prototype.repeat and other similar methods don't have an allowHuge flag either. The main performance impact comes from the repeated nested method calls, not from the huge list itself.

We will try to improve the performance of methods generating strings though.
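One way to read that performance note (a sketch of the general idea, not the team's actual implementation plan): build the output in a single pass instead of making a nested call and a string concatenation per character.

```javascript
// Per-character approach: one helper-style call plus a string
// reallocation for every character generated.
function numericSlow(length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += String(Math.floor(Math.random() * 10));
  }
  return out;
}

// Batched approach: fill a preallocated array inline, then join once.
function numericFast(length) {
  const digits = new Array(length);
  for (let i = 0; i < length; i++) {
    digits[i] = (Math.random() * 10) | 0; // inline digit, no helper call
  }
  return digits.join('');
}
```

Both produce a random digit string of the requested length; the second avoids the per-character call overhead that the team identified as the main cost.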