11ty / eleventy

A simpler site generator. Transforms a directory of templates (of varying types) into HTML.
https://www.11ty.dev/

large data sets: 1.x has issues that 0.11.1, 0.12.1 do not #2226

Open SignpostMarv opened 2 years ago

SignpostMarv commented 2 years ago

Describe the bug

I have a semi-open site generator project that squishes gigabytes of data sources down to about 6.9k pages to be processed by Eleventy in two ways:

  1. the legacy markdown repo
  2. the more up-to-date JSON data source (pagination ftw!)

0.11.1 handles the 6.9k documents & 15.6 MB JSON file without issue; 1.0.0 falls over in a similar fashion to that described in #695.

To Reproduce

The site generator is semi-open in that the source is available at https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive, but it's not feasible to stash 2.7 GB+ of source data into the repo, so the repro steps aren't readily reproducible by anyone who doesn't have the data set.

While the method mentioned in #695 of specifying --max-old-space-size does move the goalposts somewhat, it still falls over with 8 GB assigned.

Steps to reproduce the behaviour:

  1. npm run build or ./node_modules/.bin/eleventy --config=./.eleventy.pages.js
  2. watch & wait

Expected behaviour

1.x to handle 6.9k markdown documents or 6.9k JSON data file entries as reliably as 0.11.x does.

Screenshots

<--- Last few GCs --->

[16544:000001D55D359230]   168029 ms: Mark-sweep 4038.5 (4130.3) -> 4024.3 (4130.3) MB, 3347.8 / 0.0 ms  (average mu = 0.138, current mu = 0.017) task scavenge might not succeed
[16544:000001D55D359230]   171431 ms: Mark-sweep 4039.7 (4131.5) -> 4025.8 (4132.0) MB, 3345.7 / 0.0 ms  (average mu = 0.081, current mu = 0.016) task scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
 2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
 3: 00007FF7063AE5ED node::OnFatalError+301
 4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF706D8B2FD v8::Isolate::Exit+653
 6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
 7: 00007FF706C3AC57 v8::internal::Heap::PublishPendingAllocations+1159
 8: 00007FF706C37C3A v8::internal::Heap::PageFlagsAreConsistent+2874
 9: 00007FF706C2B919 v8::internal::Heap::CollectGarbage+2153
10: 00007FF706BDC315 v8::internal::IndexGenerator::~IndexGenerator+22133
11: 00007FF70633F0AF X509_STORE_CTX_get_lookup_certs+4847
12: 00007FF70633DA46 v8::CFunctionInfo::HasOptions+16150
13: 00007FF70647C27B uv_async_send+331
14: 00007FF70647BA0C uv_loop_init+1292
15: 00007FF70647BBAA uv_run+202
16: 00007FF70644ABD5 node::SpinEventLoop+309
17: 00007FF706365BC3 v8::internal::UnoptimizedCompilationInfo::feedback_vector_spec+52419
18: 00007FF7063E3598 node::Start+232
19: 00007FF70620F88C CRYPTO_memcmp+342300
20: 00007FF707322AC8 v8::internal::compiler::RepresentationChanger::Uint32OverflowOperatorFor+14488
21: 00007FFB71217034 BaseThreadInitThunk+20
22: 00007FFB71402651 RtlUserThreadStart+33


pdehaan commented 2 years ago

Wow, that definitely wins for one of the larger sites/datasets I've seen in Eleventy!

You mentioned v0.11.1 and v1.0.0, but have you tried v0.12.1 (which seems to be ~5 months newer than 0.11.x)? I'm curious if we can determine roughly where this may have changed/broken without having access to the ~2.7 GB of required data files.

npm info @11ty/eleventy time --json | grep -Ev "(canary|beta)" | tail -5

  "0.11.1": "2020-10-22T18:40:22.846Z",
  "0.12.0": "2021-03-19T19:24:27.860Z",
  "0.12.1": "2021-03-19T19:55:13.306Z",
  "1.0.0": "2022-01-08T20:27:32.789Z",
SignpostMarv commented 2 years ago

@pdehaan trying that now 👍

p.s. the data isn't exactly confidential, it's just more of a "I don't wanna have to spam up the git repo" thing :P

pdehaan commented 2 years ago

p.s. the data isn't exactly confidential, it's just more of a "I don't wanna have to spam up the git repo" thing :P

Oh, no worries. I totally don't want to download 2.7 GB of data unless… nope, I just really don't want to download roughly 1989 floppy disks' worth of data.

Although now I kind of want to add a "kb_to_floppy_disk" custom filter in Eleventy and represent all file sizes in relation to how many 3.5" floppy disks would be needed.
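
For anyone who actually wants to try that, a rough sketch of such a filter in an Eleventy config might look like the snippet below; the filter name and the 1,440 KB-per-disk figure are just taken from the joke above, this isn't a built-in filter.

    // .eleventy.js -- hypothetical "kb_to_floppy_disk" filter, purely illustrative
    module.exports = function (eleventyConfig) {
      // Assumes the input is a size in kilobytes; a 3.5" HD floppy holds 1,440 KB.
      eleventyConfig.addFilter("kb_to_floppy_disk", function (kb) {
        const disks = Math.ceil(kb / 1440);
        return `${disks.toLocaleString()} floppy disk${disks === 1 ? "" : "s"}`;
      });
    };

A template could then do something like {{ size_in_kb | kb_to_floppy_disk }}, with size_in_kb standing in for whatever data field holds the size.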

SignpostMarv commented 2 years ago

It's the subtitles and video pages for about 5.9k YouTube videos (not sure how I've got 1k more transcriptions than I have clips 🤷‍♂️)

You mentioned v0.11.1 and v1.0.0, but have you tried in v0.12.1

that completes as expected, although I haven't diffed the output to see if there are any changes/bugs etc.

pdehaan commented 2 years ago

that completes as expected, although I haven't diffed the output to see if there are any changes/bugs etc.

So, I think you're saying:

✔️ 0.11.1
✔️ 0.12.1
❌ 1.0.0
❓ 1.0.1-canary.3

Doubting this has already been fixed in 1.0.1-canary builds, but if you were looking to try the sharpest of cutting edge builds, you could try npm i @11ty/eleventy@canary. 🔪

npm info @11ty/eleventy dist-tags --json

{
  "latest": "1.0.0",
  "beta": "1.0.0-beta.10",
  "canary": "1.0.1-canary.3"
}
SignpostMarv commented 2 years ago
<--- Last few GCs --->

[16756:0000024C7C0F9B80]   144598 ms: Mark-sweep (reduce) 4067.7 (4143.4) -> 4067.1 (4143.9) MB, 7589.3 / 0.0 ms  (average mu = 0.141, current mu = 0.001) allocation failure scavenge might not succeed
[16756:0000024C7C0F9B80]   151443 ms: Mark-sweep (reduce) 4068.3 (4144.1) -> 4067.8 (4144.6) MB, 6831.9 / 0.1 ms  (average mu = 0.080, current mu = 0.002) allocation failure scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
 2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
 3: 00007FF7063AE5ED node::OnFatalError+301
 4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF706D8B2FD v8::Isolate::Exit+653
 6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
 7: 00007FF706C2C151 v8::internal::Heap::CollectGarbage+4257
 8: 00007FF706C29AC0 v8::internal::Heap::AllocateExternalBackingStore+1904
 9: 00007FF706C464E0 v8::internal::FreeListManyCached::Reset+1408
10: 00007FF706C46B95 v8::internal::Factory::AllocateRaw+37
11: 00007FF706C5AB7A v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller+90
12: 00007FF706C5AE63 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithMap+35
13: 00007FF706A689A6 v8::internal::HashTable<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::EnsureCapacity<v8::internal::Isolate>+246
14: 00007FF706A6E88E v8::internal::BaseNameDictionary<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::Add+110
15: 00007FF70697AE68 v8::internal::Runtime::GetObjectProperty+1624
16: 00007FF706E33281 v8::internal::SetupIsolateDelegate::SetupHeap+513585
17: 0000024C0028643A
$ ./node_modules/.bin/eleventy --version
1.0.1-canary.3
SignpostMarv commented 2 years ago

I totally don't want to download 2.7 GB of data unless…

@pdehaan the problematic JSON source is only 2.7 MB gzipped (in case one wanted to produce a bare-minimum reproducible case), although I suspect one could bulk-generate random test data for an array of objects with this structure & it'd do the trick:

    {
        "id": "yt-0pKBBrBp9tM",
        "url": "https:\/\/youtu.be\/0pKBBrBp9tM",
        "date": "2022-02-15",
        "dateTitle": "February 15th, 2022 Livestream",
        "title": "State of Dave",
        "description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
        "topics": [
            "PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"
        ],
        "other_parts": false,
        "is_replaced": false,
        "is_duplicate": false,
        "has_duplicates": false,
        "seealsos": false,
        "transcript": [
            /*
            this is an array of strings that could technically be structured objects but are generally only strings of
            single words up to full groups of paragraphs up, with this example having about 5-7kb of strings in total
            */
        ],
        "like_count": 7,
        "video_object": {
            "@context": "https:\/\/schema.org",
            "@type": "VideoObject",
            "name": "State of Dave",
            "description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
            "thumbnailUrl": "https:\/\/img.youtube.com\/vi\/BBrBp9tM\/hqdefault.jpg",
            "contentUrl": "https:\/\/youtu.be\/0pKBBrBp9tM",
            "url": [
                "https:\/\/youtu.be\/0pKBBrBp9tM",
                "https:\/\/archive.satisfactory.video\/transcriptions\/yt-0pKBBrBp9tM\/"
            ],
            "uploadDate": "2022-02-15"
        }
    }
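
For illustration only, a rough sketch of the kind of throwaway generator script hinted at above; this is not the script from the actual repo, the output path and field contents are made-up filler, and only the overall shape follows the sample entry:

    // generate-test-data.js -- hypothetical bulk generator for entries shaped like the sample above
    const fs = require("fs");

    const count = Number(process.argv[2] || 9000);

    // random lowercase gibberish, enough to pad out titles/descriptions/transcripts
    function randomWords(n) {
      return Array.from({ length: n }, () => Math.random().toString(36).slice(2, 8)).join(" ");
    }

    const entries = Array.from({ length: count }, (_, i) => {
      const id = `yt-${String(i).padStart(11, "0")}`;
      return {
        id,
        url: `https://youtu.be/${id}`,
        date: "2022-02-15",
        dateTitle: "February 15th, 2022 Livestream",
        title: randomWords(4),
        description: randomWords(40),
        topics: ["PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"],
        other_parts: false,
        is_replaced: false,
        is_duplicate: false,
        has_duplicates: false,
        seealsos: false,
        // roughly 5-7kb of transcript strings per entry, per the note in the sample
        transcript: Array.from({ length: 60 }, () => randomWords(15)),
        like_count: Math.floor(Math.random() * 100),
        video_object: {
          "@context": "https://schema.org",
          "@type": "VideoObject",
          name: randomWords(4),
          contentUrl: `https://youtu.be/${id}`,
          uploadDate: "2022-02-15",
        },
      };
    });

    fs.writeFileSync("./test.json", JSON.stringify(entries));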

p.s. this is the template that's in use, in case it's a combination of size-of-data and the template rather than just size-of-data: https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive/blob/d5040ac3a42f8eca9517931812892d493b81d326/11ty/pages/transcriptions.njk

SignpostMarv commented 2 years ago

@pdehaan working on an isolated test case, have managed to trigger the bug in 0.12, going to check at what point 0.12 succeeds where 1.0 fails.

SignpostMarv commented 2 years ago

@pdehaan isolated test case currently fails on 0.11, 0.12, and 1.0 at about 21980 entries: https://github.com/SignpostMarv/11ty-eleventy-issue-2226

usage: git checkout ${branch} && npm install && node ./generate ${number} && ./node_modules/.bin/eleventy

the data & templates aren't as complex as those in the media-search-archive repo; I'll give it a second pass to make it more complex if it's not useful enough to let you experiment with avoiding the heap out of memory issue.

SignpostMarv commented 2 years ago

test.json.gz

p.s. because the generator is currently non-seeded, please find attached the gzipped test.json file that all three versions currently fail on

SignpostMarv commented 2 years ago

@pdehaan including the markdown repo as a source across all three versions definitely suggests it's either templating- or data-related rather than input-related, as all three versions can handle 7k of just straight-up markdown files. Will amend further in the near future and keep you apprised.

SignpostMarv commented 2 years ago

@pdehaan bit of a delay with further investigation; I've started converting the runtime-generated data to pre-baked data, and it looks like having the 131k-line JSON data file in memory is what causes the problems.

SignpostMarv commented 2 years ago

@pdehaan have updated the test-case repo: it now fails on 1.0 with 9k entries (node ./generate.js 9000) but runs on 0.11 and 0.12 without issue.

esheehan-gsl commented 2 years ago

I'm hitting this problem as well. I have a site (only about 1,600 pages) that builds fine with Eleventy 0.12.0, but when I upgraded to 1.0.0 I get out-of-memory errors.

I've got a global data file (JS) that pulls data from a database (about 660 rows of data) and uses pagination to create one page for each entry from the database. If I shut the database off so that those pages don't get built, the build runs fine with 1.0.0.
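
For anyone not familiar with that setup, a stripped-down sketch of what it usually looks like follows; the file name, query, and database client here are placeholders, not the actual site's code:

    // _data/datasets.js -- hypothetical global data file that pulls rows from a database
    const db = require("../lib/db"); // placeholder for whatever database client is in use

    module.exports = async function () {
      // ~660 rows, each with 30+ fields (HTML blobs, paths to video files, categories, ...)
      return db.query("SELECT * FROM datasets");
    };

A paginated template then consumes it with front matter along the lines of pagination: data: datasets / size: 1 (much like the example further down this thread), so if the database is switched off the array presumably comes back empty and those pages are simply never generated.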

I can work around the issue by increasing Node's max memory thus:

NODE_OPTIONS=--max_old_space_size=8192 npm run build

Not sure what happened with 1.0.0 that increased the memory usage this much (with pagination, or global data?) but it'd be great to get it back down.

pdehaan commented 2 years ago

/summon @zachleat Possible performance regression between 0.12.x and 1.0.

Thanks @SignpostMarv, I'll try fetching the ZIP file from https://github.com/11ty/eleventy/issues/2226#issuecomment-1059791740 and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).

@esheehan-gsl How complex is your content from your database? (is it Liquid or Markdown? etc.) I've toyed with creating an "11ty lorem ipsum" blog generator in the past which just creates X pages based on some global variable so I can poke at performance issues like this w/ bigger sites. But sometimes it comes down more to how many other filters and plugins there are and the general complexity of the site than to just 600 pages vs 6000 pages (which can be frustrating).

esheehan-gsl commented 2 years ago

How complex is your content from your database? (is it Liquid or Markdown? etc)

There are quite a few fields coming from the database, probably over 30. Some of it is HTML, some of it is just metadata (paths to video files, categories) that gets rendered on the page.

If it helps, it's used to build these pages: https://sos.noaa.gov/catalog/datasets/721/

SignpostMarv commented 2 years ago

Thanks @SignpostMarv, I'll try fetching the ZIP file from #2226 (comment) and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).

@pdehaan to clarify, the zip file isn't needed, as the problem is replicable at a lower volume of generated pages (9k + supplementary data) rather than the zip file's higher volume (21.9k w/ no supplementary data).

pdehaan commented 2 years ago

I created https://github.com/pdehaan/11ty-lorem which can generate 20k pages (in ~21s). If I bump it to around ~22k pages, it seems to hit memory issues (on Eleventy v1.0.0).

SignpostMarv commented 2 years ago

@pdehaan could you now grab the supplementary data file from my test repo (or generate something similar) and see how much lower you have to drop the page count?

zachleat commented 2 years ago

Howdy y’all, there are a few issues to organize here so just to keep things concise I am opening #2360 to coordinate this one. Please follow along there!

SignpostMarv commented 2 years ago

@zachleat tracking updates specific to the test repo here, rather than on new/open tickets:

80000

40000

as above, except for:

zachleat commented 2 years ago

What are the success conditions here? Is 80K the goal?

SignpostMarv commented 2 years ago

@zachleat was basing the test cases on your Google spreadsheet; one assumes if it succeeds at 40k it'll succeed at the other sizes you found.

p.s. I'm not sure if the 80k "too many open files" thing should be counted as a new issue or a won't-fix?

SignpostMarv commented 2 years ago

success @ 50k + 55k + 59k + 59.9k + 59.92k + 59.925k + 59.928k + 59.929k, too many open files @ 60k + 59.99k + 59.95k + 59.93k

A couple things that I'm noticing:

adamlaki commented 1 year ago

Is there any progress here? I also have a bigger JSON source (5.6 MB with 270k rows) that made circa 17k pages. On my local setup I can build it with --max_old_space_size in ~5 minutes, but on Netlify it breaks with the heap limit.

On another topic: do you have any tips on importing this amount without breaking? Is an external database a better idea?

Thank you!

SignpostMarv commented 1 year ago

On another topic: do you have any tips on importing this amount without breaking? Is an external database a better idea?

Thank you!

The most terrible option would be to duplicate templates & split the data up.

adamlaki commented 1 year ago

Yeah, that is something that came to my mind, too, but it will kill the pagination and the collection as a whole. It would be cool if we could break these files into smaller pieces and source them under the same collection or something similar.

For some reason, I could build it on Netlify without the error (maybe it needed time for the NODE_OPTIONS, or it had a better day, I'm not sure, unfortunately), but it's still complicated to plan around knowing this problem. And my demo is quite plain, almost only data, with the biggest extra being an image-inliner (SVG) shortcode for the icons.

Thank you for the feedback. I'll update if there's anything worthwhile.

SignpostMarv commented 1 year ago

Yeah, that is something that came to my mind, too, but it will kill the pagination and the collection as a whole. It would be cool if we could break these files into smaller pieces and source them under the same collection or something similar.

If you're referring to pagination links, one assumes that if you're taking steps to have data automatically split, you can have pagination values automatically generated "correctly"?

adamlaki commented 1 year ago

Breaking the source file beforehand could work for me if I could handle it as one collection at import. Still, it's much more editorial work to manage, but at least there's no hacking at the template level. For the pagination (to connect two sources): I think you can offset the second source's pagination, but you still have two unrelated data groups with more administration and hacky solutions.
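
One possible shape for the "split the files but keep one collection" idea is a global JS data file that stitches the chunks back together at build time. The sketch below is illustrative only (the file and directory names are made up), and note that the full merged array still ends up in memory, so it mostly helps with managing the source files rather than with the heap limit itself:

    // _data/records.js -- hypothetical: merge several smaller JSON chunks into one data set
    const fs = require("fs");
    const path = require("path");

    module.exports = function () {
      // e.g. records-chunks/records-001.json, records-chunks/records-002.json, ...
      const dir = path.join(__dirname, "records-chunks");
      return fs
        .readdirSync(dir)
        .filter((file) => file.endsWith(".json"))
        .sort()
        .flatMap((file) => JSON.parse(fs.readFileSync(path.join(dir, file), "utf8")));
    };

Templates can then paginate over records as if it were a single source.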

SignpostMarv commented 1 year ago

I've yet to revisit upgrades on mine since migrating away from the mixed markdown + json sources to eliminate the markdown source 🤔

Mcdostone commented 11 months ago

I'm facing a similar issue. I have a 1.9 GB JSON file (src/_data/configs.json) containing an array of 591,494 objects.

---
pagination:
  data: configs
  size: 1
  alias: config
permalink: "{{ config.permalink }}"
eleventyComputed: {
  title: "{{ config.data.symbol }}"
}
---

Hello {{ config.data.symbol }}

Unfortunately, it doesn't generate any HTML output. After some debugging steps, I realized that this.ctx.configs is empty and I have no visible errors in the console. I tried to increase the heap size (--max_old_space_size=8192), but still no luck.

I reduced the size of the src/_data/configs.json file and it turns out Eleventy works fine when the file size is below ~500 MB.

Operating system: macOS Ventura, M1 Pro, 16 GB
Eleventy version: 2.0.1

d3v1an7 commented 2 months ago

Anyone landing here in 2024:

If using WebC:

  1. Check for nested webc:for (gets expensive quick)
  2. Switch from @html to @raw where possible (using @html once in the base layout seems to be sufficient!)

And/or:

  1. Just switch to v3! :)

I really didn't want to bump up RAM -- we're still just in the 1,000s-of-assets range (with relatively chunky objects). Switched to v3 and haven't bumped into RAM issues since. Also much faster:

v2.0.2-alpha.2: Wrote 6871 files in 264.93 seconds (38.6ms each)
v3.0.0-alpha.17: Wrote 6871 files in 178.43 seconds (26.0ms each)

After also removing the nested for and switching to raw, it's now around:

v3.0.0-alpha.17: Wrote 6871 files in 92.99 seconds (13.5ms each)
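
As a rough illustration of point 2 above (not taken from the actual project): @html runs the value through WebC processing again, while @raw drops it in as-is, so inside a big loop the difference adds up. The names below are made up; check the WebC docs for the exact semantics in your version.

    <!-- base layout: one @html for the page content is fine -->
    <main @html="content"></main>

    <!-- inside a loop, prefer @raw so each item's HTML is not re-processed by WebC -->
    <article webc:for="post of collections.posts">
      <h2 @text="post.data.title"></h2>
      <div @raw="post.templateContent"></div>
    </article>
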
adamlaki commented 2 months ago

Hey @d3v1an7,

for me, it is still present, but somehow Netlify pushes it through (19k pages), although the live output is a bit buggy. Locally, I use a different data set with fewer records.

It is good news that v3 could solve it; I plan to migrate in the future.

Thanks for the update!