axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0
5.81k stars 310 forks

Javascript heap out of memory #299

Closed by matprec 4 years ago

matprec commented 4 years ago

Summary

I'm trying to parse a rather large PDF file from Intel (20 MB), and one or two steps after the table detection I get a JavaScript out-of-memory error, even though plenty of system memory is available. Is there a way to increase the memory for the Node process?

Steps To Reproduce

1. Download the Intel manual, volume 2
2. Parse it via the web interface with default settings

Expected behavior

Parsing completes without an out-of-memory error.

Actual behavior

Parsing fails because V8 exits with an out-of-memory (OOM) error.

Environment

Latest Docker image
Ubuntu 18.04.03

Additional context

This happens with both pdf.js and pdfminer.

marianorodriguez commented 4 years ago

@MSleepyPanda thanks for your bug report! Could you please copy-paste the full console log? This way we can see exactly where you are having this problem.

matprec commented 4 years ago

Sure!

Starting par.sr API : node api/server/dist/index.js
(node:6) [DEP0091] DeprecationWarning: crypto.DEFAULT_ENCODING is deprecated.
(node:6) [DEP0010] DeprecationWarning: crypto.createCredentials is deprecated. Use tls.createSecureContext instead.
(node:6) [DEP0011] DeprecationWarning: crypto.Credentials is deprecated. Use tls.SecureContext instead.
[2020-01-15T18:17:26] INFO  (parsr-api/6 on 1d65d30b2e47): Api listening on port 3001!
[2020-01-15T18:18:08] INFO  (parsr-api/6 on 1d65d30b2e47): Returning the default server settings...
[2020-01-15T18:18:19] INFO  (parsr-api/6 on 1d65d30b2e47): Processing /tmp/97f9835da126ec7491a54e790ab94d/7154e025b3ba906b4e54f08db2fee7.pdf
[2020-01-15T18:18:19] INFO  (parsr-api/6 on 1d65d30b2e47): node ../../dist/bin/index.js --input-file /tmp/97f9835da126ec7491a54e790ab94d/7154e025b3ba906b4e54f08db2fee7.pdf --output-folder /opt/app-root/src/api/server/dist/output/325383-sdm-vol-2abcd-f74d4a2c0804ecbf90e85e34a433ff --document-name 325383-sdm-vol-2abcd --config /tmp/5b3f3d61c7c992db181e36533c92a9/0d796837acbb1687b36e140c2767a4.blob
[2020-01-15T18:18:25] INFO  (parsr-api/6 on 1d65d30b2e47): No info found about the current version
[2020-01-15T18:18:25] INFO  (parsr-api/6 on 1d65d30b2e47): Using config:
[2020-01-15T18:18:25] INFO  (parsr-api/6 on 1d65d30b2e47): Config {
  version: 0.5,
  cleaner: [
    'out-of-page-removal',
    [
      'whitespace-removal',
      [Object]
    ],
    [
      'redundancy-detection',
      [Object]
    ],
    [
      'table-detection',
      [Object]
    ],
    [
      'header-footer-detection',
      [Object]
    ],
    [
      'reading-order-detection',
      [Object]
    ],
    'link-detection',
    'image-detection',
    [
      'words-to-line',
      [Object]
    ],
    [
      'lines-to-paragraph',
      [Object]
    ],
    'heading-detection',
    'heading-detection-dt',
    'list-detection',
    'page-number-detection',
    'hierarchy-detection',
    [
      'regex-matcher',
      [Object]
    ]
  ],
  extractor: {
    pdf: 'pdfminer',
    img: 'tesseract',
    language: [
      'eng',
      'fra'
    ]
  },
  output: {
    granularity: 'word',
    includeMarginals: false,
    formats: {
      json: true,
      text: true,
      csv: true,
      markdown: true,
      pdf: false
    }
  }
}
[2020-01-15T18:18:25] INFO  (parsr-api/6 on 1d65d30b2e47): Using extractor: PdfminerExtractor
[2020-01-15T18:20:16] INFO  (parsr-api/6 on 1d65d30b2e47): qpdf repair successfully performed on file /tmp/97f9835da126ec7491a54e790ab94d/7154e025b3ba906b4e54f08db2fee7.pdf. New file at: /tmp/db002023b148553e56b2c821277e1f.pdf
[2020-01-15T18:20:20] INFO  (parsr-api/6 on 1d65d30b2e47): mupdf cleaning successfully performed on file /tmp/db002023b148553e56b2c821277e1f.pdf. Resulting file: /tmp/bec81398c78b4723627bda2b515326.pdf
[2020-01-15T18:20:20] INFO  (parsr-api/6 on 1d65d30b2e47): Extracting file contents with pdfminer's pdf2txt.py tool...
[2020-01-15T18:22:12] INFO  (parsr-api/6 on 1d65d30b2e47): Returning the default server settings...
[2020-01-15T18:35:37] INFO  (parsr-api/6 on 1d65d30b2e47): Extracting images and fonts to /tmp/3183c78637aef50a0aefc4a3170f17
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): <--- Last few GCs --->
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): [17:0x1f1e400]  1166291 ms: Mark-sweep 1398.5 (1404.4) -> 1398.1 (1404.9) MB, 251.6 / 0.0 ms  (average mu = 0.107, current mu = 0.035) allocation failure scavenge might not succeed
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): [17:0x1f1e400]  1166553 ms: Mark-sweep 1398.8 (1404.9) -> 1398.3 (1404.9) MB, 254.6 / 0.0 ms  (average mu = 0.075, current mu = 0.030) allocation failure scavenge might not succeed
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): <--- JS stacktrace --->
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): ==== JS stack trace =========================================
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47):     0: ExitFrame [pc: 0x10bd78a5452b]
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47):     1: StubFrame [pc: 0x10bd78a79841]
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47): Security context: 0x31f63e7aede9 <JSObject>
[2020-01-15T18:37:46] INFO  (parsr-api/6 on 1d65d30b2e47):     2: parse [0x10d19f9a0c61] [/opt/app-root/src/node_modules/xmldom/sax.js:111] [bytecode=0x1e292de359e9 offset=490](this=0x30f3cbc884f9 <JSGlobal Object>,source=0x15ab80b82201 <Very long string[625153970]>,defaultNSMapCopy=0x282c6cc73181 <Object map = 0x211de74c1341>,entityMap=0x282c6cc732e1 <Object map = 0x211de74c4ba...
[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47): FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47):  1: 0x7f139475d948 node::Abort() [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 2: 0x7f139475d991  [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 3: 0x7f139491ef92 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47):  4: 0x7f139491f1e8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47):  5: 0x7f1394ca0722  [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47):  6: 0x7f1394cb1173 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 7: 0x7f1394cb1a66 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 8: 0x7f1394cb41ad v8::internal::Heap::AllocateRawWithLigthRetry(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 9: 0x7f1394cb4202 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47): 10: 0x7f1394c82b64 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T18:37:46] ERROR (parsr-api/6 on 1d65d30b2e47): 11: 0x7f1394eecbc5 v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/lib/x86_64-linux-gnu/libnode.so.64]
12: 0x10bd78a5452b 

[2020-01-15T18:37:47] INFO  (parsr-api/6 on 1d65d30b2e47): Process exited

matprec commented 4 years ago

I've manually edited the Docker image to give the Node process 8 GB of RAM, but it just churns through it in the image detection pass and eventually fails. I'm currently trying a second run with that pass disabled.

matprec commented 4 years ago

The problem persists with the increased memory limit and with the image detection pass disabled via the GUI.

[2020-01-15T19:32:17] INFO  (parsr-api/6 on fd2c879a0329): Processing /tmp/4a63c5fe718c16bced2db194dfc77d/0d6d72a5ec7f27f9f05c3e7569ea0a.pdf
[2020-01-15T19:32:17] INFO  (parsr-api/6 on fd2c879a0329): node --max-old-space-size=8192 ../../dist/bin/index.js --input-file /tmp/4a63c5fe718c16bced2db194dfc77d/0d6d72a5ec7f27f9f05c3e7569ea0a.pdf --output-folder /opt/app-root/src/api/server/dist/output/325383-sdm-vol-2abcd-4990aa3d43adf041a14bb7e26c4669 --document-name 325383-sdm-vol-2abcd --config /tmp/70f0cf3c0fe409eeddcb5fd29f8076/48614d41c4b1241158b06337a2699b.blob
[2020-01-15T19:32:19] INFO  (parsr-api/6 on fd2c879a0329): No info found about the current version
[2020-01-15T19:32:19] INFO  (parsr-api/6 on fd2c879a0329): Using config:
[2020-01-15T19:32:19] INFO  (parsr-api/6 on fd2c879a0329): Config {
  version: 0.5,
  cleaner: [
    'out-of-page-removal',
    [
      'whitespace-removal',
      [Object]
    ],
    [
      'redundancy-detection',
      [Object]
    ],
    [
      'table-detection',
      [Object]
    ],
    [
      'header-footer-detection',
      [Object]
    ],
    [
      'reading-order-detection',
      [Object]
    ],
    'link-detection',
    [
      'words-to-line',
      [Object]
    ],
    [
      'lines-to-paragraph',
      [Object]
    ],
    'heading-detection',
    'heading-detection-dt',
    'list-detection',
    'page-number-detection',
    'hierarchy-detection',
    [
      'regex-matcher',
      [Object]
    ]
  ],
  extractor: {
    pdf: 'pdfminer',
    img: 'tesseract',
    language: [
      'eng',
      'fra'
    ]
  },
  output: {
    granularity: 'word',
    includeMarginals: false,
    formats: {
      json: true,
      text: true,
      csv: true,
      markdown: true,
      pdf: false
    }
  }
}
[2020-01-15T19:32:19] INFO  (parsr-api/6 on fd2c879a0329): Using extractor: PdfminerExtractor
[2020-01-15T19:33:06] INFO  (parsr-api/6 on fd2c879a0329): qpdf repair successfully performed on file /tmp/4a63c5fe718c16bced2db194dfc77d/0d6d72a5ec7f27f9f05c3e7569ea0a.pdf. New file at: /tmp/9bf4c9c59edb8b42d29ad67b842231.pdf
[2020-01-15T19:33:07] INFO  (parsr-api/6 on fd2c879a0329): mupdf cleaning successfully performed on file /tmp/9bf4c9c59edb8b42d29ad67b842231.pdf. Resulting file: /tmp/4a22c5c207311e48f2f8d2c6f22a69.pdf
[2020-01-15T19:33:07] INFO  (parsr-api/6 on fd2c879a0329): Extracting file contents with pdfminer's pdf2txt.py tool...
[2020-01-15T19:40:04] INFO  (parsr-api/6 on fd2c879a0329): Extracting images and fonts to /tmp/409bd47c5b970b53e6cd3c2f351798
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): <--- Last few GCs --->
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): [34:0xe5f730]   660554 ms: Scavenge 7528.8 (8324.4) -> 7523.3 (8331.4) MB, 32.1 / 0.0 ms  (average mu = 0.155, current mu = 0.095) allocation failure 
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): [34:0xe5f730]   665798 ms: Mark-sweep 7534.6 (8331.4) -> 7524.3 (8322.9) MB, 5189.8 / 0.0 ms  (average mu = 0.099, current mu = 0.039) allocation failure scavenge might not succeed
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): [34:0xe5f730]   665881 ms: Scavenge 7536.3 (8322.9) -> 7530.1 (8328.9) MB, 34.8 / 0.0 ms  (average mu = 0.099, current mu = 0.039) allocation failure 
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): <--- JS stacktrace --->
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): ==== JS stack trace =========================================
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329):     0: ExitFrame [pc: 0x3a3d2b05452b]
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329): Security context: 0x13e1adaaede9 <JSObject>
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329):     1: appendElement [0x227b38b25329] [/opt/app-root/src/node_modules/xmldom/sax.js:~388] [pc=0x3a3d2b169307](this=0x1a3ae95084f9 <JSGlobal Object>,el=0x01db84804179 <ElementAttributes map = 0x1417e3b10171>,domBuilder=0x03d766b97031 <DOMHandler map = 0x1417e3b0fa39>,currentNSMap=0x03d766b970c9 <Object map = 0x1417e3b0e5f1>)
[2020-01-15T19:43:28] INFO  (parsr-api/6 on fd2c879a0329):     2: parse [0x227b38b25269] [/op...
[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329): FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 0x7f07a0a02948 node::Abort() [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 2: 0x7f07a0a02991  [/usr/lib/x86_64-linux-gnu/libnode.so.64]
 3: 0x7f07a0bc3f92 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  4: 0x7f07a0bc41e8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  5: 0x7f07a0f45722  [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  6: 0x7f07a0f56173 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  7: 0x7f07a0f56a66 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  8: 0x7f07a0f591ad v8::internal::Heap::AllocateRawWithLigthRetry(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329):  9: 0x7f07a0f59202 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationSpace, v8::internal::AllocationAlignment) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329): 10: 0x7f07a0f27b64 v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationSpace) [/usr/lib/x86_64-linux-gnu/libnode.so.64]

[2020-01-15T19:43:28] ERROR (parsr-api/6 on fd2c879a0329): 11: 0x7f07a1191bc5 v8::internal::Runtime_AllocateInNewSpace(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/lib/x86_64-linux-gnu/libnode.so.64]
12: 0x3a3d2b05452b 

[2020-01-15T19:43:29] INFO  (parsr-api/6 on fd2c879a0329): Process exited

bjadel commented 4 years ago

I have the same problem and was able to work around it temporarily. When first starting the container, I set an environment variable:

docker run -p 3001:3001 -e NODE_OPTIONS=--max_old_space_size=8192 axarev/parsr
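
The logs earlier in the thread show the API launching the pipeline as a separate `node ../../dist/bin/index.js ...` process. `NODE_OPTIONS` is an environment variable, and child processes inherit the parent's environment by default, so the higher limit can reach that child too. Whether Parsr passes its environment through unchanged is an assumption here. The following is only a sketch for checking the effect on a child process, and `childHeapLimitMb` is a hypothetical helper:

```javascript
const { spawnSync } = require("child_process");

// Hypothetical helper: start a child Node process with NODE_OPTIONS set
// and return the V8 heap limit (in MB) that the child itself reports.
function childHeapLimitMb(maxOldSpaceMb) {
  const result = spawnSync(
    process.execPath, // the same node binary that runs this script
    ["-e", "console.log(require('v8').getHeapStatistics().heap_size_limit)"],
    {
      // The child only sees the flag through its environment.
      env: { ...process.env, NODE_OPTIONS: `--max-old-space-size=${maxOldSpaceMb}` },
      encoding: "utf8",
    }
  );
  if (result.status !== 0) throw new Error(result.stderr);
  return Number(result.stdout.trim()) / (1024 * 1024);
}

console.log(`child heap limit: ${childHeapLimitMb(8192).toFixed(0)} MB`);
```

The reported limit is normally slightly above the requested value, because it also includes the young generation.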

matprec commented 4 years ago

Unfortunately, this doesn't suffice in my case: the run is simply too memory-hungry, even with the image detection pass disabled.

jvalls-axa commented 4 years ago

Hi @MSleepyPanda, your issue is related to the size of the output file generated by pdfminer/pdf.js when extracting data.

We have added it to our backlog, so in the best case I expect it will be fixed in the next release on 31 January.

Thanks
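
As a rough illustration of the file-size point: the first stack trace shows xmldom's `parse` receiving a source of `<Very long string[625153970]>`. V8 stores a string with either one or two bytes per character, so the back-of-the-envelope arithmetic below (a sketch, not a measurement; `stringFootprintMiB` is a hypothetical helper) gives the size of that string alone, before any DOM nodes are built from it.

```javascript
// Length reported in the first stack trace: <Very long string[625153970]>
const xmlLength = 625153970;
const MIB = 1024 * 1024;

// V8 keeps a string as one byte per character when every character fits
// in Latin-1, and as two bytes per character otherwise.
function stringFootprintMiB(length) {
  return {
    oneByte: (length * 1) / MIB,
    twoByte: (length * 2) / MIB,
  };
}

const { oneByte, twoByte } = stringFootprintMiB(xmlLength);
console.log(`one-byte: ${oneByte.toFixed(0)} MiB, two-byte: ${twoByte.toFixed(0)} MiB`);
// prints "one-byte: 596 MiB, two-byte: 1192 MiB"
```

Either figure is a large share of the roughly 1400 MB at which the first run stalled, which is consistent with the extractor output size being the cause.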