ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Timeouts and other errors while getting metadata in a large query to EPMC #117

Open solstag opened 8 years ago

solstag commented 8 years ago

These are the errors I'm getting. I am not downloading papers, just getting the metadata. I am running on a 12-core server with 40GB RAM connected to the academic network of Paris. I am running these queries through SSH from my computer at home, and that connection remains fine throughout.

JavaScript heap out of memory, 35k results

$ node node_modules/getpapers/bin/getpapers.js -a -o gp-breast_cancer-research_articles-en-1991 -q '(ABSTRACT:"breast cancer") AND (PUB_TYPE:"Research-article") AND (LANG:"eng" OR LANG:"en" OR LANG:"us") AND (FIRST_PDATE:[1991-01-01 TO 2020-12-31])'
info: Searching using eupmc API
info: Found 35437 results
Retrieving results [===================-----------] 62% (eta 1860.5s)
<--- Last few GCs --->

 3039442 ms: Mark-sweep 1373.8 (1434.6) -> 1371.2 (1433.7) MB, 2196.0 / 0 ms [allocation failure] [GC in old space requested].
 3041776 ms: Mark-sweep 1371.2 (1433.7) -> 1371.1 (1433.7) MB, 2333.4 / 0 ms [allocation failure] [GC in old space requested].
 3044068 ms: Mark-sweep 1371.1 (1433.7) -> 1371.0 (1433.7) MB, 2292.5 / 0 ms [last resort gc].
 3046376 ms: Mark-sweep 1371.0 (1433.7) -> 1370.8 (1433.7) MB, 2307.9 / 0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x33c025fc9e31 <JS Object>
    1: slowToString [buffer.js:~426] [pc=0x19e31b49be1e] (this=0x133cb0388031 <an Uint8Array with map 0x3c63cb6ad099>,encoding=0x33c025f04189 <undefined>,start=0x33c025f04189 <undefined>,end=0x33c025f04189 <undefined>)
    2: arguments adaptor frame: 1->3
    3: _encode [/srv/lisis-lab/devroot/home/ale/node_modules/restler/lib/restler.js:~191] [pc=0x19e31b74237e] (this=0x133cb0359641 <a Request...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0xfc527c [node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [node]
 6: v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag) [node]
 7: v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::String::NewStringType, int) [node]
 8: node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding) [node]
 9: void node::Buffer::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo<v8::Value> const&) [node]
10: 0x19e31b45e968
Abandon

JavaScript heap out of memory, 10k results

$ node node_modules/getpapers/bin/getpapers.js -a -o gp-breast_cancer-research_articles-en-2001-2011 -q '(ABSTRACT:"breast cancer") AND (PUB_TYPE:"Research-article") AND (LANG:"eng" OR LANG:"en" OR LANG:"us") AND (FIRST_PDATE:[2001-01-01 TO 2011-01-01])'
info: Searching using eupmc API
info: Found 10488 results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata

<--- Last few GCs --->

 1714336 ms: Mark-sweep 1048.7 (1106.8) -> 1043.0 (1108.8) MB, 1038.2 / 0 ms [allocation failure] [GC in old space requested].
 1715469 ms: Mark-sweep 1043.0 (1108.8) -> 1043.0 (1109.8) MB, 1132.7 / 0 ms [allocation failure] [GC in old space requested].
 1716628 ms: Mark-sweep 1043.0 (1109.8) -> 1042.7 (1107.8) MB, 1159.0 / 0 ms [last resort gc].
 1717731 ms: Mark-sweep 1042.7 (1107.8) -> 1042.7 (1107.8) MB, 1102.5 / 0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x28eef17c9e31 <JS Object>
    1: SparseJoinWithSeparatorJS(aka SparseJoinWithSeparatorJS) [native array.js:~84] [pc=0x1d83528fcf37] (this=0x28eef1704189 <undefined>,w=0x9a6465a0381 <JS Array[10488]>,L=10488,M=0x28eef17b49e9 <JS Function ConvertToString (SharedFunctionInfo 0x28eef174ef79)>,N=0x18cc4961e29 <String[4]\: ,\n  >)
    2: Join(aka Join) [native array.js:143] [pc=0x1d835290c256] (this=0x28eef1704189 <undefined>...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0xfc527c [node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [node]
 6: v8::internal::Runtime_SparseJoinWithSeparator(int, v8::internal::Object**, v8::internal::Isolate*) [node]
 7: 0x1d835240961b
Abandon

Timeout

$ node node_modules/getpapers/bin/getpapers.js -a -o gp-breast_cancer-research_articles-en-2011-2016 -q '(ABSTRACT:"breast cancer") AND (PUB_TYPE:"Research-article") AND (LANG:"eng" OR LANG:"en" OR LANG:"us") AND (FIRST_PDATE:[2011-01-01 TO 2016-01-01])'
info: Searching using eupmc API
info: Found 22325 results
Retrieving results [============------------------] 40% (eta 1997.1s)error: Did not get a response from Europe PMC within 20000ms

Empty query

$ node node_modules/getpapers/bin/getpapers.js -a -o gp-breast_cancer-research_articles-en-2011-2014 -q '(ABSTRACT:"breast cancer") AND (PUB_TYPE:"Research-article") AND (LANG:"eng" OR LANG:"en" OR LANG:"us") AND (FIRST_PDATE:[2011-01-01 TO 2014-01-01])'
info: Searching using eupmc API
info: Found 12463 results
Retrieving results [========================------] 80% (eta 425.8s)error: Malformed or empty response from EuropePMC. Try running again. Perhaps your query is wrong.

sedimentation-fault commented 7 years ago

I also encounter the

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

error, with the query:

getpapers --api 'arxiv' --query "cat:math.MP" --outdir "math.MP" -p

at the last step, after the results have been retrieved and while the metadata is being saved:

info: Searching using arxiv API
info: Found 49619 results
Retrieving results [=====-------------------------] 16% (eta 1101.6s)error: Malformed response from arXiv API - no data in feed
info: Retrying failed request
Retrieving results [======------------------------] 20% (eta 1097.4s)error: Malformed response from arXiv API - no data in feed
info: Retrying failed request
Retrieving results [=======-----------------------] 23% (eta 1108.8s)error: Malformed response from arXiv API - no data in feed
info: Retrying failed request
...
Retrieving results [==============================] 100% (eta 4.2s)error: Malformed response from arXiv API - no data in feed
info: Retrying failed request
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata

<--- Last few GCs --->

 3320779 ms: Mark-sweep 670.2 (716.6) -> 670.2 (716.6) MB, 7699.5 / 0.0 ms [allocation failure] [scavenge might not succeed].
 3326769 ms: Mark-sweep 670.2 (716.6) -> 670.2 (716.6) MB, 5990.1 / 0.0 ms [allocation failure] [scavenge might not succeed].
 3334178 ms: Mark-sweep 670.2 (716.6) -> 672.2 (708.6) MB, 7409.1 / 0.0 ms [last resort gc].
 3341032 ms: Mark-sweep 672.2 (708.6) -> 674.2 (708.6) MB, 6852.8 / 0.0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x5cf7b815 <JS Object>
    1: JSONSerialize(aka JSONSerialize) [native json.js:~141] [pc=0x294c0904] (this=0x5cf081d9 <undefined>,Q=0x5f1f9381 <String[1]: 0>,u=0x9d5dadc9 <JS Array[1]>,F=0x5cf08101 <null>,G=0x8c95bfa1 <a Stack with map 0xb3309995>,H=0x2d23084d <String[6]:       >,I=0x8c95bf91 <String[2]:   >)
    2: SerializeArray(aka SerializeArray) [native json.js:~69] [pc=0x294c2233] (this=0x5cf081d9 <undefined>,E=0x9...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x80b701ef [node]
 3: v8::Utils::ReportApiFailure(char const*, char const*) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Heap::FatalProcessOutOfMemory(char const*, bool) [node]
 6: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [node]
 7: v8::internal::Runtime_QuoteJSONString(int, v8::internal::Object**, v8::internal::Isolate*) [node]
...
49: 0x806c2e90 [node]
50: v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*) [node]
51: v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) [node]
52: v8::Function::Call(v8::Local<v8::Value>, int, v8::Local<v8::Value>*) [node]
53: node::AsyncWrap::MakeCallback(v8::Local<v8::Function>, int, v8::Local<v8::Value>*) [node]
54: node::StreamBase::EmitData(int, v8::Local<v8::Object>, v8::Local<v8::Object>) [node]
55: node::StreamWrap::OnReadImpl(int, uv_buf_t const*, uv_handle_type, void*) [node]
56: node::StreamWrap::OnReadCommon(uv_stream_s*, int, uv_buf_t const*, uv_handle_type) [node]
57: node::StreamWrap::OnRead(uv_stream_s*, int, uv_buf_t const*) [node]
58: 0xb7655503 [/usr/lib/libuv.so.1]
59: 0xb7655f1a [/usr/lib/libuv.so.1]
60: uv__io_poll [/usr/lib/libuv.so.1]
61: uv_run [/usr/lib/libuv.so.1]
62: node::Start(int, char**) [node]
63: main [node]
64: __libc_start_main [/lib/libc.so.6]
65: 0x803582d3 [node]
Cancelled

Out-of-memory error for just 50,000 results? C'mon...

sedimentation-fault commented 7 years ago

Solution

You have to pass the --max_old_space_size option (and possibly others...) to node. To this end, I tried changing

#!/usr/bin/env node

to

#!/usr/bin/env node --max_old_space_size=896

at the top of

/usr/bin/getpapers

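but this just causes getpapers to wait indefinitely after invocation - and nothing happens! (Presumably this is because Linux passes everything after the interpreter path in a shebang line as a single argument, so env never sees node and the option as separate words.)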

The only thing that worked was

  1. passing those options to node on the command line and
  2. increasing the number from 896 to 1400

That is, you have to use the following magic invocation for your large query:

node --max_old_space_size=1400 --optimize_for_size --max_executable_size=1400 --stack_size=1400 /usr/bin/getpapers --api 'arxiv' --query "cat:math.MP" --outdir "math.MP" -p

:warning: You MUST use the full path to getpapers!

:warning: Using too large a value (say, 2000 instead of 1400) on a 32-bit system will cause nothing but a segmentation fault - you therefore have to fine-tune it on your system. On 64-bit systems, larger values like 4096 may be perfectly adequate.
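
As a rough starting point for that fine-tuning, the two figures mentioned above can be chosen from the word size reported by the operating system. This is only a sketch: `suggest_heap_mb` is a hypothetical helper, `getconf LONG_BIT` reports the word size of the userland rather than of the node binary itself, and 1400 and 4096 are simply the numbers from this comment, not verified limits.

```shell
#!/bin/sh
# Hypothetical helper: map a word size (32 or 64) to a starting value for
# --max_old_space_size, using the figures quoted in this comment.
suggest_heap_mb() {
    case "$1" in
        64) echo 4096 ;;   # "may be perfectly adequate" on 64-bit systems
        *)  echo 1400 ;;   # the value that worked in the invocation above
    esac
}

# getconf LONG_BIT prints 32 or 64 on most Linux systems; fall back to 32.
bits="$(getconf LONG_BIT 2>/dev/null || echo 32)"
echo "word size: ${bits}-bit, try --max_old_space_size=$(suggest_heap_mb "$bits")"
```

The suggested number is only a first guess; if node still aborts or segfaults, the value has to be adjusted by hand.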

:red_circle: It would be very desirable to have a more comfortable way to pass those options, rather than having to type them on the command line and change the invocation from 'getpapers ...' to 'node ... /usr/bin/getpapers ...'.
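
Until something like that exists, a small shell function can hide the long invocation. A minimal sketch, assuming a POSIX shell; `getpapers_big`, `NODE_BIN`, `GETPAPERS_BIN` and `GETPAPERS_HEAP_MB` are hypothetical names introduced here and are not part of getpapers:

```shell
#!/bin/sh
# Hypothetical wrapper around the manual invocation shown above.
# All three variables are assumptions and can be overridden in the environment.
GETPAPERS_HEAP_MB="${GETPAPERS_HEAP_MB:-1400}"       # heap limit in MB
NODE_BIN="${NODE_BIN:-node}"                         # node executable
GETPAPERS_BIN="${GETPAPERS_BIN:-/usr/bin/getpapers}" # full path is required

getpapers_big() {
    # Pass the V8 flag to node first, then the script, then the user's
    # arguments unchanged ("$@" preserves quoting of each argument).
    "$NODE_BIN" "--max_old_space_size=$GETPAPERS_HEAP_MB" "$GETPAPERS_BIN" "$@"
}
```

After sourcing this (for example from ~/.profile), `getpapers_big --api 'arxiv' --query "cat:math.MP" --outdir "math.MP" -p` would run node with the heap flag and the full script path. The other V8 flags from the invocation above could be added to the function in the same way.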