cca / equella_scripts

Collection of miscellaneous scripts for working with openEQUELLA
https://vault.cca.edu/

`ENOMEM` HTTP error in ret.js #43

phette23 commented 2 months ago

The initial retention script `ret.js` suffered from memory problems: it would initiate a number of requests but eventually hit an `ENOMEM` error inside either node-fetch or node's native fetch implementation:

```
node:internal/deps/undici/undici:13185
      Error.captureStackTrace(err);
            ^

TypeError: fetch failed
    at node:internal/deps/undici/undici:13185:13
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
  [cause]: Error: connect ENOMEM 209.40.90.39:443 - Local (0.0.0.0:0)
      at internalConnect (node:net:1093:16)
      at defaultTriggerAsyncIdScope (node:internal/async_hooks:464:18)
      at GetAddrInfoReqWrap.emitLookup [as callback] (node:net:1492:9)
      at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:132:8) {
    errno: -12,
    code: 'ENOMEM',
    syscall: 'connect',
    address: '209.40.90.39',
    port: 443
  }
}
```

I tried a few approaches to solve this; each reduced memory usage further, but none fixed the error:

  1. An Item object consumes far more memory than the plain JSON API response it comes from, due to the parsed XML it contains, so do not map JSON to Items as you go; only do so in the final summarize function.
  2. Create a queue and track active requests, only allowing N active requests at a time (abandoned this approach entirely; a sketch follows this list).
  3. Try both an older (20.16.0) and a newer (22.8.0) node version.
  4. Switch from node-fetch to node's native fetch implementation.
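
For the record, a minimal sketch of what the abandoned queue approach (2) looked like: a fixed number of workers pull page requests off a shared list. `runQueue`, `searchPage`, and the thunk wrapping are hypothetical names for illustration, not the identifiers used in `ret.js`.

```js
// Hypothetical concurrency limiter: at most CONCURRENCY requests in flight.
const CONCURRENCY = 5

async function runQueue(tasks) {
  const results = []
  let next = 0
  // spawn CONCURRENCY workers that pull tasks off a shared index; no await
  // sits between the length check and the increment, so there is no race
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    while (next < tasks.length) {
      const i = next++
      results[i] = await tasks[i]()
    }
  })
  await Promise.all(workers)
  return results
}

// usage: wrap each page request in a thunk so it does not start immediately
// const responses = await runQueue(offsets.map(offset => () => searchPage(offset)))
```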

Of all these, 1 and 4 made a noticeable difference, but the errors continued. Finally, I gave up on concurrent code and rewrote the search function to await both the HTTP response and the parsing of the JSON response body (see the sketch below). This means there is only one request in flight at a time, and node is better able to garbage collect prior response and data objects. Memory still spikes when Items are created, but the script completed successfully, so this issue exists merely to track what I did.
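
A minimal sketch of the sequential shape the rewrite took, assuming a paged search endpoint whose response reports the total hits (`available`) and the current page (`results`); `API_ROOT`, `TOKEN`, and the exact query parameters are illustrative stand-ins, not the real values in `ret.js`:

```js
// Fetch search results one page at a time, awaiting each response and its
// JSON parse before starting the next request so that prior response
// bodies can be garbage collected.
async function search() {
  const items = []
  let start = 0
  let available = Infinity
  while (start < available) {
    const response = await fetch(`${API_ROOT}/search?start=${start}&length=50`, {
      headers: { 'X-Authorization': `access_token=${TOKEN}` },
    })
    const data = await response.json()
    available = data.available
    start += data.results.length
    // keep the raw JSON; defer creating heavy Item objects until the
    // final summarize step (approach 1 above)
    items.push(...data.results)
  }
  return items
}
```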

If this problem recurs in the future, here are a couple more ideas:

  1. The number of items more than seven years old will only continue to grow; try processing them in date-range chunks using not only `modifiedBefore` but `modifiedAfter` as well (see the sketch after this list).
  2. Use the node command-line flag `--max-old-space-size` (e.g. `node --max-old-space-size=4096 ret.js` for a 4 GB heap) to let node use more memory.
  3. Split `ret.js` in two: a `get.js` script which simply streams JSON API data into an unprocessed file, then have `ret.js` stream those JSON items through Item to determine whether they should be deaccessioned and write the final `items.json` file.
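
A hedged sketch of idea 1, walking backward through time in one-year windows so that no single search has to cover the whole seven-plus-year span; `searchRange` is a hypothetical helper wrapping a sequential search with `modifiedAfter`/`modifiedBefore` query parameters:

```js
// Process items in date-range chunks, one year at a time.
async function searchInChunks(oldest, newest) {
  const all = []
  let end = new Date(newest)
  while (end > oldest) {
    const start = new Date(end)
    start.setFullYear(start.getFullYear() - 1)
    // e.g. /search?modifiedAfter=2015-09-01&modifiedBefore=2016-09-01
    all.push(...(await searchRange(start.toISOString(), end.toISOString())))
    end = start
  }
  return all
}
```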
phette23 commented 3 weeks ago

Any process that makes repeated requests to VAULT sees similar errors (e.g. course_lists does too), and I think it has to do with the application. I haven't found a solution other than spacing out our requests, which mitigates but does not eliminate the problem.
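
A small sketch of that mitigation, assuming sequential requests; the 500 ms pause is an arbitrary example value, not a measured threshold:

```js
// Space out consecutive requests with a fixed delay.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function fetchAllSpaced(urls) {
  const responses = []
  for (const url of urls) {
    responses.push(await (await fetch(url)).json())
    await sleep(500) // give VAULT breathing room between requests
  }
  return responses
}
```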