cpllab / lm-zoo

Easy black-box access to state-of-the-art language models
https://cpllab.github.io/lm-zoo/
MIT License

RNNG get-surprisals out of memory error #53

Closed: RobertCChen closed this issue 4 years ago

RobertCChen commented 4 years ago

I ran get-surprisals with RNNG on a text file with 55 lines and got an out of memory error. The lines averaged 50 words (range: 39–62) and contained 2.5 sentences on average (range: 1–5); the sentences averaged 13.3 words (range: 3–52).

CPU memory allocation failed n=8055160832 align=32
terminate called after throwing an instance of 'dynet::out_of_memory'
  what():  CPU memory allocation failed
Aborted (core dumped)

I also tried running on OpenMind, and it got a SIGKILL error for a smaller file (metamorphosis.txt). This might be related.

I opened an interactive session:

srun --gres=gpu:1 -c 12 --mem=10G --time=2:00:00 --constraint=high-capacity --pty bash

ran the get-surprisals command:

lm-zoo get-surprisals singularity:///om/group/cpl/lm-zoo/singularity/lmzoo-rnng.sif metamorphosis.txt

and got the following error:

File "/home/robertcc/.local/lib/python3.7/site-packages/spython/utils/terminal.py", line 144, in stream_command
    raise subprocess.CalledProcessError(return_code, cmd)
subprocess.CalledProcessError: Command '['singularity', 'exec', '--bind', '/tmp/tmpvr71oauk:/host_stdin:ro', '/om/group/cpl/lm-zoo/singularity/lmzoo-rnng.sif', 'sh', '-c', 'cat /host_stdin | get_surprisals /dev/stdin', '2>/dev/null']' died with <Signals.SIGKILL: 9>.
hans commented 4 years ago

Hi @RobertCChen, after checking the RNNG codebase, I don't think there is any intrinsic memory limit in the code itself. Have you tried running an OpenMind job with a larger memory request? (I usually use 64 GB as a default for RNNG evaluation runs.)

$ srun --mem 64G -t 1-0 -p cpl --pty bash

(-p cpl requests a job on the high-priority CPL queue; this ensures your job won't be pre-empted / canceled. Please don't use this by default, but it's helpful as a test here to ensure that your job isn't killed for other reasons on the cluster.)
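For a long evaluation you could also submit this as a batch job rather than an interactive session. A minimal sketch, assuming the same container path and input file as above; the script name run_rnng.sh and the output file surprisals.tsv are just placeholders:

#!/bin/bash
#SBATCH --mem=64G
#SBATCH -t 1-0
# Same 64 GB / 1-day request as the srun example above.
# get-surprisals prints tab-separated surprisals to stdout, so capture them in a file.
lm-zoo get-surprisals singularity:///om/group/cpl/lm-zoo/singularity/lmzoo-rnng.sif metamorphosis.txt > surprisals.tsv

Submit it with sbatch run_rnng.sh and adjust the memory/time limits if the job still gets killed.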

RobertCChen commented 4 years ago

Thanks, @hans! Using a larger memory request worked.