caracal-pipeline / crystalball

Distributed prediction of visibilities from a sky model
GNU General Public License v2.0

memory usage overflow #37

Closed: mpatir closed this issue 4 years ago

mpatir commented 4 years ago

Description

I am trying to transfer the sky model to my visibilities. Crystalball crashed after about 17 hours of processing. It appears to keep increasing its memory usage despite my requesting only 50% of the total RAM, and it keeps doing so until it reaches the maximum RAM capacity of my system (see plot).

crystalball_memory_usage

What I Did

Command '['crystalball', '--sky-model', '/stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt', '--output-column', 'MODEL_DATA', '--row-chunks', '0', '--model-chunks', '0', '--num-sources', '0', '--num-workers', '0', '--memory-fraction', '0.5', '/stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms']

Traceback (not full):

MemoryError: Unable to allocate 1.80 GiB for an array with shape (1, 300222, 402, 1) and data type complex128
Traceback (most recent call last):
  File "/stimela_mount/code/run.py", line 40, in
    subprocess.check_call(shlex.split(_runc))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['crystalball', '--sky-model', '/stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt', '--output-column', 'MODEL_DATA', '--row-chunks', '0', '--model-chunks', '0', '--num-sources', '0', '--num-workers', '0', '--memory-fraction', '0.5', '/stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms']' returned non-zero exit status 1.
docker returns error code 1
job failed at 2020-05-21 04:26:36.481523 after 17:43:30.590694
Traceback (most recent call last):
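(A quick sanity check on these numbers: 300222 rows × 402 channels × 1 correlation × 16 bytes per complex128 value ≈ 1.93 GB, i.e. the 1.80 GiB the failed allocation asks for, so this array corresponds to a single row/channel chunk of the job described in the budget log further down.)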

sjperkins commented 4 years ago

@mpatir Can you run a pip freeze inside the caracal crystalball container and post the output here?

mpatir commented 4 years ago

@sjperkins "caracal crystallball container" -- is this just the normal container? or does crystalball use other container tech?.

I am running caracal inside docker.

below is what I get when I run pip freeze.
APLpy==2.0.3 appdirs==1.4.4 astropy==4.0.1.post1 astropy-healpix==0.5 astroquery==0.4 attrs==19.3.0 backcall==0.1.0 beautifulsoup4==4.9.1 bleach==3.1.5 bokeh==2.0.2 caracal @ file:///14TB/mramatsoku/Software/caracal certifi==2020.4.5.1 cffi==1.14.0 chardet==3.0.4 cryptography==2.9.2 cycler==0.10.0 Cython==0.29.18 decorator==4.4.2 defusedxml==0.6.0 docopt==0.6.2 entrypoints==0.3 future-fstrings==1.2.0 html5lib==1.0.1 idna==2.9 imagecodecs==2020.2.18 imageio==2.8.0 importlib-metadata==1.6.0 ipdb==0.13.2 ipython==7.14.0 ipython-genutils==0.2.0 jedi==0.17.0 jeepney==0.4.3 Jinja2==2.11.2 jsonschema==3.2.0 jupyter-core==4.6.3 keyring==21.2.1 kiwisolver==1.2.0 MarkupSafe==1.1.1 matplotlib==3.2.1 mistune==0.8.4 nbconvert==5.6.1 nbformat==5.0.6 networkx==2.4 nose==1.3.7 npyscreen==4.10.5 numpy==1.18.4 packaging==20.4 pandas==1.0.3 pandocfilters==1.4.2 parso==0.7.0 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.1.2 pkg-resources==0.0.0 pooch==1.1.0 progressbar2==3.51.3 prompt-toolkit==3.0.5 ptyprocess==0.6.0 PyAVM==0.9.4 pycparser==2.20 Pygments==2.6.1 pykwalify==1.7.0 pyparsing==2.4.7 pyregion==2.0 pyrsistent==0.16.0 pysolr==3.9.0 python-dateutil==2.8.1 python-utils==2.4.0 pytz==2020.1 PyWavelets==1.1.1 PyYAML==5.3.1 radiopadre-client==1.0rc10 regions==0.4 reproject==0.7 requests==2.23.0 ruamel.yaml==0.16.10 ruamel.yaml.clib==0.2.0 scikit-image==0.17.2 scipy==1.4.1 SecretStorage==3.1.2 Shapely==1.7.0 six==1.14.0 soupsieve==2.0.1 stimela==1.5.4 testpath==0.4.4 tifffile==2020.5.11 tornado==6.0.4 traitlets==4.3.3 typing-extensions==3.7.4.2 urllib3==1.25.9 wcwidth==0.1.9 webencodings==0.5.1 yamlordereddictloader==0.4.0 zipp==3.1.0

sjperkins commented 4 years ago

Hmmm, sounds like you're running Caracal inside a container, which in turn runs astronomy applications in containers -- containerception.

I don't run Caracal a lot myself.

I need you to go one level deeper and get a pip freeze of whatever's inside the crystalball container. @SpheMakh @paoloserra could you provide @mpatir with a command to do this?

paoloserra commented 4 years ago

@mpatir must be running CARACal in the standard way, i.e., within a virtual environment.

She's using current CARACal master which, in turn, uses Stimela 1.5.4 and, therefore, Crystalball 0.3.0. All this is with Docker (the other option would have been Singularity).

Sorry but I'm not sure what "pip freeze inside the crystalball container" means.

sjperkins commented 4 years ago

@mpatir must be running CARACal in the standard way, i.e., within a virtual environment.

She's using current CARACal master which, in turn, uses Stimela 1.5.4 and, therefore, Crystalball 0.3.0. All this is with Docker (the other option would have been Singularity).

I would like to confirm this.

I mean run pip freeze inside the container. Something like:

$ docker run -it image/crystalball pip freeze

where image/crystalball should be replaced with whatever the actual crystalball container is.

@mpatir Was the machine running other radio astronomy software in conjunction with crystalball?

mpatir commented 4 years ago

@sjperkins Okay, I suppose the actual crystalball container would be somewhere inside Stimela. @SpheMakh, do you know how I could find this?

The machine was only running crystalball, inside CARACal's virtual environment using the Docker containers. No other astronomy software -- this was after the selfcal step, which uses a bunch of astronomy packages, had already finished.

sjperkins commented 4 years ago

@sjperkins Okay, I suppose the actual crystalball container would be somewhere inside Stimela. @SpheMakh, do you know how I could find this?

Try the following (if CARACal is using Docker) to get the image name. I have no experience with Singularity.

$ docker images | grep crystalball

For example on my machine, docker images produces:

$ docker images
REPOSITORY                                               TAG                 IMAGE ID            CREATED             SIZE
casadev                                                  latest              47b4cdde2dfa        7 months ago        910MB
kernsuite/base                                           5                   94c877890bc0        10 months ago       191MB
tricolour-test                                           latest              4869dcea3b2b        10 months ago       1.43GB
ansible                                                  test                4a20382178af        12 months ago       246MB
ubuntu                                                   18.04               7698f282e524        12 months ago       69.9MB
alpine                                                   latest              055936d39205        12 months ago       5.53MB
kernsuite/base                                           <none>              befa68388c76        16 months ago       221MB
busybox                                                  latest              59788edf1f3e        19 months ago       1.15MB
kernsuite/base                                           dev                 1c14129ea271        2 years ago         109MB
cubical                                                  latest              52478bd4d731        2 years ago         960MB
kernsuite/base                                           3                   02e463d74e9d        2 years ago         257MB
xenial/obit                                              dev                 d18648fecbb6        2 years ago         3.65GB
ubuntu                                                   xenial-20170802     ccc7a11d65b1        2 years ago         120MB
ubuntu                                                   xenial-20170619     d355ed3537e9        2 years ago         119MB
ubuntu                                                   trusty-20170620     4a2820e686c4        2 years ago         188MB
ubuntu                                                   16.04               f49eec89601e        3 years ago         129MB
nvidia/cuda                                              7.5-cudnn5-devel    886fbf107999        3 years ago         1.35GB
sjperkins/dist-mb-paper                                  latest              f147c386a7a7        3 years ago         1.47GB
ubuntu                                                   xenial-20161010     f753707788c5        3 years ago         127MB
ceph/daemon                                              latest              759578d288cb        3 years ago         1.08GB
ubuntu                                                   trusty-20160819     4a725d3b3b1c        3 years ago         188MB
sjperkins/dev                                            latest              e4227df2d2d8        3 years ago         1.67GB
sjperkins/squid                                          3.5.19-r0           83a9bd3ccba9        3 years ago         22MB
alpine                                                   3.4                 f70c828098f5        3 years ago         4.8MB
nvidia/cuda                                              latest              c92f3b5f6760        4 years ago         1.23GB
alpine                                                   3.3                 13e1761bf172        4 years ago         4.8MB
ubuntu                                                   14.04               90d5884b1ee0        4 years ago         188MB

The image name is in the first column.

If this is a singularity install we'll need @SpheMakh's help.

sjperkins commented 4 years ago

Sorry, the image name is a combination of REPOSITORY and TAG, so in the above:

$ docker run -it kernsuite/base:5 /bin/bash pip freeze
mpatir commented 4 years ago

I think I found the container thing: I ran "docker run -it mramatsoku_cab/crystalball pip freeze"

@sjperkins is below what you are looking for?

appdirs==1.4.3 asn1crypto==0.24.0 astropy==4.0.1 codex-africanus==0.2.0 cryptography==2.1.4 crystalball==0.2.4 dask==2.13.0 dask-ms==0.2.3 decorator==4.4.2 idna==2.6 keyring==10.6.0 keyrings.alt==3.0 llvmlite==0.31.0 numba==0.48.0 numpy==1.18.2 psutil==5.7.0 pycrypto==2.6.1 pygobject==3.26.1 python-apt==1.6.4 python-casacore==3.0.0 pyxdg==0.25 PyYAML==5.3.1 regions==0.4 SecretStorage==2.3.1 six==1.11.0 toolz==0.10.0 unattended-upgrades==0.1

sjperkins commented 4 years ago

I think I found the container thing: I ran "docker run -it mramatsoku_cab/crystalball pip freeze"

@sjperkins is below what you are looking for?

Yes indeed! Thanks @mpatir.

This container looks old: crystalball is at the 0.2.4 version from 7 January, not the 0.3.0 release from 9 April.

codex-africanus==0.2.0 crystalball==0.2.4 dask==2.13.0 dask-ms==0.2.3 python-casacore==3.0.0

mpatir commented 4 years ago

@sjperkins Ah!! I see this crystalball is old.

Now I just need to figure out why my 4-day-old caracal installation has this old crystalball.

Thank you.

paoloserra commented 4 years ago

Was this a fresh installation in a new venv, following the instructions at https://caracal.readthedocs.io/en/latest/caracalREADME.html?

mpatir commented 4 years ago

Yes, I followed those instructions up to

pip install -U git+https://github.com/ska-sa/caracal.git

which does not actually create a caracal directory for me.

So I did:

git clone https://github.com/ska-sa/caracal.git
pip install -U --force-reinstall /caracal
stimela clean -ac

paoloserra commented 4 years ago

Mmmm, that'd be most bizarre... @sjperkins, are you sure that docker images | grep crystalball will always give the correct answer? If I do that on my machine I get a super-dodgy container which appears to be 4 months old, yet even just judging from the format of the progress bar that's definitely not right.

In fact, @mpatir, could you share the progress bar printed to screen while you were running crystalball? Crystalball 0.3.0 has its own progress bar, very different from the old one, and just from that we can probably tell which crystalball you're running.

mpatir commented 4 years ago

Also, I should mention that caracal says it has been using Crystalball 0.3.0 all this time. See the snippet below from a few days ago -- so I am not sure why the cab is the old crystalball:

2020-05-17 22:03:33 | INFO | crystalball:_predict | Crystalball version 0.3.0

sjperkins commented 4 years ago

are you sure that docker images | grep crystalball will always give the correct answer?

No, I'm not sure; it may be worth scanning through the entire docker images output.

mpatir commented 4 years ago

Okay, here is the full progress log:

running docker create --user 1006:1100 --shm-size 1gb -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/stimela_parameter_files/transfer_model_field0_ms0-13975918117291215899713857716208.json:/stimela_mount/configfile:ro -v /14TB/mramatsoku/Software/myvenv/lib/python3.6/site-packages/stimela/cargo/cab/crystalball/src:/stimela_mount/code:ro -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/passwd:/etc/passwd:rw -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/group:/etc/group:rw -v /14TB/mramatsoku/Software/myvenv/bin/stimela_runscript:/docker_run:ro -v /14TB/mramatsoku/Projects/Norma/msdir:/stimela_mount/msdir:rw -v /14TB/mramatsoku/Projects/Norma/input:/stimela_mount/input:ro -v /14TB/mramatsoku/Projects/Norma/output:/stimela_mount/output:rw -v /14TB/mramatsoku/Projects/Norma/output/tmp:/stimela_mount/output/tmp:rw -e CONFIG=/stimela_mount/configfile -e HOME=/stimela_mount/output -e STIMELA_MOUNT=/stimela_mount -e MSDIR=/stimela_mount/msdir -e INPUT=/stimela_mount/input -e OUTPUT=/stimela_mount/output -e TMPDIR=/stimela_mount/output/tmp --rm -w /stimela_mount/output --name transfer_model_field0_ms0-13975918117291215899713857716208 stimela/codex-africanus:1.3.2 /docker_run
58988d8c3621db326e91828147ae2a3cffced8917ba3d21c2b22f5b0a19ff444
Starting container [transfer_model_field0_ms0-13975918117291215899713857716208]. Timeout set to -1. The container ID is printed below.
running docker start -a transfer_model_field0_ms0-13975918117291215899713857716208
2020-05-20 10:43:09 | INFO | crystalball:_predict | Crystalball version 0.3.0
2020-05-20 10:43:11 | INFO | wsclean:import_from_wsclean | /stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt contains 42098 components
2020-05-20 10:43:11 | INFO | wsclean:import_from_wsclean | Total flux of 42098 selected components is 62.832971 Jy
2020-05-20 10:43:11 | INFO | budget:get_budget | --------------------------------------------------
2020-05-20 10:43:11 | INFO | budget:get_budget | Budgeting
2020-05-20 10:43:11 | INFO | budget:get_budget | --------------------------------------------------
2020-05-20 10:43:11 | INFO | budget:get_budget | system RAM = 251.78 GB
2020-05-20 10:43:11 | INFO | budget:get_budget | nr of logical CPUs = 56
2020-05-20 10:43:11 | INFO | budget:get_budget | nr sources = 42098
2020-05-20 10:43:11 | INFO | budget:get_budget | nr rows = 5387939
2020-05-20 10:43:11 | INFO | budget:get_budget | nr chans = 402
2020-05-20 10:43:11 | INFO | budget:get_budget | nr corrs = 2
2020-05-20 10:43:11 | INFO | budget:get_budget | sources per chunk = 3002 (auto settings)
2020-05-20 10:43:11 | INFO | budget:get_budget | rows per chunk = 300222 (auto settings)
2020-05-20 10:43:11 | INFO | budget:get_budget | expected memory usage = 125.89 GB
2020-05-20 10:43:12 | INFO | crystalball:_predict | Field ESO137-001 DDID 0 rows 5387939 chans 402 corrs 2
Successful read/write open of default-locked table /stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms: 28 columns, 5387939 rows

[##########################################] | 100% Complete (Estimate) | 2m 0s / ~ 2m 0s
[##########################################] | 100% Complete (Estimate) | 2m 5s / ~ 2m 5s
[##########################################] | 100% Complete (Estimate) | 2m10s / ~ 2m10s
[##########################################] | 100% Complete (Estimate) | 2m15s / ~ 2m15s

sjperkins commented 4 years ago

Looks like it's stimela/codex-africanus:1.3.2 according to:

https://github.com/ratt-ru/Stimela/blob/fbad7bd714433a93ec244760637697daffef2877/stimela/cargo/cab/crystalball/parameters.json#L3-L4

Can you please do a pip freeze on that container too? I believe it's running crystalball 0.3.0, but I'm wondering what else is inside the container.

docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze
paoloserra commented 4 years ago

yeah, that looks like the latest crystalball

@sjperkins is it possible that the problem that was affecting tricolour and other dask-ms applications is now affecting crystalball, too?

sjperkins commented 4 years ago

The image in https://github.com/paoloserra/crystalball/issues/37#issue-622383887 shows a spike after about 6 hours of compute and again at 12.

image

I'm inclined to blame the garbage collector in combination with very large arrays at this point. We've seen issues with Cubical where very large arrays are allocated, computed on and not garbage collected for technical reasons that I won't go into here.

In this regard I do note that memory_fraction=0.5, i.e. half the system's memory. It's no longer necessary to request this much -- I think crystalball will do well with 0.05 or 0.1. Would you be willing to try that and see if the memory issues recur?

Reducing the memory_fraction will increase the number of memory allocations and deallocations which, for "reasons", will nudge the garbage collector into behaving.
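For intuition, here is a rough sketch of that relationship (illustrative only, not crystalball's actual budgeting -- the real logic in budget:get_budget accounts for more than this). The row chunk size scales linearly with the requested memory fraction, so a smaller memory_fraction just means more, smaller chunks being allocated and freed in turn:

```python
import psutil  # psutil is already in the container's pip freeze

def rows_per_chunk(memory_fraction, bytes_per_row):
    """Illustrative only: turn a memory fraction into a row chunk size.

    bytes_per_row stands in for whatever per-row cost the real budget
    works out from channels, correlations, sources and workers.
    """
    budget = memory_fraction * psutil.virtual_memory().total
    return max(1, int(budget // bytes_per_row))

# Halving memory_fraction halves the rows per chunk and doubles the number
# of chunks: same total compute, but smaller arrays are allocated and freed
# more often, which is what nudges the garbage collector into behaving.
hypothetical_bytes_per_row = 402 * 2 * 16   # chans x corrs x sizeof(complex128)
for frac in (0.5, 0.1, 0.05):
    print(frac, rows_per_chunk(frac, hypothetical_bytes_per_row))
```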

If you want to know about the technical reasons, please shout!

sjperkins commented 4 years ago

Also, could you please post the output of the following?

docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze
paoloserra commented 4 years ago

for technical reasons that I won't go into here.

god bless you! :)

paoloserra commented 4 years ago

In this regard I do note that memory_fraction=0.5, i.e. half the system's memory. It's no longer necessary to request this much -- I think crystalball will do well with 0.05 or 0.1.

but it will be slower, right?

sjperkins commented 4 years ago

but it will be slower, right?

No, also for "technical reasons" ;-)

paoloserra commented 4 years ago

Actually, I'd like to understand this.

Lower memory-fraction -> smaller chunk size -> more chunks to process.

Is it simply that processing a Nx smaller chunk takes an Nx shorter time? When does this stop being true?

mpatir commented 4 years ago

@sjperkins @paoloserra I was in a meeting:

So I ran docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze

but it is complaining about some syntax error:

/usr/local/bin/pip: pip: line 9: syntax error near unexpected token ('
/usr/local/bin/pip: pip: line 9: sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])'

Also, I missed a lot while I was gone. Should I try with the low mem_fraction and trust that it won't take forever? I could certainly test this tomorrow when the work machines come to life again.

sjperkins commented 4 years ago

Actually, I'd like to understand this.

Lower memory-fraction -> smaller chunk size -> more chunks to process.

Is it simply that processing a Nx smaller chunk takes an Nx shorter time? When does this stop being true?

wsclean's model only uses Stokes I, not Q, U and V. It's also well-defined. This means it's no longer necessary to store O(V x S) source coherencies in memory, built from O(V) and O(S) inputs, where V = visibilities and S = sources: one can go straight to the O(V) output while still performing O(V x S) compute.
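As a loose illustration (a minimal numpy sketch of the idea, not crystalball's actual dask/codex-africanus implementation, with a hypothetical point-source model and sign convention), each source chunk's contribution is summed straight into the output visibilities, so only the O(V) output plus one chunk of coherencies is ever held in memory:

```python
import numpy as np

def predict_stokes_i(uvw, lm, flux, freqs, src_chunk=1000):
    """Point-source Stokes-I prediction, accumulating per source chunk.

    Peak memory is O(rows x chans) for the output plus one source chunk of
    coherencies, rather than O(rows x chans x sources).
    """
    wavelength = 299792458.0 / freqs                       # (nchan,)
    vis = np.zeros((uvw.shape[0], freqs.size), np.complex128)
    for s in range(0, lm.shape[0], src_chunk):
        l, m = lm[s:s + src_chunk, 0], lm[s:s + src_chunk, 1]
        n = np.sqrt(1.0 - l ** 2 - m ** 2)
        # u*l + v*m + w*(n - 1) for every (row, source) pair
        arg = (uvw[:, None, 0] * l + uvw[:, None, 1] * m
               + uvw[:, None, 2] * (n - 1.0))              # (nrow, nsrc)
        phase = np.exp(2j * np.pi * arg[:, :, None] / wavelength)
        # sum over the source axis immediately: the (row, source, chan)
        # coherency block only ever exists for the current chunk
        vis += np.einsum("s,rsc->rc", flux[s:s + src_chunk], phase)
    return vis
```

The total amount of compute is the same whatever src_chunk is, which is why a lower memory fraction doesn't have to slow things down.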

sjperkins commented 4 years ago

@sjperkins @paoloserra I was in a meeting:

So I ran docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze

but it is complaining about some syntax error:

/usr/local/bin/pip: pip: line 9: syntax error near unexpected token ('
/usr/local/bin/pip: pip: line 9: sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])'

Also, I missed a lot while I was gone. Should I try with the low mem_fraction and trust that it won't take forever? I could certainly test this tomorrow when the work machines come to life again.

Try the following (no /bin/bash)?

docker run -it stimela/codex-africanus:1.3.2 pip freeze
mpatir commented 4 years ago

@sjperkins @paoloserra I was in a meeting:

So I ran docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze

but it is complaining about some syntax error:

/usr/local/bin/pip: pip: line 9: syntax error near unexpected token ('
/usr/local/bin/pip: pip: line 9: sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])'

Also, I missed a lot while I was gone. Should I try with the low mem_fraction and trust that it won't take forever? I could certainly test this tomorrow when the work machines come to life again.

Try the following (no /bin/bash)?

docker run -it stimela/codex-africanus:1.3.2 pip freeze

Here is what comes out -- now it is the right crystalball version:

aiocontextvars==0.2.2 appdirs==1.4.3 asn1crypto==0.24.0 astropy==4.0.1.post1 codex-africanus==0.2.2 contextvars==2.4 cryptography==2.1.4 crystalball==0.3.0 dask==2.14.0 dask-ms==0.2.3 decorator==4.4.2 idna==2.6 immutables==0.11 keyring==10.6.0 keyrings.alt==3.0 llvmlite==0.31.0 loguru==0.4.1 numba==0.48.0 numpy==1.18.2 psutil==5.7.0 pycrypto==2.6.1 pygobject==3.26.1 python-apt==1.6.4 python-casacore==3.0.0 pyxdg==0.25 PyYAML==5.3.1 regions==0.4 SecretStorage==2.3.1 six==1.11.0 toolz==0.10.0 unattended-upgrades==0.1

sjperkins commented 4 years ago

Thanks @mpatir

dask-ms==0.2.3 python-casacore==3.0.0

The python-casacore == 3.0.0 could be problematic but that would manifest as a hang where no work is done. Instead we're seeing OutOfMemory errors.

Could you please try lowering the memory_fraction to 0.1?
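For reference, if you were running crystalball directly (outside CARACal), that would just mean changing --memory-fraction in the original command, along the lines of:

$ crystalball --sky-model /stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt --output-column MODEL_DATA --memory-fraction 0.1 /stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms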

mpatir commented 4 years ago

Thanks @mpatir

dask-ms==0.2.3 python-casacore==3.0.0

The python-casacore == 3.0.0 could be problematic but that would manifest as a hang where no work is done. Instead we're seeing OutOfMemory errors.

Could you please try lowering the memory_fraction to 0.1?

Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction. I will let you know.

I will keep the issue open for now, thank you

sjperkins commented 4 years ago

Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction. I will let you know.

I'll bet good Mirto that it takes the same amount of time to die.

Try it with 0.05. I'm using this setting on my laptop and the cores are fully exercised.

sjperkins commented 4 years ago

Also, if my hunch regarding the garbage collector is right, then it may not die at all!

mpatir commented 4 years ago

Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction. I will let you know.

I'll bet good Mirto that it takes the same amount of time to die.

I will take that bet :) ... for a good bottle of South African gin.

Try it with 0.05. I'm using this setting on my laptop and the cores are fully exercised.

I will try it with this very low memory fraction and see what happens.

I am not sure what the garbage collector is but I do hope it behaves.

paoloserra commented 4 years ago

cool, and if that works we should change the default memory fraction in CARACal

mpatir commented 4 years ago

Solved:

Crystalball in the current caracal version runs all the way through and is stable at mem_frac = 0.01. It takes the same amount of time to transfer the model to the visibilities as it had estimated at mem_frac = 0.5 (i.e., a lower mem_frac does not seem to slow it down), and the memory usage is stable all the way through (see plot).

crystalball_memory_usage2

@paoloserra @sjperkins Thank you very much. I will close the issue now.

@sjperkins I guess I owe you mirto now :) ... I need to get better at hedging my bets.

sjperkins commented 4 years ago

@mpatir Glad to hear that resolves the issue!