Closed mpatir closed 4 years ago
@mpatir Can you run a pip freeze inside the caracal crystalball container and post the output here?
@sjperkins "caracal crystalball container" -- is this just the normal container? Or does crystalball use other container tech?
I am running caracal inside docker.
below is what I get when I run pip freeze.
APLpy==2.0.3
appdirs==1.4.4
astropy==4.0.1.post1
astropy-healpix==0.5
astroquery==0.4
attrs==19.3.0
backcall==0.1.0
beautifulsoup4==4.9.1
bleach==3.1.5
bokeh==2.0.2
caracal @ file:///14TB/mramatsoku/Software/caracal
certifi==2020.4.5.1
cffi==1.14.0
chardet==3.0.4
cryptography==2.9.2
cycler==0.10.0
Cython==0.29.18
decorator==4.4.2
defusedxml==0.6.0
docopt==0.6.2
entrypoints==0.3
future-fstrings==1.2.0
html5lib==1.0.1
idna==2.9
imagecodecs==2020.2.18
imageio==2.8.0
importlib-metadata==1.6.0
ipdb==0.13.2
ipython==7.14.0
ipython-genutils==0.2.0
jedi==0.17.0
jeepney==0.4.3
Jinja2==2.11.2
jsonschema==3.2.0
jupyter-core==4.6.3
keyring==21.2.1
kiwisolver==1.2.0
MarkupSafe==1.1.1
matplotlib==3.2.1
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.6
networkx==2.4
nose==1.3.7
npyscreen==4.10.5
numpy==1.18.4
packaging==20.4
pandas==1.0.3
pandocfilters==1.4.2
parso==0.7.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.1.2
pkg-resources==0.0.0
pooch==1.1.0
progressbar2==3.51.3
prompt-toolkit==3.0.5
ptyprocess==0.6.0
PyAVM==0.9.4
pycparser==2.20
Pygments==2.6.1
pykwalify==1.7.0
pyparsing==2.4.7
pyregion==2.0
pyrsistent==0.16.0
pysolr==3.9.0
python-dateutil==2.8.1
python-utils==2.4.0
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
radiopadre-client==1.0rc10
regions==0.4
reproject==0.7
requests==2.23.0
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
scikit-image==0.17.2
scipy==1.4.1
SecretStorage==3.1.2
Shapely==1.7.0
six==1.14.0
soupsieve==2.0.1
stimela==1.5.4
testpath==0.4.4
tifffile==2020.5.11
tornado==6.0.4
traitlets==4.3.3
typing-extensions==3.7.4.2
urllib3==1.25.9
wcwidth==0.1.9
webencodings==0.5.1
yamlordereddictloader==0.4.0
zipp==3.1.0
Hmmm, sounds like you're running Caracal inside a container, which in turn runs astronomy applications in containers -- containerception.
I don't run Caracal a lot myself.
I need you to go one level deeper and get a pip freeze of whatever's inside the crystalball container. @SpheMakh @paoloserra could you provide @mpatir with a command to do this?
@mpatir must be running CARACal in the standard way, i.e., within a virtual environment.
She's using current CARACal master which, in turn, uses Stimela 1.5.4 and, therefore, Crystalball 0.3.0. All this is with Docker (the other option would have been Singularity).
Sorry but I'm not sure what "pip freeze inside the crystalball container" means.
> @mpatir must be running CARACal in the standard way, i.e., within a virtual environment.
> She's using current CARACal master which, in turn, uses Stimela 1.5.4 and, therefore, Crystalball 0.3.0. All this is with Docker (the other option would have been Singularity).
I would like to confirm this.
I mean run pip freeze inside the container. Something like:
$ docker run -it image/crystalball pip freeze
where image/crystalball should be replaced with whatever the actual crystalball container is.
@mpatir Was the machine running other radio astronomy software in conjunction with crystalball?
@sjperkins Okay I suppose the actual crystalball container would be somewhere inside Stimela @SpheMakh do you know how I could find this?
The machine was only running crystalball, inside Caracal (in the virtual env, using Docker). No other astronomy software -- this was already after selfcal, which uses a bunch of astro software, was done.
> @sjperkins Okay I suppose the actual crystalball container would be somewhere inside Stimela. @SpheMakh do you know how I could find this?
Try the following (if Caracal is using Docker) to get the image name. I have no experience with Singularity.
$ docker images | grep crystalball
For example, on my machine docker images produces:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
casadev latest 47b4cdde2dfa 7 months ago 910MB
kernsuite/base 5 94c877890bc0 10 months ago 191MB
tricolour-test latest 4869dcea3b2b 10 months ago 1.43GB
ansible test 4a20382178af 12 months ago 246MB
ubuntu 18.04 7698f282e524 12 months ago 69.9MB
alpine latest 055936d39205 12 months ago 5.53MB
kernsuite/base <none> befa68388c76 16 months ago 221MB
busybox latest 59788edf1f3e 19 months ago 1.15MB
kernsuite/base dev 1c14129ea271 2 years ago 109MB
cubical latest 52478bd4d731 2 years ago 960MB
kernsuite/base 3 02e463d74e9d 2 years ago 257MB
xenial/obit dev d18648fecbb6 2 years ago 3.65GB
ubuntu xenial-20170802 ccc7a11d65b1 2 years ago 120MB
ubuntu xenial-20170619 d355ed3537e9 2 years ago 119MB
ubuntu trusty-20170620 4a2820e686c4 2 years ago 188MB
ubuntu 16.04 f49eec89601e 3 years ago 129MB
nvidia/cuda 7.5-cudnn5-devel 886fbf107999 3 years ago 1.35GB
sjperkins/dist-mb-paper latest f147c386a7a7 3 years ago 1.47GB
ubuntu xenial-20161010 f753707788c5 3 years ago 127MB
ceph/daemon latest 759578d288cb 3 years ago 1.08GB
ubuntu trusty-20160819 4a725d3b3b1c 3 years ago 188MB
sjperkins/dev latest e4227df2d2d8 3 years ago 1.67GB
sjperkins/squid 3.5.19-r0 83a9bd3ccba9 3 years ago 22MB
alpine 3.4 f70c828098f5 3 years ago 4.8MB
nvidia/cuda latest c92f3b5f6760 4 years ago 1.23GB
alpine 3.3 13e1761bf172 4 years ago 4.8MB
ubuntu 14.04 90d5884b1ee0 4 years ago 188MB
The image name is in the first column.
If this is a singularity install we'll need @SpheMakh's help.
Sorry, the image name is the combination of REPOSITORY and TAG, so in the above:
$ docker run -it kernsuite/base:5 /bin/bash pip freeze
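A quicker way to get the image name, assuming a standard docker CLI, is to have docker print one repository:tag per line and filter that, instead of eyeballing the whole table -- a small sketch:

```shell
# Print each image as repository:tag (one per line) using docker's
# Go-template --format option, then filter; guarded so it no-ops
# on machines without docker, and grep's no-match exit is ignored.
if command -v docker >/dev/null 2>&1; then
    docker images --format '{{.Repository}}:{{.Tag}}' | grep -i crystalball || true
fi
```

The match (if any) is exactly the name to pass to docker run.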
I think I found the container thing: I ran "docker run -it mramatsoku_cab/crystalball pip freeze"
@sjperkins is below what you are looking for?
appdirs==1.4.3
asn1crypto==0.24.0
astropy==4.0.1
codex-africanus==0.2.0
cryptography==2.1.4
crystalball==0.2.4
dask==2.13.0
dask-ms==0.2.3
decorator==4.4.2
idna==2.6
keyring==10.6.0
keyrings.alt==3.0
llvmlite==0.31.0
numba==0.48.0
numpy==1.18.2
psutil==5.7.0
pycrypto==2.6.1
pygobject==3.26.1
python-apt==1.6.4
python-casacore==3.0.0
pyxdg==0.25
PyYAML==5.3.1
regions==0.4
SecretStorage==2.3.1
six==1.11.0
toolz==0.10.0
unattended-upgrades==0.1
> I think I found the container thing: I ran "docker run -it mramatsoku_cab/crystalball pip freeze"
> @sjperkins is below what you are looking for?
Yes indeed! Thanks @mpatir.
This container looks old: it has the crystalball 0.2.4 version from 7 January, not the 0.3.0 release from 9 April.
codex-africanus==0.2.0 crystalball==0.2.4 dask==2.13.0 dask-ms==0.2.3 python-casacore==3.0.0
@sjperkins Ah!! I see, this crystalball is old.
Now I just need to figure out why my 4-day-old caracal installation has this old crystalball.
Thank you.
was this a fresh installation on a new venv following the instructions at https://caracal.readthedocs.io/en/latest/caracalREADME.html ?
Yes, I followed those instructions until pip install -U git+https://github.com/ska-sa/caracal.git, which does not actually create a caracal directory for me.
So I did
git clone https://github.com/ska-sa/caracal.git
pip install -U --force-reinstall
mmmm, that'd be most bizarre... @sjperkins, are you sure that docker images | grep crystalball will always give the correct answer? If I do that on my machine I get a super-dodgy container which appears to be 4 months old, while even just judging from the format of the progress bar that's definitely not right.
In fact, @mpatir , could you share the progress bar printed to screen while you were running crystalball? Crystalball 0.3.0 has its own progress bar, very different from the old one, and just from that we can probably tell which crystalball you're running.
Also, I should mention that caracal says it has been using crystalball 0.3.0 all this time. See the snippet below from a few days ago -- so I am not sure why the cab is the old crystalball:
2020-05-17 22:03:33 | INFO | crystalball:_predict | Crystalball version 0.3.0
> are you sure that docker images | grep crystalball will always give the correct answer?
No I'm not sure, it may be worth scanning through the entire docker images output.
okay here is the full progress traceback:

running docker create --user 1006:1100 --shm-size 1gb -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/stimela_parameter_files/transfer_model_field0_ms0-13975918117291215899713857716208.json:/stimela_mount/configfile:ro -v /14TB/mramatsoku/Software/myvenv/lib/python3.6/site-packages/stimela/cargo/cab/crystalball/src:/stimela_mount/code:ro -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/passwd:/etc/passwd:rw -v /14TB/mramatsoku/Projects/Norma/.stimela_workdir-15899713002257626/group:/etc/group:rw -v /14TB/mramatsoku/Software/myvenv/bin/stimela_runscript:/docker_run:ro -v /14TB/mramatsoku/Projects/Norma/msdir:/stimela_mount/msdir:rw -v /14TB/mramatsoku/Projects/Norma/input:/stimela_mount/input:ro -v /14TB/mramatsoku/Projects/Norma/output:/stimela_mount/output:rw -v /14TB/mramatsoku/Projects/Norma/output/tmp:/stimela_mount/output/tmp:rw -e CONFIG=/stimela_mount/configfile -e HOME=/stimela_mount/output -e STIMELA_MOUNT=/stimela_mount -e MSDIR=/stimela_mount/msdir -e INPUT=/stimela_mount/input -e OUTPUT=/stimela_mount/output -e TMPDIR=/stimela_mount/output/tmp --rm -w /stimela_mount/output --name transfer_model_field0_ms0-13975918117291215899713857716208 stimela/codex-africanus:1.3.2 /docker_run
Starting container [transfer_model_field0_ms0-13975918117291215899713857716208]. Timeout set to -1. The container ID is printed below.
58988d8c3621db326e91828147ae2a3cffced8917ba3d21c2b22f5b0a19ff444
running docker start -a transfer_model_field0_ms0-13975918117291215899713857716208
2020-05-20 10:43:09 | INFO | crystalball:_predict | Crystalball version 0.3.0
2020-05-20 10:43:11 | INFO | wsclean:import_from_wsclean | /stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt contains 42098 components
2020-05-20 10:43:11 | INFO | wsclean:import_from_wsclean | Total flux of 42098 selected components is 62.832971 Jy
2020-05-20 10:43:11 | INFO | budget:get_budget | --------------------------------------------------
2020-05-20 10:43:11 | INFO | budget:get_budget | Budgeting
2020-05-20 10:43:11 | INFO | budget:get_budget | --------------------------------------------------
2020-05-20 10:43:11 | INFO | budget:get_budget | system RAM = 251.78 GB
2020-05-20 10:43:11 | INFO | budget:get_budget | nr of logical CPUs = 56
2020-05-20 10:43:11 | INFO | budget:get_budget | nr sources = 42098
2020-05-20 10:43:11 | INFO | budget:get_budget | nr rows = 5387939
2020-05-20 10:43:11 | INFO | budget:get_budget | nr chans = 402
2020-05-20 10:43:11 | INFO | budget:get_budget | nr corrs = 2
2020-05-20 10:43:11 | INFO | budget:get_budget | sources per chunk = 3002 (auto settings)
2020-05-20 10:43:11 | INFO | budget:get_budget | rows per chunk = 300222 (auto settings)
2020-05-20 10:43:11 | INFO | budget:get_budget | expected memory usage = 125.89 GB
2020-05-20 10:43:12 | INFO | crystalball:_predict | Field ESO137-001 DDID 0 rows 5387939 chans 402 corrs 2
Successful read/write open of default-locked table /stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms: 28 columns, 5387939 rows
[##########################################] | 100% Complete (Estimate) | 2m 0s / ~ 2m 0s
[##########################################] | 100% Complete (Estimate) | 2m 5s / ~ 2m 5s
[##########################################] | 100% Complete (Estimate) | 2m10s / ~ 2m10s
[##########################################] | 100% Complete (Estimate) | 2m15s / ~ 2m15s
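For what it's worth, the budgeting numbers in that log can be sanity-checked by hand: the "expected memory usage" line is just memory_fraction × system RAM, and the array size in the eventual MemoryError follows from rows per chunk × chans × 16 bytes for complex128. A rough sketch (illustrative only; the real budget also folds in sources per chunk and correlations):

```python
# Figures copied from the budgeting log above.
system_ram_gb = 251.78
memory_fraction = 0.5      # the --memory-fraction setting used in this run
rows_per_chunk = 300222
chans = 402
complex128_bytes = 16

budget_gb = memory_fraction * system_ram_gb
chunk_gib = rows_per_chunk * chans * complex128_bytes / 2**30

print(f"budget = {budget_gb:.2f} GB")                     # 125.89 GB, as reported by budget:get_budget
print(f"one (rows, chans) array = {chunk_gib:.2f} GiB")   # ~1.80 GiB, the size in the MemoryError
```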
Looks like it's stimela/codex-africanus:1.3.2 according to the log above.
Can you please do a pip freeze on that container too? I believe that it's running crystalball 0.3.0, but I'm wondering what else is inside the container.
$ docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze
yeah, that looks like the latest crystalball
@sjperkins is it possible that the problem that was affecting tricolour and other dask-ms applications is now affecting crystalball, too?
The image in https://github.com/paoloserra/crystalball/issues/37#issue-622383887 shows a spike after about 6 hours of compute and again at 12.
I'm inclined to blame the garbage collector in combination with very large arrays at this point. We've seen issues with Cubical where very large arrays are allocated, computed on and not garbage collected for technical reasons that I won't go into here.
In this regard I do note that memory_fraction=0.5, i.e. half the system's memory. It's no longer necessary to request this much -- I think crystalball will do well with 0.05 or 0.1. Would you be willing to try that and see if the memory issues re-occur?
Reducing the memory_fraction will increase the number of memory allocations and deallocations which, for "reasons", will nudge the garbage collector into behaving.
If you want to know about the technical reasons, please shout!
Also, could you please post the output of the following?
docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze
> for technical reasons that I won't go into here.
god bless you! :)
> In this regard I do note that memory_fraction=0.5, i.e. half the system's memory. It's no longer necessary to request this much -- I think crystalball will do well with 0.05 or 0.1.
but it will be slower, right?
> but it will be slower, right?
No, also for "technical reasons" ;-)
Actually, I'd like to understand this.
Lower memory-fraction -> smaller chunk size -> more chunks to process.
Is it simply that processing a Nx smaller chunk takes an Nx shorter time? When does this stop being true?
@sjperkins @paoloserra I was in a meeting:
So I ran docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze
but it is complaining about some syntax error:
/usr/local/bin/pip: line 9: syntax error near unexpected token `('
/usr/local/bin/pip: line 9: `sys.argv[0] = re.sub(r'(-script.pyw?|.exe)?$', '', sys.argv[0])'
Also, I missed a lot while I was gone: should I try with the low mem_fraction and trust that it won't take forever? I could certainly test this tomorrow when the work machines come to life again.
> Actually, I'd like to understand this.
> Lower memory-fraction -> smaller chunk size -> more chunks to process.
> Is it simply that processing a Nx smaller chunk takes an Nx shorter time? When does this stop being true?
wsclean's model only uses I, not Q, U and V. It's also well-defined. This means that it's no longer necessary to store O(V x S) source coherencies in memory from O(V) and O(S) inputs, where V = visibilities and S = sources. One can go straight to the O(V) output while performing O(V x S) compute.
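A toy illustration of that point (not crystalball's actual code -- just numpy with made-up per-source and per-visibility terms): the naive route materialises a (V, S) array before reducing, while the streaming route accumulates each source directly into the O(V) output.

```python
import numpy as np

V, S = 10_000, 50  # visibilities, sources (small, illustrative sizes)
rng = np.random.default_rng(0)
phase = rng.uniform(0, 2 * np.pi, size=V)  # stand-in per-visibility term
flux = rng.uniform(0.0, 1.0, size=S)       # stand-in per-source term

# Naive: build the full O(V x S) source-coherency array, then reduce.
# Peak memory grows with V * S * 16 bytes (complex128).
coherencies = flux[None, :] * np.exp(1j * phase)[:, None]  # shape (V, S)
vis_naive = coherencies.sum(axis=1)

# Streaming: same O(V x S) compute, but only O(V) memory held at once.
vis_stream = np.zeros(V, dtype=np.complex128)
for s in range(S):
    vis_stream += flux[s] * np.exp(1j * phase)

assert np.allclose(vis_naive, vis_stream)
```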
> So I ran docker run -it stimela/codex-africanus:1.3.2 /bin/bash pip freeze but it is complaining about some syntax error.
Try the following (no /bin/bash)?
docker run -it stimela/codex-africanus:1.3.2 pip freeze
> Try the following (no /bin/bash)?
> docker run -it stimela/codex-africanus:1.3.2 pip freeze
Here is what comes out -- now it is the right crystalball version:
aiocontextvars==0.2.2
appdirs==1.4.3
asn1crypto==0.24.0
astropy==4.0.1.post1
codex-africanus==0.2.2
contextvars==2.4
cryptography==2.1.4
crystalball==0.3.0
dask==2.14.0
dask-ms==0.2.3
decorator==4.4.2
idna==2.6
immutables==0.11
keyring==10.6.0
keyrings.alt==3.0
llvmlite==0.31.0
loguru==0.4.1
numba==0.48.0
numpy==1.18.2
psutil==5.7.0
pycrypto==2.6.1
pygobject==3.26.1
python-apt==1.6.4
python-casacore==3.0.0
pyxdg==0.25
PyYAML==5.3.1
regions==0.4
SecretStorage==2.3.1
six==1.11.0
toolz==0.10.0
unattended-upgrades==0.1
Thanks @mpatir
dask-ms==0.2.3 python-casacore==3.0.0
python-casacore==3.0.0 could be problematic, but that would manifest as a hang where no work is done. Instead we're seeing OutOfMemory errors.
Could you please try lowering the memory_fraction to 0.1?
> Thanks @mpatir
> dask-ms==0.2.3 python-casacore==3.0.0
> python-casacore==3.0.0 could be problematic, but that would manifest as a hang where no work is done. Instead we're seeing OutOfMemory errors.
> Could you please try lowering the memory_fraction to 0.1?
Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction. I will let you know.
I will keep the issue open for now, thank you
> Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction.
I'll bet good Mirto that it takes the same amount of time to die either way.
Try it with 0.05. I'm using this setting on my laptop and the cores are fully exercised.
Also, if my hunch regarding the garbage collector is right, then it may not die at all!
> Okay, I will try with this and let you know when it eventually dies. Crystalball normally dies after 17 hours or so; it might take longer to die with the lower fraction.
> I'll bet good Mirto that it takes the same amount of time to die either way.
I will take this action :) ..for a good bottle of south african Gin
> Try it with 0.05. I'm using this setting on my laptop and the cores are fully exercised.
I will try it with this very low memory fraction and see what happens.
I am not sure what the garbage collector is but I do hope it behaves.
cool, and if that works we should change the default memory fraction in CARACal
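For reference, the knob under discussion surfaces in the CARACal config as mem_frac (the name used later in this thread for the transfer_model step). A hypothetical fragment -- section and key names assumed, so check the CARACal schema for the exact spelling:

```yaml
# Hypothetical CARACal config fragment: only mem_frac and the
# transfer_model step name are taken from this thread.
transfer_model:
  enable: true
  mem_frac: 0.1   # down from 0.5; 0.01 also worked in this thread
```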
Solved:
Crystalball in the current caracal version runs through all the way and is stable at mem_frac = 0.01. It takes the same amount of time to transfer the model to visibilities as it had estimated at mem_frac = 0.5 (i.e., a lower mem_frac does not seem to slow it down), and the memory usage is stable all the way through (see plot).
@paoloserra @sjperkins Thank you very much. I will close the issue now.
@sjperkins I guess I owe you mirto now :) ... I need to get better at hedging my bets.
@mpatir Glad to hear that resolves the issue!
Description
Trying to transfer the sky model to my visibilities. Crystalball crashed after about 17 hours of processing. It appears to keep increasing memory usage, despite my requesting only 50% of total RAM, until it reaches the machine's maximum RAM (see plot).
What I Did
Command '['crystalball', '--sky-model', '/stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt', '--output-column', 'MODEL_DATA', '--row-chunks', '0', '--model-chunks', '0', '--num-sources', '0', '--num-workers', '0', '--memory-fraction', '0.5', '/stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms']
Traceback (not full):

MemoryError: Unable to allocate 1.80 GiB for an array with shape (1, 300222, 402, 1) and data type complex128
Traceback (most recent call last):
  File "/stimela_mount/code/run.py", line 40, in <module>
    subprocess.check_call(shlex.split(_runc))
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['crystalball', '--sky-model', '/stimela_mount/output/continuum/image_5/mypipelinerun_ESO137_001_5-sources.txt', '--output-column', 'MODEL_DATA', '--row-chunks', '0', '--model-chunks', '0', '--num-sources', '0', '--num-workers', '0', '--memory-fraction', '0.5', '/stimela_mount/msdir/Sub1557347448_sdp_l0-ESO137_001_corr.ms']' returned non-zero exit status 1.
docker returns error code 1
job failed at 2020-05-21 04:26:36.481523 after 17:43:30.590694
Traceback (most recent call last):