Hi @viralp, sorry about this. From the log, I see you're using an older image of ragavi. Is it possible to use a more recent ragavi image, e.g. 1.7.2?
I also found that symlinking the singularity folder to someplace without space restrictions helps. (I think you might be using /home/viralp for that, which has a limit of 20-something gigs. Could you check if that's full?)
@KshitijT my home area on Stevie has 100 GB space available. It is not full.
Ah. That was certainly one of the causes of that error on ILIFU. @Mulan-94 , I'll assign this to you then.
OK @KshitijT. @viralp, could you please run df -ih where your files are and post the output here?
@Mulan-94 this is the output of df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
udev 63M 683 63M 1% /dev
tmpfs 63M 1.3K 63M 1% /run
/dev/mapper/stevie_vg_root-stevie_lv_root 392M 8.0M 384M 3% /
tmpfs 63M 3.7K 63M 1% /dev/shm
tmpfs 63M 7 63M 1% /run/lock
tmpfs 63M 18 63M 1% /sys/fs/cgroup
/dev/sda2 120K 306 119K 1% /boot
/dev/sda1 0 0 0 - /boot/efi
/dev/sdb 1.6G 1.5M 1.6G 1% /data3
/dev/loop1 3.6K 3.6K 0 100% /snap/pycharm-community/197
/dev/loop3 13K 13K 0 100% /snap/core/9289
/dev/sdc 2.0G 6.6K 2.0G 1% /data4
/dev/loop4 13K 13K 0 100% /snap/core/9436
ray:/home 339M 6.6M 332M 2% /net/ray/home
/dev/loop5 3.6K 3.6K 0 100% /snap/pycharm-community/202
Eric is getting the same thing. Has this been addressed?
@molnard89 also got this error on ilifu while plotting gains.
Disc space was plentiful. @Jordatious helped us there, and we think that the process was writing to memory and reached the max available. So it's RAM, not disc space.
Still, this is not right because plotting gains should not really use much memory. In our case we had 230GB RAM available.
On local INAF machines, the RAM usage of the ragavi-gains is negligible indeed. What's different on ilifu?
OSError: [Errno 28] No space left on device is not a memory error. Even in cases when shared memory is used you will get a SIGBUS signal, not that.
I was running into similar issues with singularity and my pipeline when running cubical and outputting baseline dependent solutions into the current directory. Do check that no files are being output into '.', because that outputs into the home directory inside the container instead of the output directory. In singularity that has limited space on a static image.
If the paths for these output files cannot be set, you can instead try running your pipeline with docker.
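(For what it's worth, a quick way to check where '.' and '~' actually resolve inside the container is a short Python snippet run inside the cab; this is just an illustrative diagnostic, not part of CARACal:)

import os
# '.' outputs land in the current working directory of the process
print("cwd :", os.getcwd())
# '~' writes land in whatever the container thinks HOME is
print("home:", os.path.expanduser("~"))
# temp files usually go to TMPDIR (or /tmp if it is unset)
print("tmp :", os.environ.get("TMPDIR", "/tmp"))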
Still, in our case the error happened exactly at the time of a RAM usage spike which reached the maximum allocated, while disc space was plentiful.
yes, but check that it is not writing out to '.' at that point. It might just have been coincidence.
I think this may be because we've configured cgroups to kill SLURM jobs that go outside their allocation on the ilifu cluster. This was in Paolo's error logs:
slurmstepd: error: Detected 1 oom-kill event(s) in step 98872.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
It might just have been coincidence.
maybe so, but the RAM usage should stay low during a ragavi-vis run, as on local machines
Is this plotting visibilities or solutions? Were those runs for 4k or 32k data? If it were for 4k and now this is 32k visibilities, I wouldn't be surprised.
it's plotting gain solutions
That is strange then. However, the monitoring does show the memory climbing up at that stage to the 220 GB limit that was set.
indeed
Having a look through the logs, it seems the logging timestamps for ragavi are set to UTC, while they're set to the local time for CARACal/Stimela, so it's difficult to determine the timing. But it seems to be basically what you've said here, which is strange. Is this all called within one script? If so, there may be memory that hasn't been cleaned up.
Yeah, just look at the caracal time stamps, not those internal to the container.
The RAM status before starting this container is also reported in the log and, if I remember correctly, it was alright.
Ahh yes I see, you're right. From a quick google I found these, but it doesn't seem like they're related. I'll include them in case they are:
https://github.com/bokeh/bokeh/issues/9504 https://github.com/bokeh/bokeh/issues/8626
Do check that no files are being output into '.', because that outputs into the home directory inside the container instead of the output directory. In singularity that has limited space on a static image.
This rings a bell! It might be writing ragavi.log to . (or wait, @SpheMakh, does Stimela always chdir into the working directory before doing anything else? That would not be it then.)
@viralp's stack trace contains this:
File "/usr/local/lib/python3.6/dist-packages/bokeh/io/saving.py", line 150, in _save_helper
f.write(html)
OSError: [Errno 28] No space left on device
So it's writing the HTML file. But which HTML file? Is it possible that it's writing to a temp file (under ~) first, before moving that file to the intended destination path? @Mulan-94, could you poke around the Bokeh code to check this?
Also, it's always possible that some other library or package is sneakily writing stuff to ~ (~/.cache and stuff like that), on the assumption that this is a safe thing to do.
Could somebody do an lsof while the ragavi cab is running, and see what's being held open?
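(If lsof is awkward to run against the container, a rough equivalent in Python using psutil -- assuming psutil is available on the host, and matching on the command name purely for illustration -- would be:)

import psutil
# Find the running ragavi-gains process and list the files it currently holds open
for proc in psutil.process_iter(["pid", "cmdline"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if "ragavi-gains" in cmdline:
        for f in proc.open_files():
            print(proc.info["pid"], f.path)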
This is something we should change in stimela itself. The runner should chdir into the output directory and set $HOME to be the output directory in the case of running in singularity. Quite a few python libraries try to write config etc. somewhere into home, which can cause this kind of error. I've also been missing log files from e.g. politsiyakat because the working directory is not in the output directory. This is not ideal because not all file paths can be customized in all applications / libraries.
I think the first can be fixed inside the util runner Popen call as per the documentation (https://docs.python.org/3/library/subprocess.html#subprocess.Popen). The second I'm not quite sure about yet, because singularity does not accept environment variable specification in the instance start subroutine, if I recall correctly. If things are still the same as they were when I last worked with the paths inside stimela, we can set HOME in the shell launch script to a fixed path. However, one would then need two separate run scripts again - one for singularity and one for docker - because the paths are not the same.
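(For illustration, a minimal sketch of the idea -- not Stimela's actual runner code; command and output_dir below are hypothetical:)

import os
import subprocess

output_dir = "/data/output"                       # assumed host path with enough space
command = ["ragavi-gains", "--table", "cal.B0"]   # hypothetical cab command

# Point both the working directory ('.') and HOME ('~') of the child
# process at the output directory before launching it.
env = dict(os.environ, HOME=output_dir)
proc = subprocess.Popen(command, cwd=output_dir, env=env)
proc.wait()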
I think the RAM issue is probably separate from the discussion
I solved the problem by putting "export SINGULARITY_PULLFOLDER=/data3/vparekh/" into the .bashrc file, with STIMELA_IMAGES located at /data3/vparekh/. As far as I remember, ragavi was trying to make plots in the home folder, which is where it gives this error.
Alright, apologies for the confusion about RAM.
It appears that the RAM spike we got on ILIFU is not from ragavi but from meqtrees when running the crosscal: set_model container. Time-wise, ragavi comes soon after meqtrees, and I read the timestamps with insufficient precision.
The meqtrees RAM spike is also a problem, but does not belong here. I'll open a separate issue on it.
We're currently testing the above solution to the ragavi "No space left on device" issue.
@viralp can you please elaborate a little bit on your solution? From what I understand we can specify the pullfolder when we run caracal like
caracal -c config.yml -ct singularity -sid /software/astro/caracal/STIMELA_IMAGES_1.6.5/
which points to the pre-downloaded images we have on ilifu. For the cache and temporary disc space to use, following advice from @KshitijT, I added the following lines to a .bashrc file that I source in the sbatch script:
export SINGULARITY_CACHEDIR=/idia/projects/fornax/.singularity_cache
export SINGULARITY_TMPDIR=/idia/projects/fornax/.singularity_tmp
So this seems to be along the lines of your solution too. Sadly, however, we still get the same errors as before. To reiterate, there are two separate crashes during the crosscal worker run that might be related:
1) meqtrees fails to set the fluxscale
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7ff80bf03710>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7ff80bf03410>
# <Tigger.Models.ModelClasses.PolarizationWithRM object at 0x7ff80bf03710>
# ### TDL script successfully compiled. 7429 node definitions
# (of which 2 are root nodes) sent to meqserver.
# ### Running TDL job "_simulate_MS"
# Traceback (most recent call last):
# File "/usr/bin/meqtree-pipeliner.py", line 176, in <module>
# res = func(mqs,None,wait=True);
# File "/usr/local/lib/python3.6/dist-packages/Cattery/Siamese/turbo-sim.py", line 239, in _simulate_MS
# mqs.execute('VisDataMux',mssel.create_io_request(),wait=wait);
# File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 173, in execute
# return self.meq('Node.Execute',rec,wait=wait);
# File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 126, in meq
# msg = self.await(replyname,resume=True,timeout=wait);
# File "/usr/lib/python2.7/dist-packages/Timba/Apps/multiapp_proxy.py", line 524, in await
# raise RuntimeError,"lost all connections while waiting for event "+str(what);
# RuntimeError: lost all connections while waiting for event Result.Node.execute.1
# ### Job terminated with exception:
# ### No more commands
# ### The meqserver appears to have died on us :( Please check for core files and such.
# ### All your batch are not belong to us, returning with error code
# 2020-09-10 06:48:00: meqtree-pipeliner.py exited with code 1
this coincides with a massive RAM spike that eventually hits the ilifu maximum of 220 GB. At this point meqtrees defaults back to a central point source with a flat spectrum and 1 Jy flux density, and the pipeline continues (we might need to open a separate issue on this).
2) when trying to plot bandpass gains the pipeline finally crashes with
# running cd /idia/projects/fornax/data_processing/comdata/.stimela_workdir-1599718759586301 && singularity run --workdir /idia/projects/fornax/data_processing/comdata/.stim$
# 2020-09-10 07:43:35: Initial memory state:
# 2020-09-10 07:43:35: total used free shared buff/cache available
# 2020-09-10 07:43:35: Mem: 236G 2.3G 169G 996K 64G 231G
# 2020-09-10 07:43:35: Swap: 0B 0B 0B
# 2020-09-10 07:43:35: Running ragavi-gains --table /stimela_mount/msdir/zoomCom-1588316462_sdp_l0-1gc1_primary.B0 --gaintype B --corr --cmap coolwarm --doplot ap --field 1$
# 10.09.2020@07:43:39 - ragavi - INFO - Acquiring table: zoomCom-1588316462_sdp_l0-1gc1_primary.B0
# 10.09.2020@07:43:41 - ragavi - INFO - Table type: B Jones
# 10.09.2020@07:43:41 - ragavi - INFO - Spw: 0, Field: J0408-6545, Corr: 0 amplitude
# 10.09.2020@07:43:41 - ragavi - INFO - Table type: B Jones
# 10.09.2020@07:43:58 - ragavi - INFO - Spw: 0, Field: J0408-6545, Corr: 1 amplitude
# 10.09.2020@07:43:58 - ragavi - INFO - Table type: B Jones
# 10.09.2020@07:44:12 - ragavi - INFO - Spw: 0, Field: J0408-6545, Corr: 0 phase
# 10.09.2020@07:44:13 - ragavi - INFO - Table type: B Jones
# 10.09.2020@07:44:29 - ragavi - INFO - Spw: 0, Field: J0408-6545, Corr: 1 phase
# 10.09.2020@07:44:29 - ragavi - INFO - Table type: B Jones
# 10.09.2020@07:44:55 - ragavi - INFO - Table /stimela_mount/msdir/zoomCom-1588316462_sdp_l0-1gc1_primary.B0 done.
# 10.09.2020@07:45:00 - ragavi - ERROR - Oops ... !
# Traceback (most recent call last):
# File "/usr/local/bin/ragavi-gains", line 19, in <module>
# main(options=options)
# File "/usr/local/lib/python3.6/dist-packages/ragavi/ragavi.py", line 2115, in main
# save_html(html_name, final_layout)
# File "/usr/local/lib/python3.6/dist-packages/ragavi/ragavi.py", line 1501, in save_html
# output = save(plot_layout, name, title=name)
# File "/usr/local/lib/python3.6/dist-packages/bokeh/io/saving.py", line 85, in save
# _save_helper(obj, filename, resources, title, template)
# File "/usr/local/lib/python3.6/dist-packages/bokeh/io/saving.py", line 150, in _save_helper
# f.write(html)
# OSError: [Errno 28] No space left on device
# 2020-09-10 07:45:01: ragavi-gains exited with code 1
so it's a similar issue to the one discussed here.
Addendum: if I reduce the number of channels from ~8.2k to 100 and run crosscal with the same config file, none of the above issues are present and I get correctly flux-scaled output, plotted bandpass gains and no crash. So at least we know they are both due to large data volume and resource limitations, and are therefore indeed connected.
The meqtrees RAM issue happens also outside ILIFU with Docker, so I think it's not related to the ragavi disc space issue discussed here. I've opened #1236 to discuss the meqtrees issue. Let's refocus this one on ragavi @ ILIFU/Singularity.
Addendum: if I reduce the number of channels from ~8.2k to 100 and run crosscal with the same config file, none of the above issues are present and I get correctly flux-scaled output, plotted bandpass gains and no crash. So at least we know they are both due to large data volume and resource limitations, and are therefore indeed connected.
This is very possible I think, because 8k channels for 64 antennas (I assume) and all the bandpass scans will already lead to a really huge HTML file (probably >>200 MB). As a consequence, the plot will be non-interactive, if it is produced at all. I could force ragavi to only produce PNG files in such a case; maybe this will help with this issue?
Yep @Mulan-94, I agree HTML plots don't make sense at that size. You might even want to switch to datashader in that regime -- the point density is so high that regular scatterplots won't look all that good anyway, but that's a different conversation. For starters, we just need an option (in ragavi, and exposed in the worker) to turn HTML plots on and off.
It still shouldn't fail like that, since the output directory has enough space for any size of plot. Which lends credence to my theory that Bokeh is first writing the HTML to a temp file somewhere under ~ ...
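(One way to probe this outside CARACal is to reproduce the failing save step standalone with an explicit absolute path, then watch -- e.g. with the lsof suggestion above -- where the bytes actually land. The paths below are illustrative, not ragavi's code:)

import os
from bokeh.plotting import figure
from bokeh.io import save

out_dir = "/stimela_mount/output"                 # assumed mounted output directory
p = figure(title="bokeh write test")
p.line([0, 1, 2], [0, 1, 4])
# Save to an explicit absolute path; if the same error still appears, the write
# really is going somewhere else (a temp file, ~, ...) before the final move.
save(p, filename=os.path.join(out_dir, "write_test.html"), title="write_test")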
For starters, we just need an option (in ragavi, and exposed in the worker) to turn HTML plots on and off.
cool, I'll work on this then
In case it's useful information, keep in mind that none of this happens on Docker. So I think there's still merit in making BP plots even in the case of 8k channels or more. This might just be a Singularity issue. In fact, @viralp said that there's a solution (though we need more details, it's not working for us).
The HTML report for 8k channels is ~138 MB and it's not impossible to navigate. Sure, it's a bit slow, but it's still leagues faster for identifying the source of terrible outliers than making tons of non-interactive PNGs. So I think there is merit in making it work on ilifu, and as @paoloserra said, it works fine already using Docker.
OK well @Mulan-94 should carry on debugging according to my visionary theory then. :P
The difference between Docker and Singularity is that inside Docker, ~ is the native ~, while inside Singularity, ~ is a small virtual filesystem baked into the container. So things that like to write to ~ will fail.
but @o-smirnov, do you then understand @viralp's fix https://github.com/caracal-pipeline/caracal/issues/1200#issuecomment-688732084 ? It's not working for @molnard89, but maybe we're doing something wrong. See https://github.com/caracal-pipeline/caracal/issues/1200#issuecomment-690140705
No, honestly, I don't understand why @viralp's fix worked at all. But then it's not a visionary's job to understand. :P
I remember @SpheMakh struggling with the ~ issue in Singularity, so let's ask him to chime in.
Also, let's not forget to try @bennahugo's suggestion above. Add os.environ['HOME'] = os.getcwd() at the top of run.py in the ragavi cab, to see if it changes things. You may need to pip install -e Stimela after that.
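(A minimal sketch of that change, assuming the cab's run.py is plain Python and the rest of its content stays as-is:)

import os

# Redirect '~' to the cab's working directory before any library
# gets a chance to write config/cache/temp files under the container's HOME.
os.environ["HOME"] = os.getcwd()

# ... original run.py content follows unchanged ...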
I've not seen any issues with ~ inside singularity before. It's always referred to home for me, and I can write there. Have SINGULARITY_TMPDIR, SINGULARITY_LOCALCACHEDIR and SINGULARITY_CACHEDIR been exported?
@Jordatious in the sbatch script I'm sourcing a .bashrc file that contains
export SINGULARITY_CACHEDIR=/idia/projects/fornax/.singularity_cache
export SINGULARITY_TMPDIR=/idia/projects/fornax/.singularity_tmp
So I guess we didn't export SINGULARITY_LOCALCACHEDIR, but the rest is there. I'll have another try later adding SINGULARITY_LOCALCACHEDIR too. Otherwise, does this look fine to you?
@molnard89 SINGULARITY_LOCALCACHEDIR seems to be the relevant one here. The other two seem to relate to building containers, while SINGULARITY_LOCALCACHEDIR is a runtime directory that's specifically mentioned as desirable for OpenStack, which is what we're using. So I'd suggest you set this.
I think @Mulan-94 is onto the answer here: https://github.com/ratt-ru/ragavi/issues/78. Told ya bokeh must have been writing to a temp file at a location of its own choosing.... :)
Great news!
@Mulan-94 can you give me an update on this?
On ILIFU we're on Stimela 1.6.7, but if this is fixed on Stimela master or on a Stimela branch it would be good to know.
Hi @paoloserra, this was already fixed in ragavi; the branch containing the changes in stimela is: https://github.com/Mulan-94/Stimela.git@update_ragavi
I already made a pull request and am waiting for it to be merged to master.
I think this is fixed now. Please reopen if necessary.
I am getting a similar error as on the ilifu cluster. I am using Caracal on the Stevie machine. As @gigjozsa pointed out in #625, I did rm ~/.stimela/* and softlinked the stimela image directory to the user area. However, I am getting the following error, OSError: [Errno 28] No space left on device, even though I have 5 TB of space. log-crosscal-plotgains-B-0-0-20200707-231547.txt
Traceback (most recent call last):
File "/usr/local/bin/ragavi-gains", line 19, in <module>
main(options=options)
File "/usr/local/lib/python3.6/dist-packages/ragavi/ragavi.py", line 2115, in main
save_html(html_name, final_layout)
File "/usr/local/lib/python3.6/dist-packages/ragavi/ragavi.py", line 1501, in save_html
output = save(plot_layout, name, title=name)
File "/usr/local/lib/python3.6/dist-packages/bokeh/io/saving.py", line 85, in save
_save_helper(obj, filename, resources, title, template)
File "/usr/local/lib/python3.6/dist-packages/bokeh/io/saving.py", line 150, in _save_helper
f.write(html)
OSError: [Errno 28] No space left on device
2020-07-08 01:51:58: ragavi-gains exited with code 1
cd /data3/vparekh/Saraswati/.stimela_workdir-1594164649989506 && singularity returns error code 1
job failed at 2020-07-08 03:51:58.368071 after 0:03:27.178748
Traceback (most recent call last):
File "/home/vparekh/Stimela/stimela/recipe.py", line 693, in run
job.run_job()
File "/home/vparekh/Stimela/stimela/recipe.py", line 418, in run_job
self.job.run(output_wrangler=self.apply_output_wranglers)
File "/home/vparekh/Stimela/stimela/singularity.py", line 128, in run
env=self._env, logfile=self.logfile)
File "/home/vparekh/Stimela/stimela/utils/xrun_poll.py", line 189, in xrun
raise StimelaCabRuntimeError("{} returns error code {}".format(command_name, status))
stimela.utils.StimelaCabRuntimeError: cd /data3/vparekh/Saraswati/.stimela_workdir-1594164649989506 && singularity returns error code 1