casacore / python-casacore

Python bindings for casacore, a library used in radio astronomy
http://casacore.github.io/python-casacore
GNU Lesser General Public License v3.0

Argument list too long #245

Open tjgalvin opened 1 year ago

tjgalvin commented 1 year ago

Hi all,

A slightly strange issue popped up that has left me scratching my head.

I was processing a collection of measurement sets in a pipeline. An early stage iterates over the rows of the data table of a single measurement set in chunks, applies a rotation correction to the visibilities, and writes them back out. The code is available here: https://github.com/AlecThomson/FixMS/blob/main/fixms/fix_ms_corrs.py#L264
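
The chunked read-modify-write pattern described above can be sketched roughly as follows. This is a minimal illustration, not the FixMS implementation: `apply_in_chunks`, `chunksize`, and `func` are hypothetical names, and the only python-casacore table methods it assumes are `nrows`, `getcol`, `putcol`, and `flush` on a table opened with `readonly=False`.

```python
def apply_in_chunks(tab, column, func, chunksize=10000):
    """Apply func to `column` of `tab` in row chunks, writing results back.

    `tab` is assumed to behave like a casacore.tables.table opened with
    readonly=False. `func` is a hypothetical per-chunk correction that
    returns data of the same shape as its input.
    """
    nrows = tab.nrows()
    start_row = 0
    while start_row < nrows:
        # Clamp the final chunk so we never ask for rows past the end.
        nrow = min(chunksize, nrows - start_row)
        chunk = tab.getcol(column, startrow=start_row, nrow=nrow)
        tab.putcol(column, func(chunk), startrow=start_row, nrow=nrow)
        # Push the modified rows to disk before moving to the next chunk.
        tab.flush()
        start_row += nrow
    return start_row
```

With python-casacore installed, this would be called on a table opened as a context manager, for example `with table(ms_path, readonly=False) as tab: apply_in_chunks(tab, "DATA", rotate)`, where `ms_path` and `rotate` are placeholders.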

Recently I was running a hefty series of jobs and stumbled on this error:

Encountered exception during execution:
Traceback (most recent call last):
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/fixms/fix_ms_corrs.py", line 330, in fix_ms_corrs
    tab.flush()
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/casacore/tables/table.py", line 557, in flush
    self._flush(recursive)
RuntimeError: FiledesIO::write - write error in /scratch3/gal16b/split/39403/2022-04-14_110035_18.RACS.0748-43.ms/table.f1: Argument list too long

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/prefect/engine.py", line 1719, in orchestrate_task_run
    result = await call.aresult()
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 292, in aresult
    return await asyncio.wrap_future(self.future)
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/prefect/_internal/concurrency/calls.py", line 316, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
  File "/scratch3/gal16b/packages/flint/flint/ms.py", line 473, in preprocess_askap_ms
    fix_ms_corrs(
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/fixms/fix_ms_corrs.py", line 331, in fix_ms_corrs
    start_row += len(data_chunk_cor)
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/casacore/tables/table.py", line 406, in __exit__
    self.close()
  File "/scratch3/gal16b/mambaforge/envs/flint/lib/python3.8/site-packages/casacore/tables/table.py", line 574, in close
    self._close()
RuntimeError: FiledesIO::write - write error in /scratch3/gal16b/split/39403/2022-04-14_110035_18.RACS.0748-43.ms/table.f1: Argument list too long

I am unsure what to make of this. I have rerun my pipeline on a smaller dataset that included this measurement set and found no issue. The specific error, Argument list too long, reads as though there was some interaction with a shell when flushing the buffers to disk, as if a large cp or rm command were being executed.

Would you happen to have any insight into this and the underlying behavior of the close and flush of a casacore table? Are there temporary files stored in, say, /dev/shm or the current working directory that are examined? I am at a total loss as to where else to look, and it is not clear to me whether this is a python-casacore issue, a casacore issue, or something else entirely.

rtobar commented 1 year ago

The error is coming from https://github.com/casacore/casacore/blob/5a8df94738bdc36be27e695d7b14fe949a1cc2df/casa/IO/FiledesIO.cc#L100-L104. This is a simple write(2) call which in principle shouldn't result in an E2BIG errno value. I suspect the underlying filesystem of /scratch3 (which one is it, do you know?) is complaining about something during the write, resulting in that non-standard error value for write.
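
For context, "Argument list too long" is the conventional message text for the errno value E2BIG, which is why the error can read like a shell problem even though it was raised by a plain write. A small sketch using only the Python standard library shows the mapping; `describe_errno` is a hypothetical helper name, and the exact message text depends on the platform's C library.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(errno.E2BIG)
print(name, "->", message)
```

On a typical Linux system the printed message should match the text at the end of the traceback, which suggests the string is just the decoded errno reported for the failed write rather than evidence of a shell command being run.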

tjgalvin commented 1 year ago

Thanks for the quick response!

It is a Lustre-backed file system, so all bets are off in understanding what is going on with it. I might raise it with the tech staff for this HPC then.

If you are satisfied that nothing funny is going on in casacore, feel free to close this issue.

Huge thanks again!
