enram / vptstools

Python library to transfer and convert vertical profile time series data
https://enram.github.io/vptstools/
MIT License
3 stars 1 forks source link

bug in multiprocessing `vptstools.vpts.vpts`, transfer of file from parent to child process fails? #80

Closed PietrH closed 2 months ago

PietrH commented 2 months ago

On a recent run on the HPC I noticed a bunch of Python errors in my error logs. After some searching I believe these might be due to a bug in the multiprocessing implementation of vptstools.vpts. This code was running on Tier-1 supercomputer infrastructure, so it's all a bit different to what I'm used to on a local system, sorry if this makes this more difficult to diagnose.

This is what I'm seeing (just a sample, the log file is too big for Github):

2024-06-06 01:34:52.181265 [WARN] skipping due to Python error: Can't synchronously read data (inflate() failed)
2024-06-06 02:06:15.482963 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/fiika_vpts_20210402/fiika_vp_20210402T092500Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 02:06:52.060801 [WARN] skipping due to Python error: Unable to synchronously open file (truncated file: eof = 24576, sblock->base_addr = 0, stored_eof = 25453)
2024-06-06 02:26:11.045027 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/fikor_vpts_20221123/fikor_vp_20221123T221500Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 02:39:54.717665 [WARN] skipping due to Python error: Unable to get group info (wrong B-tree signature)
2024-06-06 03:27:26.794121 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/fivan_vpts_20200308/fivan_vp_20200308T023000Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 03:35:33.765447 [WARN] skipping due to Python error: 'Unable to synchronously open object (bad object header version number)'
2024-06-06 03:35:36.416813 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/fivih_vpts_20221218/fivih_vp_20221218T155500Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 04:09:23.550175 [WARN] skipping due to Python error: Can't synchronously read data (inflate() failed)
2024-06-06 04:09:26.415044 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/frabb_vpts_20210120/frabb_vp_20210120T170000Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 04:31:55.885879 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/fraja_vpts_20201127/fraja_vp_20201127T050500Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 04:40:13.225179 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/frale_vpts_20230209/frale_vp_20230209T031000Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 04:41:53.070725 [WARN] skipping due to Python error: Unable to synchronously open file (truncated file: eof = 24576, sblock->base_addr = 0, stored_eof = 25504)
2024-06-06 04:47:43.273036 [WARN] skipping due to Python error: 'NoneType' object is not subscriptable

...

2024-06-06 08:55:14.361207 [WARN] skipping due to Python error: [Errno 2] Unable to synchronously open file (unable to open file: name = '/dodrio/scratch/projects/starting_2024_022/vsc47414/data/scratch/frniz_vpts_20210219/frniz_vp_20210219T204000Z_0xb.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
2024-06-06 09:12:44.435226 [WARN] skipping due to Python error: Can't synchronously read data (wrong B-tree signature)

There is thousands of these in a large run.

I'm passing python code via reticulate, this is what I'm doing:

from pathlib import Path
from vptstools.vpts import vpts

# Get all HDF5 files within the data directory
file_paths = sorted(Path("{scratch_subdir}").rglob("*.h5"))
df_vpts = vpts(file_paths)

The code is running within the worker framework: https://worker.readthedocs.io/en/latest/ so the logs above are actually from different parallel executions based on a csv file with a list of parameters that are being fed via the CLI. As you can tell, there are also some instances where Python is trying to access the same file in parallel, that was my bad, and has been fixed. It's especially the wrong B-tree signature, No such file or directory and inflate() that worry me.

Originally I believed this might be due to corrupted hdf5 files, but now I believe it might be due to a bug in vptstools due to this comment: https://github.com/vaexio/vaex/issues/625#issuecomment-1007414088, the whole thread is worth reading: https://github.com/vaexio/vaex/issues/625

If you need more information on what exactly I was running, let me know and I'll send you a copy of all the tooling, input files, full logs, etc.

In any case, I did check one (1) of the output files where I was seeing these errors for the input files, and it seems fine, and is checksum identical to the file I can generate locally (without Python errors). So I'm not too alarmed. But I'd still love a second opinion, and it would be a lot of work to actually check all of the files I have these python warnings for in the log.

I'm trying to keep an eye on it for future runs as well!

PietrH commented 2 months ago

These errors have become much more rare now I'm taking more care not to write to the same file with multiple workers, but I'm still getting synchronicity errors.

2024-06-11 11:59:33.668407 [WARN] skipping due to Python error: Unable to synchronously open file (truncated file: eof = 12288, sblock->base_addr = 0, stored_eof = 25694)
Warning message:
In py_run_string_impl(code, local, convert) :
  OSError: Unable to synchronously open file (truncated file: eof = 12288, sblock->base_addr = 0, stored_eof = 25694)
Run `reticulate::py_last_error()` for details.
2024-06-11 12:04:25.794271 [WARN] skipping due to Python error: Unable to synchronously open file (truncated file: eof = 4096, sblock->base_addr = 0, stored_eof = 25432)

Maybe these are unrelated?

On my end, there should never be multiple workers trying to open the same file at the same time, with the current workflow, every hdf5 file is only opened once in the whole run.

PietrH commented 2 months ago

On second viewing, this error seems to originate from the OS/filesystem, so probably unrelated to vptstools

In that case, this was fixed by me removing overlapping date intervals/file sequences in my worker input data file. And this issue can be closed.