Open AlecThomson opened 3 years ago
I would be surprised if something in `dask-mpi` was causing this directly. My first instinct would be to look at Python environments, specifically comparing what happens differently between using `dask-mpi` and `SLURMCluster`.
Perhaps a good first step would be to check `sys.executable` on clusters submitted by both methods to ensure it is the same.
Thanks, @jacobtomlinson. I added the line:
print('sys.exec', sys.executable)
to the worker function.
Using both `dask-mpi` and a `LocalCluster` I get:
sys.exec /group/askap/athomson/miniconda3/envs/spice/bin/python3.8
which is the conda env I'd expect to be called.
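To run the same check on every worker at once rather than inside a single task, something like the following could be used. This is only a sketch: `interpreter_info` and `compare_interpreters` are hypothetical helper names, and `client` is assumed to be an already-connected `distributed.Client`, whose `run` method calls a function on each worker and returns a dict keyed by worker address.

```python
import sys


def interpreter_info():
    """Return the interpreter path, prefix and version of the current process."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": sys.version.split()[0],
    }


def compare_interpreters(client):
    """Return {worker_address: info} for workers whose interpreter differs.

    `client` is assumed to be a connected distributed.Client (hypothetical usage).
    """
    local = interpreter_info()
    remote = client.run(interpreter_info)  # executes on every worker
    return {addr: info for addr, info in remote.items() if info != local}
```

An empty dict from `compare_interpreters(client)` would mean every worker reports the same interpreter as the process that created the client.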
Your example code works perfectly for me, with astropy = 4.3.1, dask_mpi = 2.21.0, dask = 2021.08.0, Python = 3.8.10 and Ubuntu 20.04.2 LTS. I get both the degrees and hour strings out as expected.
I will note that the line numbers for the errors do not correspond to the version of astropy that I have installed.
Thanks, @ste616. The same is true for me as well, actually. The MCVE also runs fine for me, which is part of my confusion. There might be some conflict between that and my full working script, but for the life of me I can't see what it is.
EDIT: I've also found that the above problem (with `u.hourangle`) persists even when the function is not delayed (but still using `dask-mpi`).
As an update, it looks like the issue extends to other parts of `astropy.units`. A script like:
from distributed import Client
from dask_mpi import initialize
from dask import delayed
from astropy.coordinates import Angle
import astropy.units as u
import time
import numpy as np

@delayed
def worker(freq):
    freq_arr = freq.to(u.Hz).value
    return freq_arr

def main():
    initialize(interface='ipogif0')
    client = Client()
    results = []
    for i in range(100):
        freq = np.arange(100) * u.Hz
        results.append(
            worker(freq)
        )
    futures = client.persist(results)
    outputs = [f.compute() for f in futures]
    print('outputs is',outputs)

if __name__ == "__main__":
    main()
Raises `astropy.units.core.UnitConversionError: 'Hz' (frequency) and 'Hz' (frequency) are not convertible`. Again, I should note that this MCVE doesn't reproduce this error, which occurs in my full script.
I can work around this by doing the unit conversion in `main`, e.g.:
from distributed import Client
from dask_mpi import initialize
from dask import delayed
from astropy.coordinates import Angle
import astropy.units as u
import time
import numpy as np

@delayed
def worker(freq):
    freq_arr = freq
    return freq_arr

def main():
    initialize(interface='ipogif0')
    client = Client()
    results = []
    for i in range(100):
        freq = np.arange(100) * u.Hz
        results.append(
            worker(freq.to(u.Hz).value)
        )
    futures = client.persist(results)
    outputs = [f.compute() for f in futures]
    print('outputs is',outputs)

if __name__ == "__main__":
    main()
The full traceback is:
```
Traceback (most recent call last):
  File "/group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 865, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
    return task.run(*args, **kwargs) # type: ignore
  File "/group/askap/athomson/repos/spiceracs/spiceracs/processSPICE.py", line 64, in frion_task
    return frion.main(
  File "/group/askap/athomson/repos/spiceracs/spiceracs/frion.py", line 218, in main
    updates = [f.compute() for f in futures]
  File "/group/askap/athomson/repos/spiceracs/spiceracs/frion.py", line 218, in
```
EDIT 2:
If I don't delay the function with the coordinate/hourangle issue, the same error occurs if I use `LocalCluster`. Delaying it allows it to work with `LocalCluster`. It fails in either case with `dask-mpi`.
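One difference between the failing `Hz` script and its workaround is that, in the failing version, the `Quantity` (with its unit attached) is serialized and sent to a worker before `.to(u.Hz)` is called, whereas the workaround ships a plain array. Whether serialization has anything to do with the error is not established in this thread; the sketch below is only a way to test that guess. `roundtrip_report` is a hypothetical helper, and it uses the standard `pickle` module as a stand-in for whatever serializer the cluster actually uses.

```python
import pickle


def roundtrip_report(obj):
    """Pickle and unpickle `obj`, then describe how the copy relates to it."""
    copy = pickle.loads(pickle.dumps(obj))
    try:
        equal = bool(copy == obj)
    except Exception:
        # Some types (e.g. arrays) do not reduce a comparison to one bool.
        equal = None
    return {
        "same_object": copy is obj,
        "equal": equal,
        "type": f"{type(copy).__module__}.{type(copy).__qualname__}",
    }
```

Calling `roundtrip_report(u.Hz)` in `main` and again inside the worker function, and comparing the two results, might show whether the unit seen by the worker still compares equal to the `u.Hz` imported there.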
The inconsistency is strange. Are you definitely using `2021.05.0` everywhere?
I'm pretty sure that's the case. I'm using a locally installed conda environment. As a test I added:
print("I'm in the {func} function!",'dask.__version__', dask.__version__)
print("I'm in the {func} function!",'dask.__file__', dask.__file__)
And I get:
I'm in the main function! dask.__version__ 2021.05.0
I'm in the main function! dask.__file__ /group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/dask/__init__.py
I'm in the worker function! dask.__version__ 2021.05.0
I'm in the worker function! dask.__file__ /group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/dask/__init__.py
And `distributed` too? (They should be pinned, but it's worth checking.)
Here's a test with `distributed`:
I'm in the main function! distributed.__version__ 2021.05.0
I'm in the main function! distributed.__file__ /group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/distributed/__init__.py
I'm in the worker function! distributed.__version__ 2021.05.0
I'm in the worker function! distributed.__file__ /group/askap/athomson/miniconda3/envs/spice/lib/python3.8/site-packages/distributed/__init__.py
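The per-package prints above can be folded into one helper that is evaluated both in `main` and on the workers. The following is a sketch under assumptions: `module_report` is a hypothetical name, the default package list is just the three packages discussed in this thread, and the helper is meant to be passed to `Client.run` on an already-connected client.

```python
import importlib


def module_report(names=("dask", "distributed", "astropy")):
    """Map each module name to (version, file), or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        report[name] = (
            getattr(mod, "__version__", None),
            getattr(mod, "__file__", None),
        )
    return report
```

With a connected client, `client.run(module_report)` returns one report per worker, which can be compared against `module_report()` evaluated in `main`. `Client.get_versions(check=True)` performs a related built-in version comparison across the scheduler, workers and client.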
As a small aside, I noticed I had an inconsistency in my module importing, i.e.
import distributed
vs
from dask import distributed
I corrected this to use only the former option, but the issue persists.
> I noticed I had an inconsistency in my module importing

The latter is preferred, but either will mostly be fine.

> Again, I should note that this MCVE doesn't reproduce this Error which occurs in my full script.

Without a reproducer, I'm afraid this will be hard for us to track down.
What happened: This may be an issue in astropy, so my apologies if this is in the wrong location. However, it appears to happen only when using `dask-mpi`.
I'm using `dask-mpi` to distribute a task that uses an `astropy.coordinates.Angle` object. When I try to convert to the `astropy.units.hourangle` format, I get the following error:
What you expected to happen: Using a `LocalCluster` and a `SLURMCluster` I do not get this error with otherwise identical code. Further, `astropy.coordinates.Angle` explicitly has an `hour` property (L161):
Further, if I do (see MCVE):
I get `True`! Something strange seems to be happening when I try to access the property itself. I'll note something similar happened with `coordinates.SkyCoord` and its `hms` property.
Minimal Complete Verifiable Example: This is as close to a minimal setup as my working script. Very frustratingly, the MCVE does not produce the same error. Hair pulling abounds.
Anything else we need to know?:
Environment: