Closed justinrporter closed 7 months ago
Hi @justinrporter,
Sorry for the delay- I started poking at this today. As of numpy 1.24.0 ragged array creation without specifying dtype=object
results in a ValueError
.
E.g. In a fresh environment with python 3.9 (3.9.19) and numpy 1.26.4:
import numpy as np
from enspara import ra
a = [np.arange(i) for i in range(1,6)]
b = ra.RaggedArray(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/export/seas-home/justinjm/software/enspara/enspara/ra/ra.py", line 502, in __init__
array = np.array(list(array))
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
Converting the RA creation to: np.array(list(array),dtype=object)
will let this proceed. The nested arrays retain their dtype (e.g. int64 in this case), but the entire array becomes an object array. I imagine switching all array creations to np.array(___, dtype=object)
could fix our issues, but there will be loads of other sanity checks that need to be rewritten.
I'm not familiar enough with the rest of enspara's ra handling- do we think making everything object arrays will result in a huge performance loss? @lgsmith and I were chatting about this earlier- do you know how much performance we gain by putting this into a np array as opposed to leaving it as a list?
Less immediately related, in the above environment I get the following after running nosetests enspara
: 2 skips, 66 errors, and 2 failures.
~33 errors are the above ValueErrors.
~33 errors are deprecations of np.int and np.bool (suggested to replace with int
or bool
unless wishing to specify precision (int) or scalar type (bool, alternatively use np.bool_
. I've got a version of this commit now with all instances of np.bool
replaced with bool
and all instances of np.int
replaced with int
.
Happy to push more on this but wanted to see what the hive mind had to say about handling these two errors.
To the original post though, if I downgrade numpy to 1.23 I can run the tests with only 3 failures, all related to MPI.
I haven't dug too deep into it, but the first is a AssertionError: Arrays are not equal
, and the latter two are files don't exist errors.
Happy to dig deeper!
I’m inferring from this that the switches away from arrays called on ragged list comps is not in this pr. I think object arrays may still be better than lists of arrays based on my reading around online, though both should be similar. I have a branch where I fixed that issue for myself…happy to turn it into a pr. Should be an easy merge it’s only a change in a few places.
On Fri, Apr 5, 2024 at 9:56 AM Justin Miller @.***> wrote:
To the original post though, if I downgrade numpy to 1.23 I can run the tests with only 3 failures, all related to MPI.
I haven't dug too deep into it, but the first is a AssertionError: Arrays are not equal, and the latter two are files don't exist errors.
Happy to dig deeper!
— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2040020073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYLLZU6SWQBQWO2OSYLY323RLAVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGAYDEMBQG4ZQ . You are receiving this because you were mentioned.Message ID: @.***>
@lgsmith Correct- there are no changes to ra in this PR.
Would be great to merge the two. I think there are is still an instance of arrays called on ragged lists in the branch you'd linked me the other day (https://github.com/lgsmith/enspara/tree/fix-tm-bug). Traceback from running the tests on your branch with python 3.10, numpy 1.26:
Traceback (most recent call last):
File "/home/jjmiller/test/enspara/enspara/test/test_ra.py", line 204, in test_RaggedArray_negative_slicing
assert_array_equal(a[:,:-1].lengths, np.array([9, 4, 4]))
File "/home/jjmiller/test/enspara/enspara/ra/ra.py", line 608, in __getitem__
iis, new_lengths = _get_iis_from_slices(
File "/home/jjmiller/test/enspara/enspara/ra/ra.py", line 452, in _get_iis_from_slices
iis_2d = np.array(
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
Fixing would close #219
do we think making everything object arrays will result in a huge performance loss? @lgsmith and I were chatting about this earlier- do you know how much performance we gain by putting this into a np array as opposed to leaving it as a list?
I'm not 100% sure I understand what's going on here, but in the interest of giving you guys some thoughts, here we go:
The main time that a performance gain is realized by having the "backing array" (ie the internal representation of the ra) being a single flat array is when the non-ragged dimension is very large and the ragged dimension is very small. Imagine a RA with thousands of rows, each 1-10 elements long. Doing the iteration in pure python is slow, but iteration through the concatenated backing array is fast.
Taking a step back, won't doing this break most of the underlying code on the ragged array? Because it keeps an internal representation of which row each column in the backing array belongs to and re-writes stuff as appropriate... @mizimmer90 may also have thoughts/opinions.
Also also, did this branch never get merged? It's just been sitting around for 28 months?
In this case, the input ragged array is actually duplicated and stored as a numpy array in addition to our format of a single 1D array. We did this as a quick way to make use of numpys fancy indexing in the first dimension if a second dimension of indices is not specified. I think making this an object should be okay for slicing but we'd need to make sure it doesn't mess with the dtypes of the rest of the values stored (I don't think it would but just want to caveat!).
Also, we don't necessarily need to store this duplicate as a numpy array at all if we add some extra logic for fancy indexing in a single dimension with lists!
Got it- I'd vote we merge this branch to bring us up to numpy<1.24, python < 3.10.
@lgsmith would you like me to poke at the ra fixes above or do you want to make the extra changes to your branch?
I’m happy to do it. I think I’d like to check up on the dtype behavior too, but I found that the dtype of the arrays within an object array is preserved when I was playing around with this before.
On Mon, Apr 8, 2024 at 6:43 AM Justin Miller @.***> wrote:
Got it- I'd vote we merge this branch to bring us up to numpy<1.24, python < 3.10.
@lgsmith https://github.com/lgsmith would you like me to poke at the ra fixes above or do you want to make the extra changes to your branch?
— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2042422745, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYPWQLXDD5WW65B7PRDY4JYD3AVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGI2DEMRXGQ2Q . You are receiving this because you were mentioned.Message ID: @.***>
The MPI tests are still broken for me. I don't really know much about MPI/mpi4py, and the tests actually hang without producing error msgs (they are using processor, but I think it's spinning fruitlessly because I left it running for 10 minutes and it never got through that one test). If anyone has any insight on how to deal with MPI issues that'd be helpful.
This code works for me in a clean test environment with python 3.9 and 3.10. It's worth noting that in the past ~30h installing tables
with conda/mamba seems to miss the binaries so I haven't been able to test in broader environments.
I can try poking at the mpi tests tomorrow. I'm fine with merging this in and making the MPI tests a separate issue/pull if that works for others!
Fine with me.
On Wed, Apr 10, 2024 at 2:27 PM Justin Miller @.***> wrote:
This code works for me in a clean test environment with python 3.9 and 3.10. It's worth noting that in the past ~30h installing tables with conda/mamba seems to miss the binaries so I haven't been able to test in broader environments.
I can try poking at the mpi tests tomorrow. I'm fine with merging this in and making the MPI tests a separate issue/pull if that works for others!
— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2048197713, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYOUAKSLBTIM3GLJ763Y4WACFAVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUHAYTSNZXGEZQ . You are receiving this because you were mentioned.Message ID: @.***>
This PR fixes a number of tests that were broken under newer versions of our various libraries as well as in python 3.9.
@lgsmith could you run the tests in this branch and verify?
md.load
(bug is now fixed inmaster
but not released)implied_timescales.infer_timestep
that resulted from MDTraj changing the error type it raises when a file is not a valid type formd.load
fromValueError
toOSError
(we now catch both)setup.py
to python 3.9exposons
code to provide more information when aRecursionError
is reached bytop.select
. I only encountered this error in testing, I couldn't reproduce it outside of a test runner (both nose and pytest), so I'm not digging it too much more.np.float
tofloat
). In numpy 1.20,np.float
will no longer be an alias for the builtinfloat
so we are just retaining old behavior with this.random_state
toAffinityPropagation
which caused test reversions. We now userandom_state=0
to maintain results that match the publication.scikit-learn>=0.23.3
(see above)DummyComm
anddummy_mpi4py
objects to handle running the tests without mpi4py installed. Some tests still error out, though.