bowman-lab / enspara

Modeling molecular ensembles with scalable data structures and parallel computing
https://enspara.readthedocs.io
GNU General Public License v3.0
33 stars 16 forks source link

Fix tests jan 2022 #215

Closed justinrporter closed 7 months ago

justinrporter commented 2 years ago

This PR fixes a number of tests that were broken under newer versions of our various libraries as well as in python 3.9.

@lgsmith could you run the tests in this branch and verify?

Justin-J-Miller commented 7 months ago

Hi @justinrporter,

Sorry for the delay- I started poking at this today. As of numpy 1.24.0 ragged array creation without specifying dtype=object results in a ValueError.

E.g. In a fresh environment with python 3.9 (3.9.19) and numpy 1.26.4:

import numpy as np
from enspara import ra

a = [np.arange(i) for i in range(1,6)]
b = ra.RaggedArray(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/seas-home/justinjm/software/enspara/enspara/ra/ra.py", line 502, in __init__
    array = np.array(list(array))
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.

Converting the RA creation to: np.array(list(array),dtype=object) will let this proceed. The nested arrays retain their dtype (e.g. int64 in this case), but the entire array becomes an object array. I imagine switching all array creations to np.array(___, dtype=object) could fix our issues, but there will be loads of other sanity checks that need to be rewritten.

I'm not familiar enough with the rest of enspara's ra handling- do we think making everything object arrays will result in a huge performance loss? @lgsmith and I were chatting about this earlier- do you know how much performance we gain by putting this into a np array as opposed to leaving it as a list?

Less immediately related, in the above environment I get the following after running nosetests enspara: 2 skips, 66 errors, and 2 failures. ~33 errors are the above ValueErrors. ~33 errors are deprecations of np.int and np.bool (suggested to replace with int or bool unless wishing to specify precision (int) or scalar type (bool, alternatively use np.bool_. I've got a version of this commit now with all instances of np.bool replaced with bool and all instances of np.int replaced with int.

Happy to push more on this but wanted to see what the hive mind had to say about handling these two errors.

Justin-J-Miller commented 7 months ago

To the original post though, if I downgrade numpy to 1.23 I can run the tests with only 3 failures, all related to MPI.

I haven't dug too deep into it, but the first is a AssertionError: Arrays are not equal, and the latter two are files don't exist errors.

Happy to dig deeper!

lgsmith commented 7 months ago

I’m inferring from this that the switches away from arrays called on ragged list comps is not in this pr. I think object arrays may still be better than lists of arrays based on my reading around online, though both should be similar. I have a branch where I fixed that issue for myself…happy to turn it into a pr. Should be an easy merge it’s only a change in a few places.

On Fri, Apr 5, 2024 at 9:56 AM Justin Miller @.***> wrote:

To the original post though, if I downgrade numpy to 1.23 I can run the tests with only 3 failures, all related to MPI.

I haven't dug too deep into it, but the first is a AssertionError: Arrays are not equal, and the latter two are files don't exist errors.

Happy to dig deeper!

— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2040020073, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYLLZU6SWQBQWO2OSYLY323RLAVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGAYDEMBQG4ZQ . You are receiving this because you were mentioned.Message ID: @.***>

Justin-J-Miller commented 7 months ago

@lgsmith Correct- there are no changes to ra in this PR.

Would be great to merge the two. I think there are is still an instance of arrays called on ragged lists in the branch you'd linked me the other day (https://github.com/lgsmith/enspara/tree/fix-tm-bug). Traceback from running the tests on your branch with python 3.10, numpy 1.26:

Traceback (most recent call last):
  File "/home/jjmiller/test/enspara/enspara/test/test_ra.py", line 204, in test_RaggedArray_negative_slicing
    assert_array_equal(a[:,:-1].lengths, np.array([9, 4, 4]))
  File "/home/jjmiller/test/enspara/enspara/ra/ra.py", line 608, in __getitem__
    iis, new_lengths  = _get_iis_from_slices(
  File "/home/jjmiller/test/enspara/enspara/ra/ra.py", line 452, in _get_iis_from_slices
    iis_2d = np.array(
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Fixing would close #219

justinrporter commented 7 months ago

do we think making everything object arrays will result in a huge performance loss? @lgsmith and I were chatting about this earlier- do you know how much performance we gain by putting this into a np array as opposed to leaving it as a list?

I'm not 100% sure I understand what's going on here, but in the interest of giving you guys some thoughts, here we go:

The main time that a performance gain is realized by having the "backing array" (ie the internal representation of the ra) being a single flat array is when the non-ragged dimension is very large and the ragged dimension is very small. Imagine a RA with thousands of rows, each 1-10 elements long. Doing the iteration in pure python is slow, but iteration through the concatenated backing array is fast.

Taking a step back, won't doing this break most of the underlying code on the ragged array? Because it keeps an internal representation of which row each column in the backing array belongs to and re-writes stuff as appropriate... @mizimmer90 may also have thoughts/opinions.

Also also, did this branch never get merged? It's just been sitting around for 28 months?

mizimmer90 commented 7 months ago

In this case, the input ragged array is actually duplicated and stored as a numpy array in addition to our format of a single 1D array. We did this as a quick way to make use of numpys fancy indexing in the first dimension if a second dimension of indices is not specified. I think making this an object should be okay for slicing but we'd need to make sure it doesn't mess with the dtypes of the rest of the values stored (I don't think it would but just want to caveat!).

Also, we don't necessarily need to store this duplicate as a numpy array at all if we add some extra logic for fancy indexing in a single dimension with lists!

Justin-J-Miller commented 7 months ago

Got it- I'd vote we merge this branch to bring us up to numpy<1.24, python < 3.10.

@lgsmith would you like me to poke at the ra fixes above or do you want to make the extra changes to your branch?

lgsmith commented 7 months ago

I’m happy to do it. I think I’d like to check up on the dtype behavior too, but I found that the dtype of the arrays within an object array is preserved when I was playing around with this before.

On Mon, Apr 8, 2024 at 6:43 AM Justin Miller @.***> wrote:

Got it- I'd vote we merge this branch to bring us up to numpy<1.24, python < 3.10.

@lgsmith https://github.com/lgsmith would you like me to poke at the ra fixes above or do you want to make the extra changes to your branch?

— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2042422745, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYPWQLXDD5WW65B7PRDY4JYD3AVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGI2DEMRXGQ2Q . You are receiving this because you were mentioned.Message ID: @.***>

lgsmith commented 7 months ago

The MPI tests are still broken for me. I don't really know much about MPI/mpi4py, and the tests actually hang without producing error msgs (they are using processor, but I think it's spinning fruitlessly because I left it running for 10 minutes and it never got through that one test). If anyone has any insight on how to deal with MPI issues that'd be helpful.

Justin-J-Miller commented 7 months ago

This code works for me in a clean test environment with python 3.9 and 3.10. It's worth noting that in the past ~30h installing tables with conda/mamba seems to miss the binaries so I haven't been able to test in broader environments.

I can try poking at the mpi tests tomorrow. I'm fine with merging this in and making the MPI tests a separate issue/pull if that works for others!

lgsmith commented 7 months ago

Fine with me.

On Wed, Apr 10, 2024 at 2:27 PM Justin Miller @.***> wrote:

This code works for me in a clean test environment with python 3.9 and 3.10. It's worth noting that in the past ~30h installing tables with conda/mamba seems to miss the binaries so I haven't been able to test in broader environments.

I can try poking at the mpi tests tomorrow. I'm fine with merging this in and making the MPI tests a separate issue/pull if that works for others!

— Reply to this email directly, view it on GitHub https://github.com/bowman-lab/enspara/pull/215#issuecomment-2048197713, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABAZVYOUAKSLBTIM3GLJ763Y4WACFAVCNFSM5NVP5FHKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUHAYTSNZXGEZQ . You are receiving this because you were mentioned.Message ID: @.***>