UT-CHG / BET

Python package for data-consistent stochastic inverse and forward problems.
http://ut-chg.github.io/BET

TODO: fix parallel loading of sample sets #344

Closed mathematicalmichael closed 5 years ago

mathematicalmichael commented 5 years ago

related to #333 and perhaps others.

I think I found the source of the bug... line 140 in sample.py:

            os.path.dirname(file_name), "proc{}_0".format(
                os.path.basename(file_name)))):
        return load_sample_set_parallel(file_name, sample_set_name)

Probably should be proc0_{}. I'm still mid-investigation, but I wanted to put this down here.
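
For reference, a minimal sketch of what the corrected check would look like (the os.path.exists test and the 'test_save' name are assumptions for illustration, not a quote of sample.py):

import os

file_name = 'test_save'  # hypothetical base name, as used later in this thread
# parallel saves write one file per rank named "proc<rank>_<basename>",
# so the rank-0 marker should be built with "proc0_{}", not "proc{}_0"
proc0_file = os.path.join(os.path.dirname(file_name),
                          "proc0_{}".format(os.path.basename(file_name)))
if os.path.exists(proc0_file):
    print("parallel save detected:", proc0_file)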

mathematicalmichael commented 5 years ago

However, while that does trigger the conditional statement, the parallel loading fails because the number of processors no longer matches the number of items found by glob.


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-e685ff89c9b6> in <module>
----> 1 s_set = sample.load_sample_set('test_save')

~/work/BET/bet/sample.py in load_sample_set(file_name, sample_set_name, localize)
    140             os.path.dirname(file_name), "proc0_{}".format(
    141                 os.path.basename(file_name)))):
--> 142         return load_sample_set_parallel(file_name, sample_set_name)
    143 
    144     mdat = sio.loadmat(file_name)

~/work/BET/bet/sample.py in load_sample_set_parallel(file_name, sample_set_name)
    212         # otherwise gather the data from mdat and then scatter
    213         # among the processors and update mdat
--> 214         mdat_files_local = comm.scatter(mdat_files)
    215         mdat_local = [sio.loadmat(m) for m in mdat_files_local]
    216         mdat_list = comm.allgather(mdat_local)

mpi4py/MPI/Comm.pyx in mpi4py.MPI.Comm.scatter()

mpi4py/MPI/msgpickle.pxi in mpi4py.MPI.PyMPI_scatter()

mpi4py/MPI/msgpickle.pxi in mpi4py.MPI.Pickle.dumpv()

ValueError: expecting 1 items, got 2

mathematicalmichael commented 5 years ago

I believe the expected behavior (correct me if I'm wrong) is this: if I send work off to multiple processors and save at the end, the chunks get saved per processor; then, if I load them back in serial, the conditional statement that is now being triggered should return a fully "merged" sample set assembled from all the processors' files, rather than my having to manually stitch the sample sets back together with merge.

Right now it's trying to scatter 2 objects to a single processor and getting a ValueError back. I will use try/except ValueError to bypass it... fingers crossed.
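
Roughly what I have in mind (a sketch only, reusing the names from the traceback and BET's comm wrapper; not the final fix):

import glob
import scipy.io as sio
from bet.Comm import comm  # BET's MPI communicator wrapper

# hypothetical: the per-rank files found for a given base name
mdat_files = sorted(glob.glob("proc*_test_save*"))

try:
    # works when there is one file (or chunk) per rank
    local = comm.scatter(mdat_files)
    mdat_files_local = [local] if isinstance(local, str) else list(local)
except ValueError:
    # serial load of files written by multiple ranks: scatter raises
    # "expecting 1 items, got 2", so keep every file on this rank instead
    mdat_files_local = list(mdat_files)

mdat_local = [sio.loadmat(m) for m in mdat_files_local]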

mathematicalmichael commented 5 years ago

and now that I've managed to sneak around the errors, I'm realizing that load_sample_set_parallel returns nothing. I think once I fix that we'll be good to go.

mathematicalmichael commented 5 years ago

Changes made: append to the list instead of extending it, and return loaded_sample_set. Now load_sample_set in serial will go grab the appropriate parallel files and merge them.

I have only thus far checked for correct merging of _values and will ensure all other attributes are merged correctly before submitting a PR (frankly it might just make it in as part of another PR I'm planning).

How can we write tests to ensure this behavior is consistent/correct?
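
One possible shape for such a test (a sketch only; the helper script name, the two-rank count, and the use of mpiexec via subprocess are my assumptions): do the save under MPI, then load in serial and check the merged set.

import subprocess
import bet.sample as sample

def test_parallel_save_then_serial_load():
    # hypothetical helper script: builds sample.sample_set(2), generates 12
    # samples, and calls sample.save_sample_set(sset, 'test_save') under 2 ranks
    subprocess.check_call(["mpiexec", "-n", "2", "python", "save_in_parallel.py"])

    loaded = sample.load_sample_set('test_save')
    assert loaded is not None                    # load_sample_set_parallel used to return None
    assert loaded.get_dim() == 2                 # _dim should come back as a plain 2, not array(2)
    # assuming generate_samples(12) means 12 samples in total across ranks
    assert loaded.get_values().shape == (12, 2)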

mathematicalmichael commented 5 years ago

Run the top in parallel, bottom in serial.

import bet.sample as sample
import numpy as np

sset = sample.sample_set(2)
sset.set_distribution()
# np.random.seed(0)
sset.generate_samples(12)
sset.get_values()
print(sset.get_values_local())
sample.save_sample_set(sset, 'test_save')

s_set = sample.load_sample_set('test_save')
# s_set.local_to_global()
print(s_set.get_values())

mathematicalmichael commented 5 years ago

Vectors seem to load and merge appropriately, but other properties do not. For example, _dim becomes array(2) instead of the original 2... seems like an easy fix.
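
The array(2) comes from scipy.io round-tripping scalars as arrays; a small sketch of the unwrapping I have in mind (made-up key and file names):

import numpy as np
import scipy.io as sio

sio.savemat('dim_demo.mat', {'dim': 2})
mdat = sio.loadmat('dim_demo.mat')
dim = int(np.squeeze(mdat['dim']))   # loadmat gives a (1, 1) array; recover the plain int 2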

Then I have to go deal with the same issue for discretizations as well. There are a few attributes I introduced that are not vectors and will need to be copied over (likely just saved and inherited to/from the root worker).

mathematicalmichael commented 5 years ago

I have to save a scipy.stats distribution object, and I don't think that will quite work with the dictionary saving since it only takes strings/arrays.

I see that new_mdat[sample_set_name + '_sample_set_type'] = str(type(save_set)).split("'")[1] kind of handles this by just saving the string for re-use later on when loading. I think I'll have to do something similar when re-instantiating the distribution object.
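
Roughly what I have in mind (hypothetical helper names, not BET code): save only the distribution's name plus its keyword arguments as strings/arrays, then rebuild the frozen scipy.stats object when loading.

import scipy.stats as stats

def pack_distribution(dist_name, kwds):
    # dist_name: e.g. 'norm'; kwds: e.g. {'loc': 0.0, 'scale': 2.0}
    # only the name and the kwds go into the save dict, never the object itself
    packed = {'dist': dist_name}
    for key, val in kwds.items():
        packed['dist_kwd_' + key] = val
    return packed

def unpack_distribution(packed):
    # re-instantiate the frozen distribution from the saved pieces
    kwds = {k[len('dist_kwd_'):]: float(v)
            for k, v in packed.items() if k.startswith('dist_kwd_')}
    return getattr(stats, str(packed['dist']))(**kwds)

frozen = unpack_distribution(pack_distribution('norm', {'loc': 0.0, 'scale': 2.0}))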

mathematicalmichael commented 5 years ago

So far it does not seem like the fix to the proc typo addressed #333. It is still failing for n > 2.

mathematicalmichael commented 5 years ago

TODO: create an 'extract information' function that plays nicely with all the various distributions we may have to load/save. I have had to re-use that logic twice already (serial and parallel loading of sample sets) and will need it again for predicted/observed in _setup, so it's worth making a function for it. Perhaps part of this is also a function that appropriately unpacks dictionary items into new key-entries for the mdat file with the format sample_set_name + "_" + key and, when unpacking, strips off all unnecessary characters. This gets used for the keys in kwds for distributions, and will also get used for setup.
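
A rough sketch of the unpacking half described here (hypothetical function name; assumes the values come straight out of sio.loadmat):

import numpy as np

def unpack_prefixed(mdat, prefix):
    # collect mdat entries named prefix + "_" + key back into a dict,
    # stripping the array/whitespace wrapping that scipy.io adds on load
    kwds = {}
    for name, val in mdat.items():
        if not name.startswith(prefix + "_"):
            continue
        key = name[len(prefix) + 1:]
        val = np.squeeze(val)               # drop the (1, 1) wrapping
        if val.dtype.kind in ('U', 'S'):    # saved strings come back padded
            val = str(val).strip()
        kwds[key] = val
    return kwds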

mathematicalmichael commented 5 years ago

addressed by https://github.com/UT-CHG/BET/pull/355, closing now