Closed — ljgray closed this 1 year ago
This seems like a fine way to resolve this, but it looks like adding `dataset_id` to `_DIST_DSETS` the way `frac_lost` is would be sufficient. Is there a reason not to do that?
Yeah, that would resolve this issue. The deeper problem is that `dataset_id` is a unicode type, so if it's distributed it won't get converted to a bytestring in memh5 when saving to .h5, and will throw an error (since hdf5 can't handle unicode types). This approach is needed to let us make `dataset_id` a common dataset instead of a distributed one. It does seem like a slightly convoluted solution, but I can't come up with another way to fix the underlying issue unless we are able to convert distributed array types before saving.
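For context, the conversion itself is straightforward for in-memory numpy arrays; this is a minimal sketch (assuming ASCII-safe IDs, not the actual memh5 code) of why unicode arrays have to become fixed-width bytestrings before they can be written to HDF5:

```python
import numpy as np

# h5py cannot store numpy unicode ('U') arrays directly, so they must be
# converted to fixed-width bytestrings ('S') before writing.
ids_unicode = np.array(["abc123", "def456"])  # dtype('<U6')

# The kind of conversion applied to non-distributed datasets before writing:
ids_bytes = ids_unicode.astype("S")           # dtype('S6'), HDF5-safe

# Round-trip back to unicode after reading:
ids_back = ids_bytes.astype("U")

assert ids_unicode.dtype.kind == "U"
assert ids_bytes.dtype.kind == "S"
assert (ids_back == ids_unicode).all()
```

The issue in the thread is exactly that this `astype("S")` step happens for common datasets but is skipped for distributed ones.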
I see, yeah, I guess short of adding that functionality to `mpiarray` there isn't another obvious way to do this, and this is going to remain a convoluted structure anyway...
I guess ultimately I'm not clear why we can't just fix the restriction in the MPIArray output.
I guess I might be fundamentally misunderstanding the issue. There's nothing in our codebase that I can clearly see restricting us from converting distributed datasets in the same way we do with non-distributed ones, so I assumed something related to hdf5 was blocking us from doing that.
Or, do you mean doing the type conversion from unicode to bytestring in `MPIArray.to_file`?
@jrs65 This should implement the changes that we talked about on Friday now.
This PR adds `dataset_id` to the list of distributed datasets to load in `andata.CorrData`, removes the conversion from bytestring to unicode when loading datasets from file, and adds a property to `andata.CorrData` to access `dataset_id` as unicode.

This is a bit of a workaround since unicode/bytestring conversion isn't handled for distributed datasets in `caput.memh5`. I'll change this eventually.

Accompanies chime-experiment/ch_pipeline#175