Dataset_id (1/2): properly load distributed dataset_id dataset

chime-experiment / ch_util

CHIME utilities

https://chime-experiment.github.io/ch_util

MIT License

3 stars 3 forks source link

Dataset_id (1/2): properly load distributed dataset_id dataset #53

Closed ljgray closed 1 year ago

ljgray commented 1 year ago

This PR adds dataset_id to the list of distributed datasets to load in andata.CorrData, removes the conversion from bytestring to unicode when loading datasets from file, and adds a property to andata.CorrData to access dataset_id in unicode.

This is a bit of a workaround since unicode/bytestring conversion isn't handled for distributed datasets in caput.memh5. I'll change this eventually.

Accompanies chime-experiment/ch_pipeline#175

ljgray commented 1 year ago

This seems like a fine way to resolve this, but it looks like adding "dataset_id" to _DIST_DSETS the way "frac_lost" is would be sufficient. Is there a reason not to do that?

Yeah, that would resolve this issue. The deeper problem is that dataset_id is a unicode type, so if it's distributed it won't get converted to a bytestring in memh5 when saving to .h5 and will throw an error (since hdf5 can't handle unicode types). This approach is needed to let us make dataset_id a common dataset instead of a distributed one. It does seem like a slightly convoluted solution, but I can't come up with another way to fix the underlying issue unless we are able to convert distributed array types before saving.

tristpinsm commented 1 year ago

I see, yeah I guess short of adding that functionality to mpiarray there isn't another obvious way to do this, and this is going to remain a convoluted structure anyways...

jrs65 commented 1 year ago

I guess ultimately I'm not clear why we can't just fix the restriction in the MPIArray output.

ljgray commented 1 year ago

I guess ultimately I'm not clear why we can't just fix the restriction in the MPIArray output.

I guess I might be fundamentally misunderstanding the issue. There's nothing in our codebase that I can clearly see restricting us from just converting distributed datasets in the same way as we do with non-distribued ones, so I assumed it was something related to hdf5 that blocked us from doing that.

Or, do you mean doing the type conversion from unicode to bytestring in MPIArray.to_file?

ljgray commented 1 year ago

@jrs65 This should implement the changes that we talked about on Friday now.