Open mcabbott opened 11 months ago
Looking back through the blame, I'm guessing the idea was to not materialize large datasets. The easiest path would be to change the docs, but that doesn't solve the OneHotArrays issue. Maybe https://github.com/JuliaML/MLUtils.jl/blob/af7ebeacdf5a5e6e94e0db55a5dc24835ef260e0/src/obsview.jl#L224-L227 is too general and `obsview` should be returning `ObsView` for more types?
Yes, I'm sure not materialising was the goal. However, some views are more useful than others. I wonder if the default should be something like this: `splitobs(ones(1,10); at=0.5)` makes two contiguous views, which are almost as good as Arrays, but `splitobs(ones(1,10); at=0.5, shuffle=true)` makes copies?
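To make the proposed default concrete, here is a minimal sketch in plain Julia. `split_sketch` is a hypothetical helper written only for illustration, not an MLUtils function, and it assumes observations are stored along the last dimension of a matrix. It returns views for the contiguous case and copies for the shuffled case.

```julia
using Random

# Hypothetical helper, NOT part of MLUtils: split the columns of a matrix in
# two, returning views for contiguous ranges and copies when shuffled.
function split_sketch(x::AbstractMatrix; at::Real=0.5, shuffle::Bool=false,
                      rng::AbstractRNG=Random.default_rng())
    n = size(x, 2)
    k = round(Int, at * n)
    if shuffle
        perm = randperm(rng, n)
        # Non-contiguous indices: materialise, so callers get plain Arrays.
        return x[:, perm[1:k]], x[:, perm[k+1:end]]
    else
        # Contiguous ranges: cheap views that share memory with x.
        return view(x, :, 1:k), view(x, :, k+1:n)
    end
end

x = ones(1, 10)
a, b = split_sketch(x; at=0.5)                 # two SubArrays backed by x
c, d = split_sketch(x; at=0.5, shuffle=true)   # two independent Matrices
```

With this rule the unshuffled split never allocates a copy of the data, while the shuffled split pays for one copy up front and hands back ordinary `Matrix` values.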
The OneHotArrays issue could be solved on that side, by changing what `view` does. Or more narrowly by changing what `obsview` does.
Regardless of the solution, the docstring should 100% be updated to mention that some type of view is returned, with `ObsView` being the default.
> The OneHotArrays issue could be solved on that side, by changing what `view` does
How straightforward do you think this would be? Since https://github.com/FluxML/OneHotArrays.jl/issues/40 is mostly about performance, another idea would be adding a kwarg which controls whether a view or copy is returned.
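As a rough sketch of the kwarg idea, assuming a `copy` keyword (the name is an assumption, and `select_obs` is a hypothetical helper, not an existing MLUtils function), the choice between a view and a copy could look like this:

```julia
# Hypothetical helper, NOT an MLUtils function: select observations (columns)
# and let the caller choose between a view and a materialised copy.
function select_obs(x::AbstractMatrix, idx; copy::Bool=false)
    return copy ? x[:, idx] : view(x, :, idx)
end

x = zeros(2, 4)
v = select_obs(x, [4, 1])              # view: shares memory with x
m = select_obs(x, [4, 1]; copy=true)   # copy: independent Matrix

v[1, 1] = 1.0   # writes through to x[1, 4]
m[2, 2] = 5.0   # leaves x untouched
```

One trade-off with this design is that the return type depends on the value of a Boolean keyword, which can make the result harder for the compiler to infer.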
It's surprising that `splitobs` and `DataLoader` make views, when they mention only `getobs` in their docstrings, which does not. This means that they do not preserve OneHotArrays, which is https://github.com/FluxML/OneHotArrays.jl/issues/40.
But more generally, perhaps copies are just safer?