JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package

best way not to lift #193

Open ExpandingMan opened 7 years ago

ExpandingMan commented 7 years ago

Suppose I don't want to lift. Is there something like, for instance,

@nolift f.(v)

Since both broadcast! and map are overridden, the only way of handling this currently, as far as I can tell, is to create a wrapper function that runs your inner function through a "manual" loop.

nalimilan commented 7 years ago

The other solution is to define methods like in https://github.com/JuliaStats/NullableArrays.jl/pull/185.

Can you elaborate on your use case for not lifting?

ExpandingMan commented 7 years ago

I think #185 is a good example.

Suppose I want to run through a dataframe, determine which rows are null, and then do something. For example

f(x) = isnull(x) ? false : (get(x) % 3 == 0)

Admittedly, at the moment I cannot think of a use case that could not be handled by lifting combined with isnull, provided isnull returned a Vector{Bool} (which it currently doesn't). But I think in some cases it's just easier for the user to write a function like the one above than to deal with isnull.

Just thinking out loud here

f.(x) .& isnull.(x)

wouldn't work even if isnull did return a Vector{Bool}, so I'm not seeing an alternative as simple as writing an f(::Nullable) and then doing f.(v). And that's the case where the return values are simple Bools; suppose they were more complicated objects.

One of the main use cases for null handling in which you really don't want to just return Nullable is to replace the nulls with samples from a distribution. However, I suppose in such a case you really have to write a "manual" loop anyway.

ararslan commented 7 years ago

This is one of my concerns with automatic lifting. I'm hoping that with the planned Nullable revamp in Julia 1.0, map and broadcast will not lift. IMO it doesn't make sense to use those functions on Nullables at all. But that's an aside, not to mention it doesn't help us in the short-term (0.5 and 0.6)...

To answer your question, to apply a function elementwise without lifting, you could use a comprehension, e.g. [f(x) for x in X]. This can't be applied in-place though.
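As a rough sketch of that approach, assuming Julia 0.5/0.6 (where Nullable is still in Base): f is the example function from earlier in this thread, while X, mask and out are illustrative names only. A manual loop into a preallocated array is shown as one possible stand-in for the in-place case.

```julia
# Sketch for Julia 0.5/0.6, where `Nullable` lives in Base.
# `f` is the example from earlier in the thread; `X`, `mask`, `out` are illustrative names.
f(x::Nullable) = isnull(x) ? false : (get(x) % 3 == 0)

# A plain Vector of Nullables; iterating it yields each Nullable unchanged.
X = [Nullable(3), Nullable{Int}(), Nullable(4)]

# Comprehension: f receives the Nullable itself, so nothing is lifted.
mask = [f(x) for x in X]

# Manual loop writing into a preallocated Bool array, for when the
# result has to go into existing storage.
out = similar(X, Bool)
for i in eachindex(X)
    out[i] = f(X[i])
end
```

Both mask and out end up as plain Bool arrays, so they can be used directly for indexing.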

nalimilan commented 7 years ago

In theory it's annoying that we don't provide a way to disable lifting, but in practice I'd like to see a real use case before marking this issue as something that needs to be fixed in the short term. As @ararslan noted, I hope the nullable rework will help us move beyond this.

ExpandingMan commented 7 years ago

Hm. I'm trying to think if there is some deeper reason why you should never lift. For the record, if isnull returns an AbstractVector{Bool}, to replace nulls you can do something like

v[find(isnull.(v))] .= rand(sum(isnull.(v)))

which is pretty damn elegant if you ask me.

However, I keep coming back to the example of "selecting" rows. In my original example, the task was "Let me select elements defined by f, excluding nulls". That seems pretty simple, but I still am not seeing a simple way of doing that. That is most definitely something I've had to do before and will have to do again. Is there some simple way of doing it that I can't see?

I'm a pretty big proponent of trying to get Nullable (or whatever the "data null type" is to be called) to behave as similarly to NaN as possible. For years and years I only ever had to worry about NaNs, and not only was it never a problem, I never even had to give it much thought. So the idea of always lifting does have a certain appeal to me. However, the mere fact that !(Nullable{T}() isa T) makes Nullables very different from NaNs, and I don't see any way of changing that.

Edit: By the way, I know this isn't really the appropriate place to ask, but are there still plans to split Nullable into two types (one for data where the default behavior is lifting, and another which doesn't get lifted by default)? Because that seemed to me like a pretty good idea.

ararslan commented 7 years ago

Yes, a general idea would be to have two separate null-like concepts, one like NA in DataArrays (modeled after its namesake in R and behaves like a scalar) and one like Option in Rust (behaves like a collection of 0 or 1 elements, similar to current Nullable but hopefully without the arithmetic).

There are querying frameworks for selecting rows, or you should be able to do something like dt[[f(x) for x in dt[:column]], :] where dt is a DataTable.

ExpandingMan commented 7 years ago

I take it the new NA would be a wrapper like Nullable so as to avoid the current type stability issues with DataArrays?

ararslan commented 7 years ago

No, it will be a Union as it is now, but there are planned optimizations for Unions in 1.0.