Rework "NA" (i.e., null, invalid) handling in PropertyViews

arthurp commented 3 years ago

Currently handling of Arrow invalid values (also called nulls and NAs) in PropertyViews, is ad hoc and probably not what people would expect. Currently reading the valid bit works (IsValid), but the valid bits are never updated even when the user writes to the property data.

We don't really expect to use NAs much at all and there are lots of tools to eliminate NAs (e.g., in pandas). So my approach would be to remove all support for NAs and add checks to fail as early as possible if an NA is encountered. Arrow has specific support for arrays that never contain NA; they are represented with no null_bitmap. We should be able to use that to eliminate many dangerous cases.

The main reason we may need to do this is that in the python ecosystem the arrow arrays we use may originate from other libraries that has NAs (like pandas) and those NAs will be maintained by pyarrow. So we need to be aware of them and provide useful errors so users know to eliminate NAs.

danielmawhirter commented 3 years ago

I'm definitely open to improvements here, as the current state isn't the greatest. A couple interesting notes:

Many of the analytics applications expect every index to be valid, so PropertyViews reflect that by treating null values as default-initialized. This is a semantics error, but we currently ignore it to avoid a re-work of those applications. For testing purposes, changing the null check to an assert() in Properties.h will show failing tests for every application that makes this assumption.
Querying is a different story, and while it's not meaningful in the context of this repo necessarily, we do already fail fast for improper null accesses in that domain.
Writing into arrow arrays / parquet files in a reasonable way is a challenge. The simple case of writing and updating the null bitmask isn't too complicated in serial (there's some work to be done for concurrency). The more complicated case shows up in String, or any non-fixed-size datatype, where dynamic allocations would be required for writes.

Because of point 1, we've already been able to load and access data with nulls while preserving the assumption of validity in analytics. But point 3 still leaves significant challenges.

Do you have an overall vision of how to address this?

arthurp commented 3 years ago

To be clear, nothing I'm talking about here will affect PropertyFileGraph. So dynamically typed access or access from Python will be unchanged. This issue is only with the PropertyGraph interface.

My plan was simply to remove IsValid and make it an error to create a PropertyView on any arrow array that has null count > 0. Since PropertyView is only a C++ programming utility, not actually a fundamental part of the system (it's not used for accesses from Python for instance), I don't think this will be an issue.

BTW, I am concerned about the "write to Arrow Array thing". But I'm not trying to address that here, and I'm leaving that to the RDG people.

danielmawhirter commented 3 years ago

It would induce substantial changes to distributed because the Partitioner uses PropertyView on arrays with nulls. That's on the enterprise repo as well though, so I don't have a good answer.

roshandathathri commented 3 years ago

My plan was simply to remove IsValid and make it an error to create a PropertyView on any arrow array that has null count > 0.

This would break the partitioner. Properties are expected to be sparse, so I don't understand why it is an error to create a properties array with nulls?

arthurp commented 3 years ago

The issue with nulls in PropertyViews is that they only work correctly if the PropertyView is not mutated (or at least the sparsity doesn't change). PropertyView doesn't update the null bitmap when writes occur. This is dangerously inconsistent I think.

What if we split PropertyView in to two parts: PropertyView which supports nulls as it does now, but will no longer allow writes (it will return const references); and MutablePropertyView which allows writes (non-const references), but returns an error if the property is sparse. This is a very preliminary plan. I'm not sure how this would be handled in PropertyGraph, yet. Maybe separate accessors for sparse properties that always return const references.

In general, will this approach work for the partitioner? Does it need to mutate the data in it's sparse PropertyViews?

danielmawhirter commented 3 years ago

It might be reasonable to assert no nulls in reference GetValue(size_t i) as a stop-gap. Basically, views are currently being used as random access builders, where arrow only supports serial builders, right? I have some unfinished work to provide random access builders (which will support nulls), which if fully integrated would remove the need for Views to be writable.

arthurp commented 3 years ago

I have some unfinished work to provide random access builders (which will support nulls), which if fully integrated would remove the need for Views to be writable.

Great.This issue can wait until that stuff lands. I mainly created this issue to make sure we didn't forget that there was a potential problem that needs to be fixed eventually.

roshandathathri commented 3 years ago

Other than PODPropertyView, other views are read-only views. I'm strongly in favor of limiting all these views, including PODPropertyView, to read-only views. For read-write use csaes, @danielmawhirter has some wrappers around arrow Builders. We should be using those for read-write use cases. @ddn0 What do you think?

The partitioner uses PropertyView for read-only views because it handles string and boolean types in addition to POD types. So limiting to PropertyViews to read-only does not affect it.

danielmawhirter commented 3 years ago

Looks like they're in this repo, https://github.com/KatanaGraph/katana/blob/master/libgalois/include/galois/ArrowRandomAccessBuilder.h

The version that's there currently uses vector for its initial storage, and converts to arrow at the end. The unfinished part I mentioned removes the need for vector (at least for POD properties). I'll ping this issue when I get to working on that!

ddn0 commented 3 years ago

If I understand it correctly, there are basically three current use-cases for PropertyView:

Analytics applications that need to random-access read-write to fixed-width fields associated with (primarily) nodes and maybe edges
Building new property graphs in the partitioner and elsewhere
Convenience casting arrow typed properties into C++ typed properties (in similar spirit to arrow::stl::TableFromTuple)

For read-write use [cases], @danielmawhirter has some wrappers around arrow Builders. We should be using those for read-write use cases.

I don't think the builder API is going to support case 1 that well.

My suggestion is the @danielmawhirter + @arthurp approach:

A mutable PropertyView but only ever supporting fixed-width types (potentially with the restriction of never supporting nulls either)
PropertyView being read-only but fully supporting arrow types
A specific builder API and weaning the partitioner off using mutable PropertyViews

arthurp commented 3 years ago

@ddn0 makes good points here. I support this overall.

(potentially with the restriction of never supporting nulls either)

I don't think we can safely support nulls unless we use a setter API instead of an "assign to reference" API. And writing to bitmaps in parallel is really hard (oh the false sharing). So I think we will want to ban nulls on mutable views and only reintroduce them if we have a good design and a specific need.

roshandathathri commented 3 years ago

I'm uncomfortable about (1) both due to the restriction on nulls and due to the restriction to fixed-width types. However, I'm not strongly against it.

arthurp commented 3 years ago

@roshandathathri can you provide specific reasons for your discomfort? Such things are very useful. Like a use case or an expectation that a user might have.

roshandathathri commented 3 years ago

My discomfort about (1) is about whether we can automatically check and enforce the assumptions like fixed-width types and no nulls without facing an undue burden to the application developer (who is the user I'm concerned about here). If there's a clear and enforceable choice for the application developer, then I would be more comfortable with it.

As an example consider applications like matrix-completion and mrbc where the node data is a vector. If we are developing these applications, then we would know which datatype is best to use. If they are implemented by users of Katana (python interface), do we have the same confidence that they can pick the right datatype?

KatanaGraph / katana

Rework "NA" (i.e., null, invalid) handling in PropertyViews #25