finch-tensor / Finch.jl

Sparse and Structured Tensor Compiler
http://willowahrens.io/Finch.jl/
MIT License

Level Format Interface RFC #77

Closed: willow-ahrens closed this issue 1 year ago

willow-ahrens commented 1 year ago

Let's clean up the level formats a bit so that they can interoperate with other systems! A level is essentially an allocator for vectors, which allows us to represent e.g. matrices efficiently as vectors of vectors.

If all goes well, the level formats will do mostly the same thing they already do, but the interface will be clear. This is about reorganizing and documenting what already works.

1. A set of functions that defines a level at runtime

Currently, the Fiber type is declared as

struct Fiber{Lvl, Env}
    lvl::Lvl  # the level: an allocator for the vectors (fibers) it stores
    env::Env  # everything needed to pick one fiber out of lvl
end

If a level lvl is an allocator for vectors, fiber(lvl, env) is one such vector, where env holds enough information to single out that particular vector within the level.

We can then define AbstractArray methods on the fiber, e.g.

Base.ndims(fbr::Fiber{<:SparseBytemapLevel}) =
    1 + ndims(Fiber(fbr.lvl.lvl, Environment(fbr.env)))

We can access an element of a fiber as fbr(idx).

The problem is that it's unclear what information needs to be put in the environment to make a fiber out of a level. The environment is currently just a dict: we can put whatever info we want into it, but most of the time the most important variable is the position. If the level is a vector of fibers, the position is an integer specifying which fiber to use.

I think we should formalize a level as a vector of fibers. Each level should return its own fiber object. The index with which we access a level becomes its position, and we will see that several other functions have good analogies here.
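
To make this concrete, here's an untested toy sketch (all names hypothetical, not the real Finch types) of a dense level viewed as a vector of fibers, with the position selecting one of them:

# untested toy sketch; names are hypothetical
struct ToyDenseLevel{Lvl}
    shape::Int  # length of each fiber stored at this level
    lvl::Lvl    # child level; here just a flat Vector at the leaves
end

struct ToyFiber{Lvl}
    lvl::Lvl
    pos::Int    # the position: which fiber within lvl
end

# a level behaves like a vector of fibers: indexing it with a position yields a fiber
fiber_at(lvl::ToyDenseLevel, pos::Int) = ToyFiber(lvl, pos)

# element access at a dense level is position arithmetic into the child
(fbr::ToyFiber{<:ToyDenseLevel})(idx::Int) =
    fbr.lvl.lvl[(fbr.pos - 1) * fbr.lvl.shape + idx]

# a 2×3 matrix as a level holding 2 fibers of length 3
lvl = ToyDenseLevel(3, [1, 2, 3, 4, 5, 6])
fiber_at(lvl, 2)(1)  # == 4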

2. A set of functions that defines a level at Finch compile time

getname(tns) gets the name of the tensor.

getsites(tns) gets the names of the modes of the tensor. This is useful for dimensionalization.

getsize(tns, ctx, mode) gets the sizes of the modes of the tensor.

setsize(tns, ctx, mode) sets the sizes of the modes of the tensor. Notably, setsize can currently get called on a tensor that is being read if a tensor with the same name has previously been written to and the new sizes need to be propagated. Not sure if we like that.

getsize and setsize might be generalized a bit to include other uses of declare_dimensions_access and infer_dimensions_access, so that index modifiers like permissive indices and protocols can avoid directly overloading those.

default(tns) is the value a tensor is initialized to, and often is the "zero" value.
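
To make the shape of these queries concrete, here's an untested toy sketch for a dense tensor wrapper; the real versions take a compiler ctx and mode and emit code, and all names here are hypothetical:

# untested toy sketch; the real interface threads a ctx through these calls
struct ToyTensor
    name::Symbol
    dims::Vector{Int}
    fill::Float64
end

getname(tns::ToyTensor) = tns.name
getsites(tns::ToyTensor) = collect(1:length(tns.dims))      # one site per mode
getsize(tns::ToyTensor) = Tuple(tns.dims)
setsize!(tns::ToyTensor, dims...) = (tns.dims .= dims; tns)
default(tns::ToyTensor) = tns.fill                          # the "zero" value

t = ToyTensor(:A, [4, 5], 0.0)
getname(t), getsites(t), getsize(t), default(t)  # (:A, [1, 2], (4, 5), 0.0)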

initialize_level! currently emits code to set a tensor to its default and get it ready for iterating. This is like zero!(lvl) or perhaps empty!(lvl).

finalize_level! currently emits code to clean up a tensor after writing to it. This is where one would, for example, call qsort on a list of coordinates.

initialize sometimes needs to be called on subranges of the level, so we should add specializations.
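
A toy version of this lifecycle (untested, hypothetical names; the real functions emit code into the kernel rather than doing the work eagerly):

# untested toy; a COO-style level that buffers unsorted entries
mutable struct ToyCOOLevel
    coords::Vector{Int}
    vals::Vector{Float64}
end

# initialize_level!: reset the level to "empty at the default value"
function initialize_level!(lvl::ToyCOOLevel)
    empty!(lvl.coords)
    empty!(lvl.vals)
    return lvl
end

# finalize_level!: sort by coordinate so readers can iterate in order,
# the moral equivalent of the qsort step mentioned above
function finalize_level!(lvl::ToyCOOLevel)
    p = sortperm(lvl.coords)
    lvl.coords = lvl.coords[p]
    lvl.vals = lvl.vals[p]
    return lvl
end

lvl = ToyCOOLevel([3, 1, 2], [30.0, 10.0, 20.0])
finalize_level!(lvl)
lvl.coords, lvl.vals  # ([1, 2, 3], [10.0, 20.0, 30.0])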

assemble! currently emits code to allocate a new fiber in a level. This is like a resize! on the level vector.
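
For a dense leaf level the resize! analogy is almost literal (untested toy, hypothetical names):

# grow the backing vector to pos_stop and fill new slots with the default
function assemble!(val::Vector{Float64}, pos_stop::Int, fill_value::Float64)
    pos_start = length(val) + 1
    resize!(val, pos_stop)  # toy assumes pos_stop only grows
    val[pos_start:pos_stop] .= fill_value
    return val
end

val = Float64[]
assemble!(val, 4, 0.0)  # val == [0.0, 0.0, 0.0, 0.0]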

unfurl(fbr, ctx, mode, idx, idxs...) emits looplets to iterate over a fiber. Often, idx is inspected to determine which protocol to use. At the end, exfurl is called recursively to pass the looplets through index modifiers. I want to tweak this interface to pass along an extent during unfurling. I also want to remove unfurl's responsibility for handling the tail (irrelevant) indices or calling itself recursively.
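
To show where the extent would plug in, here's an untested toy that "unfurls" a sparse vector fiber over an explicit extent; the real unfurl emits looplets at compile time rather than materializing pairs like this, and the names are hypothetical:

# untested toy; yields only the stored (index, value) pairs inside the extent
struct ToySparseVec
    idx::Vector{Int}
    val::Vector{Float64}
end

unfurl(fbr::ToySparseVec, ext::UnitRange{Int}) =
    [(i, v) for (i, v) in zip(fbr.idx, fbr.val) if i in ext]

unfurl(ToySparseVec([2, 5, 9], [1.0, 2.0, 3.0]), 1:5)  # [(2, 1.0), (5, 2.0)]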

refurl(tns, ctx, mode, idxs...) gets called on a fiber just after it is created. It can be helpful to emit code here if you want to do something eagerly.

stylize_access(node, ctx, tns) gets called on tensors to ask whether they want to participate in Looplet lowering. I think this should be a trait instead.
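
The trait version could look something like this (untested sketch, hypothetical names), using the usual Holy-trait pattern so a type opts in with one method instead of overloading stylize_access directly:

# untested sketch of the trait idea
abstract type LoopletStyle end
struct IsLooplet <: LoopletStyle end
struct NotLooplet <: LoopletStyle end

struct ToySparseFiber end                      # some toy tensor type

looplet_style(::Any) = NotLooplet()            # conservative default
looplet_style(::ToySparseFiber) = IsLooplet()  # types opt in with one method

# one generic entry point dispatches on the trait, no per-type boilerplate
stylize_access(tns) = stylize_access(looplet_style(tns), tns)
stylize_access(::IsLooplet, tns) = :looplets   # participate in Looplet lowering
stylize_access(::NotLooplet, tns) = :default   # fall back to plain lowering

stylize_access(ToySparseFiber())  # == :looplets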

TL;DR: Levels are vectors of fibers, make levels define their own Fiber types, use traits to avoid boilerplate.

rayegun commented 1 year ago

Could you make up a tiny example notebook with a basic CSR (and maybe one additional format) tree that shows these functions in action? It's hard to concretize some of them in my head.

Specifically env is still not very clear to me.

willow-ahrens commented 1 year ago

There is kind of a weird dynamic going on with multi-index levels. The interface between a level and its child is clear, but doesn't (and shouldn't) apply to intermediate results of unfurling a multi-index level (i.e. popping one index off a multi-index level). I think it makes sense to at least document which properties are expected to be defined in environments, but we might want to move to a trait-based interface where each level defines its own Fiber.

willow-ahrens commented 1 year ago

I want to make a notebook, but I'm not sure I'll have time until after the camera ready deadline. This is the sort of interface that is difficult to make better until there are many examples of what it needs to do.

rayegun commented 1 year ago

What are the existing multi-index levels? Examples of where they appear?

willow-ahrens commented 1 year ago

https://github.com/willow-ahrens/Finch.jl/blob/main/src/sparsehashlevels.jl
https://github.com/willow-ahrens/Finch.jl/blob/main/src/sparsecoolevels.jl

rayegun commented 1 year ago

Wait I don't understand, how is COO multi-index? Isn't it singleton - singleton - ...?

willow-ahrens commented 1 year ago

If we encode singletons by themselves, we need to come up with a much more complicated level interface that handles non-uniqueness/non-sortedness. Encoding singletons by themselves also doesn't map well to Array-Of-Structs COO. I found it easier to manage the indices all in the same level.
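
For example (untested toy, hypothetical names), keeping both indices in one level means one sorted table of coordinate tuples, which maps directly onto array-of-structs COO:

# untested toy; both indices live in the same level
struct ToyCOO2D
    coords::Vector{Tuple{Int,Int}}
    vals::Vector{Float64}
end

# one lexicographic sort maintains the whole invariant; with one singleton
# level per index, each level would have to coordinate with its parent
function toy_coo(coords, vals)
    p = sortperm(coords)
    ToyCOO2D(coords[p], vals[p])
end

A = toy_coo([(2, 1), (1, 3), (1, 1)], [4.0, 5.0, 6.0])
A.coords  # [(1, 1), (1, 3), (2, 1)]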

willow-ahrens commented 1 year ago

One possible approach here is to introduce a struct that allows us to just emit looplets for more than one index at a time (i.e. just unfurl the whole tensor at once, why not?). This would eliminate the need for complicated environments that reflect internal state that doesn't actually need a formal interface.

willow-ahrens commented 1 year ago

We would just need to leave a little wrapper struct to mark which places in the looplet nest expect to accept a new index.

willow-ahrens commented 1 year ago

This would make it relatively easy to write looplets for existing sparse matrices too, as we could just emit a double-index looplet for the array without introducing new types.

willow-ahrens commented 1 year ago

A few remaining tasks:

willow-ahrens commented 1 year ago

After several fixes, these changes have finally been implemented, and initial documentation was added in #117. I think more docs would be nice, but at this point this RFC is closed and we can move on.

willow-ahrens commented 1 year ago

To summarize, in the end env was simplified/restricted to just pos.
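
Roughly, the (level, environment) pair collapsed to a (level, position) pair (untested sketch, hypothetical names):

# untested sketch; cf. the Fiber{Lvl, Env} struct above
struct ToySubFiber{Lvl}
    lvl::Lvl
    pos::Int  # the one piece of "environment" that survived
end

ToySubFiber([1, 2, 3], 2)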