Closed jw3126 closed 5 years ago
That's because I'm using the inference API too much. I have a few ideas to avoid relying on inference too much and want to do it at some point, but it would be a rather big surgery.
It's great that you could fix the above example! There are, however, other broken examples:
julia> using Transducers, Test
julia> @inferred collect(Take(10), 1:100)
ERROR: return type Array{Int64,1} does not match inferred return type Any
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] top-level scope at REPL[9]:1
Ah, thanks for catching this.
So, not all transducers can be inferred at the moment. Any (combination of) transducers with `needintype` returning `true` would break the inference:
julia> Transducers.needintype(Map(exp) |> Filter(x -> x > 0))
false
julia> Transducers.needintype(TakeLast(3))
true
julia> Transducers.needintype(Map(exp) |> TakeLast(3))
true
(BTW, fixing `Transducers.needintype` for `TakeLast`, `DropLast`, `Partition`, `PartitionBy`, and `Unique` should be easy now. They all need to use the push!-or-widen variant which I call `push!!`.)
But actually `Take` should work since
julia> Transducers.needintype(Take(3))
false
I thought something like the following with `collect(Take(10), Base.Generator(identity, 1:100))` (since only the iterator's `__foldl__` implements the "tail-call function-barrier" approach at the moment) would fix it. But it didn't help....
diff --git a/src/library.jl b/src/library.jl
index 38e4efc9..b5134e51 100644
--- a/src/library.jl
+++ b/src/library.jl
@@ -373,18 +373,23 @@ end
start(rf::R_{Take}, result) = wrap(rf, xform(rf).n, start(inner(rf), result))
-next(rf::R_{Take}, result, input) =
+@inline next(rf::R_{Take}, result, input) =
wrapping(rf, result) do n, iresult
+ Base.@_inline_meta
if n > 0
- iresult = next(inner(rf), iresult, input)
+ iresult′ = next(inner(rf), iresult, input)
n -= 1
+ if n <= 0
+ return n, _complete_take(rf, iresult′)
+ end
+ return n, iresult′
+ else
+ return n, _complete_take(rf, iresult)
end
- if n <= 0
- iresult = reduced(complete(inner(rf), iresult))
- end
- return n, iresult
end
+@inline _complete_take(rf, iresult) = reduced(complete(inner(rf), iresult))
+
"""
TakeLast(n)
Thanks for the explanations! It is counterintuitive that `Take` is harder to infer than e.g. `Filter`.
I guess that's kind of a duality principle. Everything that is easy in iterators is "co-easy" in transducers (and vice versa).
I mean, `Filter` is super simple, and look at the `Take` code above :rofl:
That's a cool explanation. A bit off topic: do you know if there is a precise sense in which transducers and iterators (or rather some kind of functions `iterator -> iterator` like `map`, `filter`, ...) are dual?
I'm actually very interested in this, too. I used the notion of duality in a totally non-rigorous sense above (i.e., iterators are "pull"-based API and transducers are "push"-based API). But I cannot help wondering if there is a way to actually formalise that. Unfortunately, my category theory knowledge is still too elementary to find a good formalism...
Here are some observations that might be a starting point. What I think needs to be done is to define the following things:

- `Iterator{State, Item}`
- `RedFun{Result, Item}`
- maps `Iterator{State, Item} → Iterator{State', Item'}`. Examples are `map`, `take`, `filter`.
- maps `RedFun{Result, Item} → RedFun{Result', Item'}`. These are called transducers and examples are `Map`, `Take`, `Filter`...

Once they are defined we need to show that `Iterator` and `RedFun` are dual.
I am unsure what the right notion of morphism between these things is. My feeling is that `Map`, `Take`, `Filter`... are more special than arbitrary functions `Iterator → Iterator`.
But here are at least some toy definitions of `RedFun` and `Iterator` and a sense in which they are dual:
Model 1. Here is the simplest model of an iterator I could come up with:

`iter :: State → Item × State`

If you want to run it, it is up to the user to provide an initial state. Then the iterator will produce an infinite series of items.
Turning around the arrows and renaming `State` to `Result` we get:

`rf :: Result × Item → Result`

This is a bare-bones reducing function. It has no start and no completion.
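To make this concrete, here is a minimal Julia sketch of Model 1 and its dual (`iter1`, `rf1`, and `run3` are invented names for illustration):

```julia
# Model 1 iterator: State → Item × State (it never stops).
iter1(state) = (state, state + 1)        # Item = State = Int here

# Its dual (arrows reversed, State renamed to Result): Result × Item → Result.
rf1(result, item) = result + item

# Drive the iterator by hand and feed each item to the reducing function.
function run3(iter, rf, state, result)
    for _ in 1:3
        item, state = iter(state)
        result = rf(result, item)
    end
    return result
end

run3(iter1, rf1, 1, 0)  # == 1 + 2 + 3 == 6
```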
Model 2.

`iter :: State → Maybe(Item × State)`

If you want to run it, you still need to provide an initial state. From this it produces items until it stops at some point by returning `nothing`.
Turning around the arrows and renaming `State` to `Result` we get:

`rf :: Maybe(Result × Item) → Result`

Now a map `Maybe(X) → Y` is the same as a `y::Y` along with a map `X → Y`. This means `rf` consists of `initial_result::Y` and a map `Result × Item → Result`. So we have a reducing function with a start.
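A minimal Julia sketch of Model 2, using `nothing` for the empty case of `Maybe` (all names invented for illustration):

```julia
# Model 2 iterator: State → Maybe(Item × State); stops by returning nothing.
iter2(state) = state > 3 ? nothing : (state, state + 1)

# The dual rf :: Maybe(Result × Item) → Result decomposes into an initial
# result (the Nothing branch) plus an ordinary step (the Result × Item branch):
initial_result() = 0
step(result, item) = result + item

function collect2(iter, state)
    acc = initial_result()
    ret = iter(state)
    while ret !== nothing
        item, state = ret
        acc = step(acc, item)
        ret = iter(state)
    end
    return acc
end

collect2(iter2, 1)  # == 1 + 2 + 3 == 6
```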
Model 3.

`iter :: Maybe(State) → Maybe(Item × State)`

So this data is equivalent to

`init :: Maybe(Item × State)`
`next :: State → Maybe(Item × State)`

So this is a full-blown iterator. Turning around the arrows and renaming `State` to `Result` we get:

`rf :: Maybe(Result × Item) → Maybe(Result)`

This is equivalent to

`initial_result :: Maybe{Result}`
`next :: Result × Item → Maybe(Result)`

I think we can interpret this as a reducing function with early stopping. At some point it is done and signals this by returning `nothing`.
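To make the early-stopping reading concrete, here is a hedged Julia sketch (with `nothing` playing the role of `Maybe`'s empty case; all names invented, and "keep the last result on stop" is just one possible convention):

```julia
# rf :: Maybe(Result × Item) → Maybe(Result): an initial result plus a step
# that may signal "done" by returning nothing.
initial_result() = 0
step(result, item) = result + item >= 6 ? nothing : result + item

# A fold that respects the stopping signal:
function fold3(items)
    acc = initial_result()
    for x in items
        acc′ = step(acc, x)
        acc′ === nothing && return acc   # stopped; keep the last result
        acc = acc′
    end
    return acc
end

fold3(1:100)  # stops after three items instead of looping over all 100
```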
Here are some odd things / issues with this:

- It is a bit strange that we do not dualize the `Maybe`. An `Iterator` in a category does not give us a `RedFun` in the opposite category.
- This does not talk about morphisms, and gives no clue how e.g. `map` can be dualized into `Map`.
- While Model 3 seems like the right notion of iterator, I am not sure if it is also the right notion of reduction function.

Actually, I just remembered that I've heard related concepts: catamorphism and anamorphism. Reading the Wikipedia pages (at least the parts whose meaning I can guess) and also watching this Lambda World talk by Zainab Ali, it seems like the relationship is something like:
Julia | Anamorphism | Julia | Catamorphism
---|---|---|---
iterator (`iterate`) | F-coalgebra `coalg` | reducing function (`rf`) | F-algebra `alg`
"iterator transformation" (`iter -> iter'`; e.g., `iter -> map(f, iter)`) | `coalg -> coalg'`? | transducer (e.g., `Map`) | `alg -> alg'`?
"(generic) collect" (`collect`, for `Vector`s) | `ana(coalg)` | fold (`xs -> foldl(rf, xs)`) | `cata(alg)`
`push!`, for `Vector`s (kind of) | final F-coalgebra `out` | `xs -> (xs[1], xs[2:end])` (kind of) | initial F-algebra `in`
It looks like they are pretty much what you suggested. For example, this (from Wikipedia) looks exactly like your Model 2:

> A Haskell definition of an unfold, or anamorphism for lists, called `ana`, is as follows:

```haskell
ana :: (state -> Maybe (value, state)) -> state -> [value]
ana f stateOld = case f stateOld of
    Nothing                -> []
    Just (value, stateNew) -> value : ana f stateNew
```
So, I guess the good news is that the duality of catamorphism and anamorphism is already there. But does it give us the duality of transducers and "iterator transformations" automatically? I mean, as you said,

> This does not talk about morphisms, and gives no clue how e.g. `map` can be dualized into `Map`.

sounds like a pretty hard problem.
> While Model 3 seems like the right notion of iterator, I am not sure if it is also the right notion of reduction function.

Rich Hickey's transducers (and so Transducers.jl) use `Union{Result, Reduced{Result}}` as the termination mechanism (see https://tkf.github.io/Transducers.jl/dev/manual/#Transducers.Reduced). I guess a more Haskell-like way would be `Either{Stopped, Result}` where `Stopped` is the type of the value to be returned in the early-stopping case (so mostly `Stopped == Result`). But it's different from `Maybe{Result}`. Perhaps it's related to your point about "not dualizing the `Maybe`"? I need to look into catamorphism a bit more to see what's going on here.
(BTW, going back to the original topic, many more things are inferable now in the `dev` branch. It's not yet merged into `master` as it needs other unregistered packages of mine.)
It is really cool that this library is becoming more and more inferable. Thanks for the great work! Thanks also for sharing the anamorphism stuff and the table! I will need to think a bit about it.
As for defining iterator transforms / transducers / duality, here is another idea:
We have a pairing

`foldl :: RedFun × Iterator → Result`

Now, given an arbitrary map `f :: Iterator → Iterator` and an arbitrary map `F :: RedFun → RedFun`, we say that they are adjoint if the following holds:

`foldl(F(_), _) == foldl(_, f(_))`

Under this definition it should be easy to see that `(Map, map)`, `(Filter, filter)`, `(Take, take)`... are adjoint pairs. One could even define a transducer as an arbitrary map `RedFun → RedFun` that has an adjoint, and similarly for iterator transforms.
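For example, the `(Map, map)` pair can be checked against this definition directly (a sketch; `Transducers.foldl` takes the transducer as its second argument, in the same form as used elsewhere in this thread):

```julia
using Transducers

# Left-hand side: fold with the transducer Map(x -> 2x) fused into rf.
lhs = foldl(+, Map(x -> 2x), 1:3; init = 0)

# Right-hand side: fold over the plainly transformed iterator.
rhs = Base.foldl(+, map(x -> 2x, 1:3); init = 0)

lhs == rhs  # both sides compute 2 + 4 + 6
```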
Thanks! Yeah, I think Transducers.jl finally got to a beta-ish-level quality so that it's ready for non-toy examples.
Your observation about adjoint is very interesting.
> One could even define a transducer as an arbitrary map `RedFun → RedFun` that has an adjoint, and similarly for iterator transforms.

So I guess this corresponds to the rule in transducers that you shouldn't touch `result` in `rf(result, input)`. Same thing for `state` in `iterate(iter, state)`.
This definition is quite satisfactory as it seems to correspond to the reason why `eduction` (`(xf, iter) -> iter`) is implementable at all. I guess the other direction (`(ixf, rf) -> rf`; `ixf`: iterator transform) is also possible (and would be as hairy as `eduction`).

So I should be doing `Base.adjoint(::typeof(map)) = Map` etc.!
> So I guess this corresponds to the rule in transducers that you shouldn't touch `result` in `rf(result, input)`. Same thing for `state` in `iterate(iter, state)`.
Yeah, I think `Result` serves two distinct roles. For stateless red funs there is no problem: `Result` is simply the final result of the computation. But for stateful red funs (like `Take(3)(rf)`) the result should be split into a (private) `State`, which is used at each step of the computation, and a (public) `Result`, which is the final result of the computation. So I think conceptually we have `RedFun{_State, Item, Result}` and `Iterator{_State', Item}`.
> So I should be doing `Base.adjoint(::typeof(map)) = Map` etc.!
At first I thought this is a great idea. But now I am not sure about it. In a perfect world we would have
foldl(adjoint(f)(rf), itr) == foldl(rf, f(itr))
This would allow for instance to write powerful + elegant unit tests:
```julia
for F in [Map(sin), Take(3), Enumerate()]
    f = adjoint(F)
    @test foldl(adjoint(f)(rf), itr) == foldl(rf, f(itr))
end
```
Here this fails for two reasons. The proposal is at the type level and missing currying.
Ah, right. I need to do `Base.adjoint(xf::Map) = itr -> map(xf.f, itr)` and so on.
That would be more correct. But then returning lambdas is painful. They don't display nicely, give strange error messages, and you can't dispatch on them, which is needed for the other direction `adjoint(map(f)) == Map(f)`.
Yeah, I agree it won't be practically useful (except for the unit test you mentioned which is already a good motivation for defining it, at least with a private name).
Yeah, I think defining the lambda-returning version under the name `Transducers.adjoint` is fine.
So I implemented the iterator transform-to-transducer mapping `(itr -> itr) -> rf -> rf`: https://gist.github.com/tkf/331e3941355881e0a69809cf1c3b1327

It looks like it's working:
julia> collect(Transduction(itr -> Base.Generator(x -> x + 1, itr)), 1:2)
2-element Array{Any,1}:
2
3
julia> collect(Transduction(itr -> Iterators.filter(isodd, itr)), 1:2)
1-element Array{Any,1}:
1
As I thought, it's pretty ugly. For example, `next` is:
```julia
next(rf::R_{Transduction}, result, input) =
    wrapping(rf, result) do (itr, itstate0, buffer), result
        push!(buffer, input)
        if itstate0 isa Unseen
            ret = iterate(itr)
        else
            ret = iterate(itr, itstate0)
        end
        while true
            if ret === nothing
                return (itr, itstate0, buffer), reduced(complete(inner(rf), result))
            end
            itout, itstate = ret
            result = next(inner(rf), result, itout)
            if result isa Reduced || isempty(buffer)
                return (itr, itstate, buffer), result
            end
            ret = iterate(itr, itstate)
        end
    end
```
I probably didn't need to do this, but it's nice to have a "constructive proof" showing that iterator transforms and transducers are equivalent in terms of semantics (though, I believe, not in performance characteristics).
Anyway, I appreciate all the ideas! I now feel like I understand iterators and transducers more.
Thanks for sharing this. I am learning a lot from this discussion. Here are some comments; some of them may already be obvious to you or slight reformulations of things you already said.
You defined the function `leftadjoint` that satisfies:

`foldl(leftadjoint(f)(rf), itr) == foldl(rf, f(itr))`
The other direction `rightadjoint :: (rf -> rf) -> (itr -> itr)` is given by:

```julia
using Test
using Transducers

function rightadjoint(xf)
    function itf(itr)
        eduction(xf, itr)
    end
end

xf = Map(cos) |> Take(4)
itr = randn(10)
rf = +
@test foldl(rf, xf, itr) == Base.foldl(rf, rightadjoint(xf)(itr))
```
* So that means we have a 1:1 correspondence between transducers and iterator transforms? We are not there quite yet I think. There is a fishy detail, namely `leftadjoint` will not halt in general. We still need proper definitions of `RedFun`, `Iterator` etc. to prove that `leftadjoint` and `rightadjoint` give us a 1:1 correspondence.
* If we restrict to finite iterators I believe I can define these terms and prove the 1:1 correspondence.
> namely `leftadjoint` will not halt in general
It seems that this is just because `foldl` is not a total function (and we need `foldl` in the definition). `rf` can simply be non-total, or `itr` can be infinite (a problem if `rf` is non-reducing). There are also particularly bad combinations like `Cat()` with `[[0], [1], infinite_iterator]`. But this is a PITA because I know that the transducer

```julia
repeatfirst(rf) =
    function (result, input)
        while true
            result = rf(result, input)
            result isa Reduced && return result
        end
    end
```

has an obvious adjoint iterator transform.
If we are OK with dealing with equality of infinite iterators (as done in math with infinite sets), I guess I can generalize your definition of adjoint by saying that transducer `F` and iterator transform `f` are adjoint iff, for any iterator `itr`, the following program never crashes:

```julia
xs = Channel() do xs
    foldl(F(put!), itr; init=xs)
end
ys = Channel() do ys
    foldl(put!, f(itr); init=ys)
end
while true
    isclosed(xs) && isclosed(ys) && break
    @assert take!(xs) == take!(ys)
end
```

where `isclosed` is a function that can check if the next `take!` would throw.
...but what I wrote with `Channel` was `eduction` in disguise (and `eduction` is un-curried `rightadjoint`). Am I just saying something tautological? But I'm guessing it's OK. We just need a mechanism to deal with multiple infinite iterators concurrently (which is `Channel` in Julia). It just turned out that `rightadjoint` can be constructed programmatically using it.

Interestingly, even though `Channel` makes `rightadjoint` trivial to write, I don't think it makes `leftadjoint` easier to write (OK, maybe just slightly).
Ah nice. One way to think about the last few posts is like this: there are two approaches to guarantee termination. One is on the iterator side and one is on the reduction function side.

- the fix on the reduction function side (finite rf, arbitrary itr): your `Channel` program uses `put!`, which I think of as the universal finite rf
- defining finite rf
- the fix on the iterator side (arbitrary rf, finite itr)
> defining finite rf

Actually, I think we only need to show it for "basis" reducing functions. By that, I mean the reducing functions generated by the following helper function:
```julia
reduce_on_nth(n) =
    function (i, input)
        if i == n
            return Reduced(input)
        else
            return i + 1
        end
    end
```
The reducing function `reduce_on_nth(n)` reduces on the `n`-th invocation with the `input` at that point. Then, we can say that transducer `F` and iterator transform `f` are adjoint iff, for any iterator `itr` and for any natural number `n`,

`foldl(F(reduce_on_nth(n)), itr; init=1) == foldl(reduce_on_nth(n), f(itr); init=1)`

holds. (More precisely, we need to use `Transducers.transduce` instead of `Transducers.foldl` as the latter un-wraps the returned value automatically.)
> the fix on the iterator side (arbitrary rf, finite itr)

This sounds very challenging to me. I mean, there are iterator transforms that turn a finite iterator into an infinite one (e.g., an iterator transform adjoint to the transducer `repeatfirst` in my example above). So, I think allowing arbitrary reducing functions would require us to solve the original problem directly. (This line of thought actually led me to write `reduce_on_nth` above.)
> defining finite rf
>
> Actually, I think we only need to show it for "basis" reducing functions. By that, I mean the reducing functions generated by the following helper function:

That is a nice idea, too. Like `push`, I think of these "basis" reductions as a surrogate for all finite rfs.
> the fix on the iterator side (arbitrary rf, finite itr)
>
> This sounds very challenging to me. I mean, there are iterator transforms that turn a finite iterator into an infinite one (e.g., an iterator transform adjoint to the transducer `repeatfirst` in my example above). So, I think allowing arbitrary reducing functions would require us to solve the original problem directly. (This line of thought actually led me to write `reduce_on_nth` above.)
This was about exploring the theory. In order to prove something you need to impose finiteness conditions somewhere. For instance, `+` is a reasonable reduction function and `repeated(42)` is a reasonable iterator. Both are fine on their own, but `foldl(+, repeated(42))` will not halt.

Of course there are transforms that turn a finite iterator into an infinite one. But then there are also transducers that turn a finite red fun into an infinite red fun. Anyway, in practice I would not worry much about this.
This was about exploring the theory.
This is my interest too. At this point I don't care much about direct "benefit" of this exploration in the practical coding (though obviously it would be great if there is any; the unit testing example you brought up above was such an example). So, it's purely for fun and I appreciate the entertainment!
What I meant by "challenging" was that formalising it via the "finite itr" route seemed difficult to me. I mean, it's hard for me to imagine that there is any "special effect" due to the finiteness of the iterator compared to the finiteness of the reducing function.
In order to prove something you need to impose finiteness conditions somewhere.
Isn't it implicitly there already, as we are comparing the results of `foldl(xf(rf), itr)` and `foldl(rf, itf(itr))` (they must be computable)? By that, I'm thinking of something like:
Definition. A pair of a transducer `xf` and an iterator transform `itf` is adjoint iff, for all reducing functions `rf` and iterators `itr` such that `foldl(xf(rf), itr)` terminates, `foldl(rf, itf(itr))` terminates and

`foldl(xf(rf), itr) == foldl(rf, itf(itr))`

holds.
Directly showing it is hard, but I think we have a:

Theorem. `xf` and `itf` are adjoint if they satisfy the `reduce_on_nth`-based equality in https://github.com/tkf/Transducers.jl/issues/13#issuecomment-508957920
Proof. For `rf` and `itr` such that `foldl(xf(rf), itr)` terminates, `rf` is called only a finite number of times `n`. The equality in https://github.com/tkf/Transducers.jl/issues/13#issuecomment-508957920 shows that the series of `n` inputs that `rf` receives for `foldl(xf(rf), itr)` and `foldl(rf, itf(itr))` are equal. To show that `foldl(rf, itf(itr))` terminates, let's consider two cases separately.

If `foldl(xf(reduce_on_nth(n)), itr; init=1) isa Reduced`, it means that the termination is due to `xf` and `itf`. It also indicates that the termination occurs at the same time in both cases. Thus, `foldl(rf, itf(itr))` terminates if `foldl(xf(rf), itr)` terminates.
We now consider the other case, `foldl(xf(reduce_on_nth(n)), itr; init=1) !isa Reduced`. It implies that `itr` produces only a finite number of items. Thus, for all `k > n`,

```julia
foldl(xf(reduce_on_nth(k)), itr; init=1) ==
    foldl(reduce_on_nth(k), itf(itr); init=1) ==
    n + 1
```

Note that the value `n + 1` cannot be produced by `itf`, as any output of `itf(itr)` would be wrapped by `Reduced`. Thus, `itf(itr)` produces exactly `n` items in this case and `foldl(rf, itf(itr))` must terminate. □
Okay, here is what I had in mind.

Definitions

`FinItr{I} := List{I}`
`ItrTrafo{I1,I2} := FinItr{I1} -> FinItr{I2}`
`RedFun{Result, Item} := FinItr{Item} -> Result`

Observe that `RedFun{_, Item}` is a functor (aka `Hom(List{Item}, _)`). We write `foldl` for the pairing `RedFun{Result, Item} -> FinItr{Item} -> Result`.

`Transducer{I1, I2} := RedFun{_, I1} -> RedFun{_, I2}`
Comments
Theorem

There is a canonical isomorphism:

`Transducer{I1, I2} = ItrTrafo{I2, I1}`

Proof: This is a special case of the Yoneda lemma.
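Spelling out the Yoneda step (a sketch; this assumes `Transducer{I1, I2}` is restricted to *natural* transformations between the functors `RedFun{_, I1}` and `RedFun{_, I2}`, which corresponds to the "don't touch `result`" rule discussed above):

```latex
\mathrm{Transducer}\{I_1, I_2\}
  = \mathrm{Nat}\bigl(\mathrm{Hom}(\mathrm{List}\{I_1\}, -),\; \mathrm{Hom}(\mathrm{List}\{I_2\}, -)\bigr)
  \cong \mathrm{Hom}(\mathrm{List}\{I_2\}, \mathrm{List}\{I_1\})
  = \mathrm{ItrTrafo}\{I_2, I_1\}
```

where the middle isomorphism is the Yoneda lemma applied to the representable functor `Hom(List{I1}, -)`; note how the indices swap, matching the contravariance in the theorem.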
Now we have our adjunction. Observe that so far everything is abstract nonsense. We did not use finiteness of your iterators or that they are lists.
However, we should clarify how the abstract-nonsense definition of `RedFun{R,I}` relates to our intuitive notion of a reduction function:

Theorem

Let `rf::RedFun{R,I}`. Then there exists a type `State` and objects

`start::State`
`next::I -> State -> State`
`complete::State -> R`

such that

`rf(itr) = complete(classic-foldl(next, itr, init=start))`
Proof: The idea is to just memorize the whole `itr` and do everything in `complete`:

`start := []`
`next := push`
`complete := rf`
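As a quick sanity check of this construction in Julia (a sketch; `rf = sum` is an arbitrary choice and `start`/`next!`/`complete` are the names from the theorem):

```julia
# Realize rf :: FinItr{Item} -> Result as (start, next, complete) by
# memorizing the whole input and deferring all work to `complete`.
start() = Any[]                         # State = a growable list
next!(state, x) = (push!(state, x); state)
complete(state) = sum(state)            # the original rf; here rf = sum

# classic-foldl over the items, then complete:
complete(Base.foldl(next!, [1, 2, 3]; init = start()))  # == sum([1, 2, 3]) == 6
```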
Cool. The Yoneda lemma has been on my things-to-learn list for a long time. I really should do it...
Going back to the discussion about the "finite itr" route: as you said, you don't really need it during the category-theoretic construction. So I suppose you defer mentioning finiteness until you are about to connect to the classic foldl? You can then say that `itr :: Itr{Item}` has an equivalent/canonical finite-list representation, as `rf(itr)` can only consume a finite number of items when `rf` is interpreted as a function. Also, why not define `RedFun{Result, Itr} := Itr -> Result` instead of `RedFun{Result, Item} := Itr{Item} -> Result` etc.? This would get rid of the non-totality problem, as `Itr` can just be a set of iterators such that `rf(itr :: Itr)` would terminate (when interpreted as a function).
Yeah, you are completely right: finite iterators are only needed for the classical interpretation of `RedFun`.

And defining `RedFun{Result, Itr}` is a good idea. However, I am not sure what a good notion of infinite iterators is, or rather of transforms between them. I think allowing arbitrary maps is wrong here. I think any reasonable transform between iterators is completely determined by its restriction to finite iterators. On the reduction function side I would like to see the classical notion with early stopping.
Above you said that you believe that transducers and itertransforms have different performance characteristics. Can you expand on that?
> However I am not sure what a good notion of infinite iterators is, or rather transforms between them.
The trick I have been trying to use so far is to say that, once the reduction is finished, we don't need to care about infinity, and the theory about finite iterators works fine. That is to say, we can model iterator transforms on "infinite iterators" as the transforms that work with finite iterators of arbitrary length. (OK, I guess maybe it's kind of obvious, but I thought this notion was enough.)
BTW, I think hiding the "element type" of the iterator, as in `RedFun{Result, Itr} := Itr -> Result`, is somewhat related to Rich Hickey's point that the reducing/step function needs a "superscript" in the result type: https://youtu.be/6mTbuzafcII?t=1720
> I think allowing arbitrary maps is wrong here.

Why not? Set-theoretic maps are supposed to terminate, so I guess there is no difficulty there? I think allowing potentially non-total functions is bad, though.
> Above you said that you believe that transducers and itertransforms have different performance characteristics.

This will be my main point in the JuliaCon talk, so it's nice that you asked now!

Actually, my focus will be that `foldl` can be optimized more easily than `iterate` (rather than that transducers themselves are good). A very straightforward example is `LazyArrays.Vcat`. It is a lazy version of `vcat` where you keep the original vectors in a tuple. You can write `foldl` for it as "unrolled" loops:
```julia
function __foldl__(rf, acc, coll::LazyArrays.Vcat)
    for x in coll.arrays[1]
        acc = rf(acc, x)
    end
    for x in coll.arrays[2]
        acc = rf(acc, x)
    end
    ...
    for x in coll.arrays[end]
        acc = rf(acc, x)
    end
    return acc
end
```
(The "unrolling" is actually done via recursion, using `__foldl__` for the inner loops as well. See: https://github.com/tkf/Transducers.jl/blob/1dccc3333b93edc8eeb835f1c0335155b344bcdb/src/interop/lazyarrays.jl)

This works even when the component arrays have different `eltype`: `coll.arrays[1]` may be a `Vector`, `coll.arrays[2]` a product, and `coll.arrays[3]` itself can also be a `LazyArrays.Vcat` or a similar chunked data structure (e.g., `BlockArrays`). All of these cases will be recursively specialized (if we use `__foldl__`, instead of the loop as in the example above). On the other hand, keeping track of the state with `iterate` is hard and yields poor performance. Julia's Union splitting helps to some extent, but there is a limit to it.
Also, I think `foldl` is just much easier to write than `iterate`. Here is an example of `Vector{Vector{T}}`:
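(The linked example appears to be missing from this thread. As an illustration of the point, here is a minimal sketch of such a `foldl`, written as a hypothetical standalone `nested_foldl` rather than the actual `__foldl__` protocol:)

```julia
# foldl over Vector{Vector{T}}: two nested loops, no iteration state to juggle.
function nested_foldl(rf, acc, xss::Vector{<:Vector})
    for xs in xss
        for x in xs
            acc = rf(acc, x)
        end
    end
    return acc
end

nested_foldl(+, 0, [[1, 2], [3], Int[]])  # == 6
```

The equivalent `iterate` method has to encode "which inner vector, at which index, and have we started yet" into a single state value, which is what makes the iterator version hard to write and hard on the compiler.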
As another example, here is my (re-)implementation of `Array{Union{A,B,...,Z}}`: https://github.com/tkf/UnionArrays.jl. You can get a type-stable reduction with an arbitrary length of `Union` if you have `rf(::R, ::Union{A,B,...,Z}) :: R`. A quick benchmark shows that it's much better than `Array`, but I need to put it somewhere... Of course, `Array` also has to be optimized for compile time, so maybe that limits how many tricks they can use.
The big message is that, if you implement a collection, you are the best person to write the loop because you know all of its properties (memory layout, types, etc.) and can encode them into the looping strategy.
However, classic `foldl` in `Base` cannot be used as a drop-in replacement for `iterate` because it has no early-termination mechanism and no way to transform the items fed to the reducing function. The first part is solved via `Reduced` and the second part is solved via transducers.
Another benefit of transducers is that it's easy to do fan-out type reduction. I don't think it's possible with iterator transforms (with the same memory requirement). See the example of `GroupBy` combined with `ReduceIf`: https://tkf.github.io/Transducers.jl/dev/manual/#Transducers.GroupBy
Also, there is a crazy example where `Filter(!ismissing)` can be inferred to exclude `Missing` but `filter(!ismissing)` can't. Maybe it means that transducers are more "compiler friendly" as they introduce tons of function boundaries?
To be fair, let me think about some benefits of the iterators:

- `@inbounds`: I don't think there is a generic solution to it in foldl/transducers (if you can prove that transducers are non-expanding, you can do the same trick; but not with generic transducers).
- `for` syntax is much better than `foldl` or `foreach`!

What do you think about the whole logic of foldl/transducers -vs- iterators? I'd appreciate your comments on this!
The argument that the author of an iterator type is usually the best person to implement the reduction loop is very compelling to me. In the past I sometimes implemented a specialized `foreach` and used it instead of for loops on my iterators. This library is a better solution.

I don't have much else to add to transducers vs iterators. I found your post quite interesting and agree with everything. Looking forward to your JuliaCon talk.
I can maybe explain a bit why I am excited about this library. I work with phase space files. These are basically gigantic lists of particles, too large to fit into memory. And I am interested in questions like:

- A list of all photons that will hit this tiny box
- What is the average direction of particles weighted by their charge in each cell of this grid
- A histogram of the energy of neutrons that ...
Most (all?) of these questions can be built up from basic operations like `filter` (restrict to electrons), `map` (to select the energy property), `groupby` (each cell of the grid)...
So I need a framework to build up complex operations from these primitives. Ideally that framework should have the following properties:
Does not allocate intermediate iterators. These are often too large to fit into memory.
Should be performant
Should not be verbose; I like to interactively ask tons of quick questions on truncated files to get a rough impression.
Should be composable, I like to mix and match. E.g. ask on the same selection of particles for different properties etc.
The only composition mechanism other than `Transducers.jl` is building up functions by hand (e.g. `take(10, filter(cond, map(f, itr)))`). As for primitives, `Base.map` and friends are problematic because of allocations. Lazy variants work better, but I also encountered performance issues last time I checked, or they miss features. And composition is still `take(10, filter(cond, map(f, itr)))`.
Often I end up using `Base` functions for exploration and ad hoc hand-written for loops to get good statistics.

`Transducers.jl` is very promising. In theory it fulfills all of the above points. In practice inference is not quite good enough yet, so I don't actually use it as of now. It is great that you are working on inference, big thanks! If inference gets good, I think this will be one of my favorite Julia libs.
Thank you very much for very detailed inputs! It is encouraging that you find Transducers.jl promising.
> I sometimes implemented a specialized `foreach`

Good to know. Maybe you'd find `AdHocFoldable` useful. That's a utility function I added recently: https://tkf.github.io/Transducers.jl/dev/manual/#Transducers.AdHocFoldable
> groupby

FYI, `GroupBy` in transducers is not tuned yet. When I looked at group-by in DataFrames.jl, SplitApplyCombine.jl, etc., they all use some internals of `Dict` to minimize hashing. There is no tuning like this at all ATM. Just so you know, so that you won't be disappointed when you test it :)
> In practice inference is not quite good enough yet.

Ah yes, I guess that's an important point to list as a con of transducers (a pro for iterators). This happens typically because transducers have to handle `Union{Result, DefaultInit{op}, Reduced{Result}, Reduced{DefaultInit{op}}}`. [1] On the other hand, iterators only have to handle `Union{State, Nothing}`. So, more examples work when `init` is provided (e.g., `@inferred foldl(+, Take(1), 1:1)` fails but `@inferred foldl(+, Take(1), 1:1; init=0)` works). I think in many examples you can find a reasonable `init`; but I know it's not really convenient...
I think solving the inferrability problem with the current Julia compiler requires relying on the compiler API. I was relying on it too much before. Now I do the complete opposite and try not to rely on inference at all, to decouple the semantics from the compiler. But I think an opt-in compiler dependency is OK. It is partially implemented and you can use it with `init=Transducers.OptInit`. So far it is only used for the "empty collection" case where the reducing function never gets called. A more aggressive version would be to use it for setting `init` before starting the reduction. (But using the inference API in such a way that the compiler can infer the result is rather tricky...)
Another approach would be to forget about inferrability of the outermost call and let a function barrier type-stabilize the loop. As most of the type instabilities come from the "not yet started" marker object, the majority of the loop can be done in a type-stable way. This requires what I call the "tail-call function-barrier" technique, but it makes the implementation really ugly (see also: https://discourse.julialang.org/t/25831). I'm doing this only for the generic iterator fallback because doing it with SIMD support is going to be really hard (although this may be the only reasonable option for now...). Also, I haven't experimented with how this interacts with the nested for loop case. It may require you to wait until the type-stabilization bubbles up to the outermost loop.
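To illustrate the idea (a hedged sketch, not the actual Transducers.jl implementation; `NotStarted` and `barrier_foldl` are invented names):

```julia
# "Tail-call function-barrier": whenever the accumulator's type changes (e.g.
# from the "not yet started" marker to a concrete value), restart the loop via
# a recursive call so that the rest of the loop sees a concrete accumulator.
struct NotStarted end

function barrier_foldl(rf, acc, itr, state...)
    ret = iterate(itr, state...)
    while ret !== nothing
        x, s = ret
        acc′ = acc isa NotStarted ? x : rf(acc, x)
        if typeof(acc′) !== typeof(acc)
            # Tail call with the new accumulator type: this call is the
            # function barrier behind which the loop is type-stable.
            return barrier_foldl(rf, acc′, itr, s)
        end
        acc = acc′
        ret = iterate(itr, s)
    end
    return acc
end

barrier_foldl(+, NotStarted(), 1:4)  # == 10
```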
[1] `DefaultInit{op}` is the default `init` for `foldl(op, ...)`; see also https://tkf.github.io/Transducers.jl/dev/examples/empty_result_handling/. By the way, initially I thought it might be related to `const MAX_TYPEUNION_LENGTH = 3` in the Julia compiler, but it turned out that re-building Julia with larger limits doesn't help....
> As most of the type instabilities come from the "not yet started" marker object, the majority of the loop can be done in a type-stable way.

I take this back. If you have, e.g., `Filter(pred) |> Take(1)` where `pred` very rarely (or never) holds, then the majority of the iterations have to deal with the `Union`....
I am closing it as I opened #18 to track the actual inference issue. But please feel free to keep posting (or open a new issue). This thread was fun!
This thread was fun!
Yeah, learned a lot in this thread!
There are problems with the inferred return type of collect: