eernstg opened 3 years ago
I see two or more approaches here.

- Calling `toList` on the first iteration. Good if you have an expensive initial computation, then cheap iteration.
- Caching elements incrementally as they are requested (but asking for `length` would immediately fill it up). Good if each element computation is expensive.

Both are potentially costly in memory when the cached values aren't needed any more. If it's a list, that's less surprising than if it's an iterable.
Almost exactly this sort of `LazyList` pattern became dominant in dartdoc, though unfortunately a refactor to condense it into a class has not happened. Wish I had thought of that!
I can confirm @lrhn's statement that it is definitely costly in memory when the cached values aren't needed anymore, if you use this idea extensively in your program. The challenging bit is determining when they won't be needed. I could in theory measure and predict, for my use case, which declarations generate instances iterated over 1, N, or sometimes even specific values of N times. This tends to be consistent between runs of my application. Measuring this is challenging, though, unless I build a tool to do so specifically, so my in-practice default for a batch application has tended to be "cache them all and let the profiler sort it out if people complain about memory usage". Not exactly a nice outcome.
A really smart `LazyList` could have a debugging version that logs how many times each instance derived from a declaration is iterated over, and help the programmer not waste memory by suggesting a parameter to an optimized version that knows to either never build a list, or discard it after a certain number of reads. Or it could check in with a manager singleton that responds to memory pressure, or other possibly over-complicated ideas to help users make better tradeoffs.

Worth thinking about though, if the cache / no-cache decision on `Iterable`s is an important cause of waste -- CPU or memory -- in typical user code.
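A rough sketch of what such a debugging aid could look like (the name, the counting strategy, and the reporting are all hypothetical, not an existing API):

```dart
import 'dart:collection';

/// Hypothetical instrumented wrapper: counts how often each instance is
/// iterated, so a debug build could report which declarations' instances
/// are re-iterated and would benefit from caching.
class CountingIterable<T> extends IterableBase<T> {
  final Iterable<T> _source;
  final String debugLabel;
  int iterationCount = 0;

  CountingIterable(this._source, this.debugLabel);

  @override
  Iterator<T> get iterator {
    iterationCount++;
    if (iterationCount > 1) {
      // A real tool would aggregate this rather than print it.
      print('$debugLabel: iteration #$iterationCount; consider caching.');
    }
    return _source.iterator;
  }
}
```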
For comparison, an incrementally caching list could be:
```dart
// Assumes `ListBase` from dart:collection, plus some unmodifiable-list mixin
// that provides the mutating members (throwing UnsupportedError).
class LazyList<T> extends ListBase<T> with UnmodifiableListMixin<T> {
  List<T> _elements = [];
  int? _length;
  Iterator<T>? _sourceIterator;
  Iterable<T> _source;

  LazyList(Iterable<T> elements) : _source = elements;

  int get length => _length ?? _computeAll();

  /// Ensures that the element at [index] has been computed, if it exists.
  bool _upTo(int index) {
    var length = _length;
    if (length != null) return length > index;
    var iterator = _sourceIterator ??= _source.iterator;
    while (_elements.length <= index) {
      if (iterator.moveNext()) {
        _elements.add(iterator.current);
      } else {
        _length = _elements.length;
        return false;
      }
    }
    return true;
  }

  bool get isEmpty => !_upTo(0);
  bool get isNotEmpty => _upTo(0);
  T get first => _upTo(0) ? _elements.first : (throw StateError("No element"));
  T get last {
    _computeAll();
    return _elements.last;
  }

  T operator [](int index) {
    _upTo(index);
    return _elements[index]; // Throws the right error if not long enough.
  }

  int _computeAll() {
    var length = _length;
    if (length != null) return length;
    var iterator = _sourceIterator ??= _source.iterator;
    while (iterator.moveNext()) {
      _elements.add(iterator.current);
    }
    return _length = _elements.length;
  }
}
```
(Concurrent modification of the source would be a problem.)
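For example, a small usage sketch (assuming the `LazyList` class above; `numbers` is a hypothetical source):

```dart
Iterable<int> numbers() sync* {
  for (var i = 0; i < 100; i++) {
    print('computing $i');
    yield i;
  }
}

void main() {
  var lazy = LazyList(numbers());
  print(lazy[2]);     // Computes and caches elements 0..2 only.
  print(lazy[1]);     // Served from the cache, nothing recomputed.
  print(lazy.length); // Forces the remaining elements to be computed.
}
```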
Cool stuff!
Note that it's trivially possible to reset `_list` to null in the `LazyList` that I showed, and it would then just be recomputed when needed. It could be interesting to have a global priority queue of weak references to the `LazyList` objects whose `_list` is non-null, and then allow us to get some references from that priority queue and go out and reset the `_list`. Ideally, it would be the GC that did this, based on a heuristic about the frequency of GCs or somesuch.
This would introduce another precondition for usage: A `LazyList(iterable)` should never be created if it is inappropriate to use `iterable` to iterate over all its values repeatedly, potentially each time `iterator` is called.
See also here.
Oh! I'm just seeing `CachingIterable` for the first time. Thanks for this link @leafpetersen!

I think that could be taken as a sign that we should have `CachingIterable` (or whatever variant turns out to work best) in a core library. It is in fact not specific to Flutter.
By the way, I think the incremental approach (`CachingIterable` and this one) may be slower than the variant that relies on invoking `toList()` when the first iteration occurs: With an incremental approach we never avoid the indirection (for instance, we're probably relying on an iterator that uses `operator[]` during the iteration, and `operator[]` would be implemented by additional code in the caching iterable-or-list).

In contrast, if we use `toList()` of the underlying iterable and `iterator` of the resulting list, then it should be possible for an implementation to detect that the given iterator is an iterator of a system-provided list type, and then it could presumably run something like a traditional C-style for loop to do the iteration. (Yes, today we probably just check the iterable, not its iterator, but that shouldn't be too hard.)

So, in the case where it isn't crucial to avoid running `toList()` just before the first iteration, this could yield a better overall performance.
Interesting.
One advantage of the `late final` with in-line closure call approach that I've been moving toward for dartdoc is that it sidesteps the need for indirection after the first call. (I haven't measured how much of a difference this makes... maybe not enough for the complication?) This method is also generically applicable to any data structure that you want cached, and without any underlying implementation support required. A dartpad-able example:
```dart
Iterable<String> stringGenerator() {
  print('building');
  return ['hello', 'there'];
}

class A {
  late final Iterable<String> cachedStrings = (() => stringGenerator().toList())();
}

void main() {
  var x = A();
  print('A initialized');
  print(x.cachedStrings);
  print(x.cachedStrings);
  print(x.cachedStrings);
}
```
The downside is some boilerplate of parens/braces and a bit of headscratching when you first see it.
Yes, a `late final` with an initializing expression is actually quite nice (and safe), and if you don't need to write any statements then you could of course just omit the function literal:

```dart
class A {
  late final cachedStrings = stringGenerator().toList();
}
```
I guess we could use this rule of thumb: "An initialized late final is a cached getter".
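That is, an initialized `late final` behaves essentially like a hand-written caching getter (a small illustrative sketch, reusing `stringGenerator` from the earlier example):

```dart
class B {
  // "Cached getter" via an initialized late final: computed once, on first read.
  late final List<String> strings = stringGenerator().toList();

  // Roughly equivalent hand-written caching getter.
  List<String>? _cache;
  List<String> get stringsViaGetter => _cache ??= stringGenerator().toList();
}
```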
The original argument seems a bit of a straw argument. You can't do anything with the Iterable without incurring the expense, not even ask `.isEmpty`. It is better not to call `foo()` at all if it is not needed, and if it is needed, then the expense is unavoidable.
I have come to the conclusion that our lazy Iterables are error-prone and should only be used in a linear way: An Iterable may be iterated once, or passed to one place that also has the responsibility to iterate once. If you need to iterate twice, you need to copy to something that supports efficient re-iteration. This property could be checked by static analysis.
There was a bug in the protocol buffer libraries that was responsible for several seconds of jank in an important application. The code to `clone` a message would create a new message object and add elements of a repeated field using `PbList.addAll`. `addAll` has this structure:
```dart
void addAll(Iterable<E> collection) {
  collection.forEach(validate);
  _internalList.addAll(collection);
}
```
This was called, in a different file, using:
```dart
newMessageRepeatedField.addAll(oldMessageRepeatedField.map((item) => item.clone()));
```
The two iterations of the lazy mapped Iterable caused `clone` calls exponential in the depth of repeated structures.
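To see why this blows up, here is a small self-contained sketch (a hypothetical `Node` class, not the actual protobuf code) in which each level of nesting doubles the number of `clone` calls:

```dart
class Node {
  static int cloneCalls = 0;

  final List<Node> children;
  Node(this.children);

  Node clone() {
    cloneCalls++;
    var copy = Node([]);
    // Lazy mapped iterable: each iteration re-runs clone() on every child.
    copy.addAll(children.map((c) => c.clone()));
    return copy;
  }

  void addAll(Iterable<Node> nodes) {
    nodes.forEach(_validate);  // First iteration clones every child once...
    children.addAll(nodes);    // ...and the second iteration clones them again.
  }

  void _validate(Node n) {}
}

void main() {
  // A chain of 20 nested nodes.
  var node = Node([]);
  for (var i = 0; i < 20; i++) {
    node = Node([node]);
  }
  node.clone();
  print(Node.cloneCalls); // Prints 2097151 (2^21 - 1): exponential in depth.
}
```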
One way† to fix this would be:
```dart
void addAll(Iterable<E> collection) {
  final copy = List.of(collection);
  copy.forEach(validate);
  _internalList.addAll(copy);
}
```
`CachingIterable` could have been used too, but I think the performance would not be adequate. Once we have a system list in our hands, which can be determined trivially at compile time for `List.of`, the compilers can generate code where a loop is a handful of instructions, rather than a handful of calls. Forcing the representation to a system list makes subsequent operations an order of magnitude faster.
† What is unfortunate about `addAll`'s list copy is that it is unnecessary when `collection` is already a system list, which is most of the time. The copying cost was measurable and unwelcome on these paths, so the final fix was to add `.toList()` after `.map()` and leave the performance hazard in place.
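That is, the call site ended up roughly like this (reconstructed from the description; the names are from the earlier snippet):

```dart
newMessageRepeatedField.addAll(
    oldMessageRepeatedField.map((item) => item.clone()).toList());
```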
I would find it useful to have a family of idempotent factory constructors that copy an Iterable if it is not a system List, but return the argument if it is already a suitable system List.
`List.ensureUnmodifiable(Iterable<E> input)` — Returns `input` if it is already a List that might have been returned by `List<E>.unmodifiable(input)`, otherwise returns a copy that is guaranteed to be unmodifiable. I would use this to store a local copy of an input to a multi-step algorithm to ensure it cannot be changed between calls into the steps, but avoid a copy if the input is already unmodifiable (e.g. `const []`). A simple example is storing the children in an unmodifiable tree structure, so client code can walk the tree but not change it (see the sketch after these two descriptions).
`List.ensure(Iterable<E> input)` — Returns `input` if it is a system list with interface `List<E>`, otherwise copies to some kind of system List. The intended use is like the call to `List.of` in the `addAll` example - the result is not modified by the method and the extent of the potential copy is just the call to the method.
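As an illustration of the tree use-case for `List.ensureUnmodifiable`: today this has to be written with `List.unmodifiable`, which always copies; the proposed constructor would let that copy be skipped when the input is already unmodifiable. A small sketch, with the proposed call shown only in a comment since it does not exist:

```dart
class TreeNode<E> {
  final E value;
  final List<TreeNode<E>> children;

  TreeNode(this.value, Iterable<TreeNode<E>> children)
      // Always copies today, even when [children] is already unmodifiable.
      // With the proposal this could be: List.ensureUnmodifiable(children).
      : children = List.unmodifiable(children);
}
```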
I would totally support a lint detecting when you iterate an argument iterable more than once. It's a great way to have unexpected side-effects. That's what I generally recommend and try to enforce in the platform libraries and packages.
For the `addAll`, I'd probably prefer doing:
```dart
void addAll(Iterable<E> collection) {
  var preLength = length;
  for (var e in collection) {
    if (validate(e)) {
      _internalList.add(e);
    } else {
      _internalList.length = preLength;
      return;
    }
  }
}
```
(assuming the internal list is just a plain SDK list).
I'm not too psyched about making "system lists" special. You should be able to treat every `List` as if it was a system list. If the object doesn't have efficient `length` and `operator[]`, it shouldn't be a `List`.
(On the other hand, you should treat every `Iterable` as horribly inefficient and not iterate it more than necessary. For experts only: Don't even expect it to be finite :grin:).
What I'm hearing is that the desired behavior of the lazy/caching iterable depends on the use-case, and either can be dangerous if misused because they keep cached values alive after they're actually needed (and both cost extra allocation when not actually needed).
So, the solution would be to provide both, a `CachingIterable` and a `LazyList`, and recommend using them only locally in a single function which knows whether and how it uses them.

(And `LazyList` is probably unnecessary because you can just do `late var lazyList = iterable.toList();` or `List<T>? lazyList; ... lazyList ??= List.of(iterable)`.)
If you know you're going to do multiple partial iterations, and want to cache the values, then `CachingIterable` does make sense; you just need to know when you're done using it so it can be GC'ed. I'm not sure how often that use-case occurs - if you need to iterate more than once, you usually iterate the same number of elements each time (assuming the iterable hasn't changed between iterations, in which case you definitely don't want to cache). How that number is found is likely different between use-cases, so I'm not sure it's something we can easily support generally.
@lrhn The suggestion for `addAll` makes my point that we need something better. The suggestion is horribly coupled to the use of the data from the Iterable, and incorrect since `validate` throws, and sometimes less efficient due to inserting one element at a time rather than as a batch. You would need to do something completely different if the underlying collection was a `Set`. A version of `addAll` that uses 'copy-if-needed' is more clearly validating-then-adding, and the code is uncoupled from the inner `addAll`, so it could be reused (either the actual code, or the simple pattern) in a Set implementation.
Unmodifiable/const `List`s are an important tool because you can reason about the elements being invariant. This should be better supported by the language and libraries.
You might not like to "make" system lists special, but they already are special if your application has constraints on performance, either runtime or memory.
It is difficult to get into the efficient zone, and some mechanisms for getting there, like using an unmodifiable `List`, have other uses, like protecting against modification.
I can reasonably fake `var items = List.ensure(collection);` with `List<E> items = collection is List<E> ? collection : List.of(collection);`.
There is still a performance hazard because the compilers often can't tell that `items` is monomorphic (try passing in `"".codeUnits` and `UnmodifiableListView([])`, and time your program again).
I don't know how to fake `List.ensureUnmodifiable`.
On 'my' app, I could reduce memory below a hard limit by scattering `List.unmodifiable(...)` all over the place, but there was copying of lists that were already unmodifiable, so further reductions could potentially be available if the corelib was more helpful.
Consider the following iterable:
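(The kind of iterable meant here is, roughly, a generator whose body is expensive to run; the concrete body below is just a hypothetical stand-in.)

```dart
Iterable<int> foo() sync* {
  // Imagine this body is expensive: heavy computation, I/O, etc.
  print('Running the body of foo');
  for (var i = 0; i < 3; i++) {
    yield i * i;
  }
}
```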
This iterable is characterized by being costly to iterate, and this means that it will be a significant performance benefit to avoid running the body of `foo` at all if the resulting elements are never needed.

This means that we cannot use the advice "never use `foo()`, use `foo().toList()` instead", unless we know statically that the elements will be used at least once.

However, it is even worse if we execute `foo()` at some point and obtain an object `v`, and `v.toList()` is executed many times. This could occur because the resulting elements are needed many times, in many different locations in a large software system, and `v` is passed there in various ways (as a parameter, stored in a data structure, etc.), but there is no single location which is reached early after the creation of `v` where it is known that this iterable will definitely be used at least once.

This issue is a proposal that we might benefit from having a class similar to the following class widely available:
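(A minimal sketch of such a class, reconstructed from the description below: `toList()` is performed lazily, the first time an iterator is requested, and the result is cached in a `_list` field.)

```dart
import 'dart:collection';

class LazyList<E> extends IterableBase<E> {
  final Iterable<E> _iterable;
  List<E>? _list;

  LazyList(this._iterable);

  @override
  Iterator<E> get iterator => (_list ??= _iterable.toList()).iterator;
}
```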
The concrete implementation is completely minimal, but it allows for the desired effect: We can replace an expression like `foo()` that yields an expensive iterable by the expression `LazyList(foo())`, and the resulting object will perform `toList()` on demand, when the first iterator is obtained from it.

This means that we will avoid `toList()` entirely if the iterable is never iterated (we can pass it around just fine, we just never actually iterate it), and we can pass it around to multiple clients, and we will still only perform `toList()` once if multiple clients make the choice to iterate it.

Obviously, there may be a need to do many other things in order to obtain a good `LazyList` implementation: Presumably, we'd want an unmodifiable list as the value of `_list`, and we'd want to return that list from `LazyList.toList()`, and there could be many other important details. But the core idea is already present in the very simple `LazyList` above.

One assumption is crucial to know for users of `LazyList`, so that must be part of the documentation:

It must be appropriate to invoke `toList()` just before an iteration starts; in particular, a `LazyList` cannot be used on an iterable that delivers an infinite sequence of values, and it should not be used on an iterable which is intended to iterate over just a few elements and then stop the iteration.

Here is an example run:
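(A possible run, using the `foo` and `LazyList` sketches above rather than the original example code:)

```dart
void main() {
  var v = LazyList(foo());
  print('LazyList created');
  print(v.toList()); // First iteration: runs foo's body once, caches the list.
  print(v.toList()); // Second iteration: reuses the cached list.
}
```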
which prints:
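```
LazyList created
Running the body of foo
[0, 1, 4]
[0, 1, 4]
```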