Discussion about the traversal process

CesiumGS / cesium-native

Apache License 2.0

421 stars 212 forks source link

Discussion about the traversal process #126

Closed javagl closed 2 years ago

javagl commented 3 years ago

The current traversal procedure of tiles is - virtually by definition - the most simple traversal process that will ever exist.

In the future, there will be ...

additional culling criteria.
additional visibility checks.
additional flags that can be switched on or off.
additional tile types and providers.
refinements for the refinements.
extensions for the prioriztations of tile loading.
additional information (statistics, performance measures) that have to be collected during the traversal.
additional operations that are supposed to be done during the traversal.
additional strategies for the order in which the children of a tile will be visited.

Each of these could offer great options for new features and improvements. But it also means that the complexity of the traversal can only increase (and I'd say it's bound to increase dramatically). If there is no clear separation of concerns, no abstraction, and no clear, high-level outline for the process itself and the state transitions, the state space will face a combinatorial explosion.

But even though the current state is the most simple state that will ever exist, it's already ... complicated.

An aside: It may sound like an inappropriate and/or bold claim, but I'm pretty sure that we're already at a point where nobody knows exactly what happens during the traversal. One example for what caused my concerns here is that the traversal starts with this method:

Tileset::TraversalDetails Tileset::_visitTileIfNeeded(
    const FrameState& frameState,
    uint32_t depth,
    bool ancestorMeetsSse,
    Tile& tile,
    ViewUpdateResult& result

I tried to figure out the meaning of the ancestorMeetsSse flag. And I think this is essentially not possible. (Spoiler alert: It certainly does not say whether an ancestor meets the SSE). And when it's impossible to clearly state the conditions under which this flag, which is passed into the top-level-traversal call (!) and passed along (and changed occasionally...) is true or false, then that's not good: Any unit-test that is supposed to do something that is even indirectly affected by this flag could just capture the current state, but not say whether the current state is "right" or "wrong".

From an idealistic perspective, I wondered whether it could be possible to move towards a state where the traversal process looks more like this pseudocode...:

ViewUpdateResult updateView(ViewState viewState) {
    TraversalResult traversalResult = traverse(root, TraversalState(viewState, frameNumber));

    addTilesToLoadQueues(traversalResult.tilesToLoad);
    passTilesToRenderer(traversalResult.tilesToRender);
    removeTilesFromRenderer(traversalResult.tilesToRemove);

    return ViewUpdateResult(traversalResult);
}
TraversalResult traverse(Tile tile, TraversalState traversalState) {
    TraversalResult traversalResult(traversalState);

    if (shouldNotBeVisitedForWhateverReason(tile)) {
        traversalResult.visited = 0;
        return traversalResult;
    }
    if (!isVisible(tile, traversalState)) {
        traversalResult.culled = 1;
        butMaybeLoadSiblingsOf(tile, traversalResult);
        return traversalResult;
    }
    if (shouldBeLoaded(tile, traversalState)) {
        load(tile, traversalResult);
    }
    if (shouldBeRendered(tile, traversalState)) {
        render(tile, traversalResult);
    }
    if (shouldBeRefined(tile, traversalState)) {
        refine(tile, traversalResult);
    }
    if (shouldBeCoarsened(tile, traversalState)) {
        removeChildren(tile, traversalResult);
    }
    for (Tile child : tile.children) {
        TraversalState childTraversalState = traversalState.withDepthPlusOne();
        TraversalResult childTraversalResult = traverse(child, childTraversalState);
        traversalResult.combineWith(childTraversalResult);
    }
    return traversalResult;
}

Yes, I know that it cannot be that simple. But some rough (!) ideas reflected by this pseudocode being

There is a top-level updateView function. It receives the FrameState that is constant throughout the traveral. It returns the ViewUpdateResult that summarizes the result of the whole traversal, and contains exactly the information that is supposed to be available for clients (that's roughly as in the current implementation)
There is an internal traverse method (which might even be encapuslated in some TilesetVisitor, FWIW). It receieves a TraversalState that is constant for that particular branch. It returns the TraversalResult which summarizes the information that was accumulated while traversing that particular branch.
The TraversalResult of consists of the information for the given tile, combined with the traversal results for all its children
The conditions for when a tile shouldBeLoaded, shouldBeRendered, shouldBeRefined ... are clearly stated (in the documentation of the respective methods). More broadly speaking, the goal would be to assign clear, concise meanings to terms like "visiting", "loading", "rendering", "refining"
It should be clear where state transitions take place.

The state transitions (reflected with the LoadState and TileSelectionState::Result) already are complex, and every attempt to simplify this will require diligence, a deep understanding of the process, and the goals. But for example, stating the criterion for "rendering a tile" with a method like

bool shouldBeRendered(Tile tile, TraversalState traversalState) {
    if (traversalState.parentWasRendered) {
        return false;
    }
    if (!tile.isLoaded()) {
        return false;
    }
    if (isAnyLoading(tile.children)) {
        return false;
    }
    return true;
}

and establishing a clear connection between the decision of whether a tile should be rendered and the state change, as in

if (!tile.rendered) {
    if (shouldBeRendered(tile)) {
        render(tile);
        tile.rendered = true;
    }
}

could allow to document the meaning of "rendering", and the unit tests could be straightforward, as in

assertFalse(shouldBeRendered(unloadedTile, traversalState));
assertFalse(shouldBeRendered(tileWithLoadingChildren, traversalState));
assertFalse(shouldBeRendered(tile, traversalStateWithRenderedParent));
assertTrue (shouldBeRendered(tile, traversalState));

assertFalse(tile.rendered);
render(tile);
assertTrue (tile.rendered);

In contrast to that, and referring to the (maybe overly specific) example above: The ancestorMeetsSse flag will (among other things) affect whether a tile will be "rendered". And whether it is true depends (among other things) on whether a child of the tile is "not renderable". And a child tile is "renderable" when...

this->getState() >= LoadState::ContentLoaded &&
(!this->_pContent || this->_pContent->model.has_value()) &&
!std::any_of(this->_rasterTiles.begin(), this->_rasterTiles.end(), [](const RasterMappedTo3DTile& rasterTile) {
    return rasterTile.getLoadingTile() && rasterTile.getLoadingTile()->getState() == RasterOverlayTile::LoadState::Loading;
});

I cannot imagine how to write a sensible test, stating the conditions under which a certain tile will be "rendered", and achieving a sensible coverage of the state space that determines whether these conditions are actually met or not.

I know that there is nothing harder than trying to make complex things simple. But I think it is crucial to at least try to work towards such a simpler, clearer state. I also know that what I wrote here may appear to be naive or overly idealistic, and it does not allow deriving immediate actions. But maybe others can share their thoughts or ideas about the possible paths forward.

kring commented 3 years ago

@javagl there is so much to unpack here that it's hard to know where to begin, so let's start with one of the first claims, which I believe is objectively false:

(Spoiler alert: It certainly does not say whether an ancestor meets the SSE).

But... that's exactly what it means.

The one and only place ancestorMeetsSse is set has a large comment explaining why:

            // Otherwise, we can't render this tile (or blank space where it would be) because doing so would cause detail to disappear
            // that was visible last frame. Instead, keep rendering any still-visible descendants that were rendered
            // last frame and render nothing for newly-visible descendants. E.g. if we were rendering level 15 last
            // frame but this frame we want level 14 and the closest renderable level <= 14 is 0, rendering level
            // zero would be pretty jarring so instead we keep rendering level 15 even though its SSE is better
            // than required. So fall through to continue traversal...
            ancestorMeetsSse = true;

I'll try to state that another way: normally, as soon as a tile is deemed to meet the required screen-space error, we stop refining and render that tile. But, in certain situations (which are admittedly complicated, as implied by the comment and further confirmed by the code), it's necessary to refine a tile even though it meets the SSE. When we do that, we pass this flag down to the recursive selection algorithm to indicate that one of our ancestor tiles (parent, grandparent, etc.) has already met the SSE and we're still refining anyway.

Admittedly it's a little confusing that we're setting the passed-in flag to true, rather than setting a new ancestorMeetsSseFlagToPassToChildren or somesuch. That might imply to a reader that an ancestor of the current tile meets the SSE, when really it is the current tile that meets the SSE. I guess that's what you're referring to when you say "it certainly does not say whether an ancestor meets the SSE". Because that flag is not used after the place it's set, other than to pass it to children, shortcuts were taken. I thought it was clear enough given the comment, but certainly it's sloppy and fixing that would be an improvement.

Just in case there's any doubt, do keep in mind that setting the value of a parameter that is passed in by value (as we're doing here), does not affect the value at the call site. We're modifying a copy and then passing the copy to our children.

Anyway there's definitely more to say based on what you've written above, but do we agree on this so far at least?

javagl commented 3 years ago

Maybe I shouldn't have pointed out that particular example. It was one example of something where a hidden complexity becomes apparent that IMHO should not be there. One might alleviate the problem for this particular case (maybe with something as simple as a different name, ... ancestorMeetsSseOrSomeHigherLevelWasRequestedButThereIsNoHigherLevelAvailableButChildrenAreCurrentlyNotRenderableBecauseTheirRasterOverlayTilesAreStillLoading.... ). We also could talk a while about that particular flag, to carve out its meaning, and try to simplify this particular case. But that would miss the point: I tried to understand what this flag means, when it is true or false, and I plainly could not figure it out. The condition for the flag becoming true during the traversal is if (meetsSse || ancestorMeetsSse || waitingForChildren), and waitingForChildren is true when one child is not "renderable", involving the complex and non-intuitive meaning of "renderable" that was shown in the last snippet. Imagine you tried to write Doxygen docs:

@param ancestorMeetsSse A flag that indicates....

... What? I'd really like to see a one-line description of the meaning of this flag, and/or a unit test where it 1. is true and stays true, 2. is true and becomes false, 3. is false and stays false, 4. is false and becomes true. Otherwise it's just that: A flag that is passed along, sometimes true, sometimes false, without any tangible or testable meaning...

The broader point was: The process (including the conditions for visiting, loading, rendering, refinement, and coarsening, and the state transitions) should be described more modularly, in a form that is easier to understand, verify, and test. There may be complicated cases that are hard to articulate in such a simple form. But I find it concerning that even a seemingly simple condition (as I said, in the simplest process that we'll ever have to deal with) is already hard to describe now.

But maybe that's just naive. If the process cannot be described in a simpler, clearer, more easily testable form, then that's probably something that we just have to cope with.

kring commented 3 years ago

There is no question that this can be coded better. That said, I wasn't trying to be obtuse. I tried to make a complicated thing as simple as I could, given the limits of my ability and the time I could allot to it. I mention this not to be defensive (even though it probably comes off that way), but just to point out that you need to make concrete suggestions for improvement, not just complain that it's complicated and hard to understand.

In the abstract, your suggestion of embodying a higher-level view of the process in the code is rational and a great goal. And I don't doubt for a second that it's possible to at least get closer to that ideal. And I support doing that, even though I can't dedicate the time to it right now.

But the specific case you're ripping on, ancestorMeetsSse, is kind of baffling. "A flag that is passed along, sometimes true, sometimes false, without any tangible or testable meaning" is an astounding mischaracterization. It has a crystal clear, tangible, meaning ("we're visiting this tile even though one of its ancestors already met the SSE requirements"). The "we're visiting this tile even though" bit is implied I guess, expecting that the reader knows that in hierarchical LOD we usually don't visit child tiles of a tile that meets the SSE. As for why we're visiting a subtree when an ancestor has already met the SSE, that is also tangible and testable, even if it's complicated. It's described by the comment I pasted above.

For doxygen of this parameter, I might say: "Normally when a tile meets the required SSE, we don't visit its children. However, we sometimes need to do so in order to avoid detail blinking out of existence when the camera moves and ancestor tiles are not yet renderable. This flag indicates that an ancestor met the SSE requirement so we're only visiting this tile because we're in one of those situations."

The specific situation where we need to do this (i.e. "why is this flag ever true?") is in the comment I pasted above. At the code level, even, it's fairly straightforward. ancestorMeetsSse is true for a subtree when meetsSse || ancestorMeetsSse || waitingForChildren is true (this condition could collectively be described as "we've decided this is the right tile to render for this subtree") and !renderThisTile is true (this condition could be described as "but sadly we can't render this tile because it's not renderable yet and rendering nothing here would cause us to lose detail compared to what we rendered last frame". In other words, we continue refinement past a tile that meets the SSE when "we've decided this is the right tile render for this subtree but sadly we can't render this tile because it's not renderable yet, and rendering nothing here would cause us to lose detail compared to what we rendered last frame."

I'd argue that all that information was in the code, but not stated all in one place or so succinctly. Would that have helped your understanding?

I guess maybe I'm still missing the point you're making, which is "this is complicated and I didn't understand it." Well, to that, I say, "yes, it's complicated, yes we should make improvements as we see them to make it less complicated or easier to understand, but no I don't think it is incomprehensible." So let's make improvements, but let's not start from a place of "OMG this is chaos!" because it's not. I guarantee I can give you a succint explanation of what every parameter or property or conditional means and why it is used the way it is, even if it's not well documented or perfectly clear upon a quick reading of the code.

Ok, I'm done being defensive now (for now! 😆). I think it would be very useful and interesting to try to restructure the code along the lines you've stated in the original issue, and see where we can get to. I think that needs to happen after our initial release though. @baothientran I know you're starting to write tests for this, so if you're facing the same difficulties that @javagl mentions, and it can't be quickly resolved by asking me, then it might be best to let me write the tests instead. Once those are in place, we'll be in a much better place to refactor this algorithm.

baothientran commented 3 years ago

@kring Nothing major comes up so far for me. The only small hurdle was the concurrent code, but I was able to mock it to be a single thread, so it should be good. I will let you know if something comes up.

javagl commented 3 years ago

There is not so much that would require being "defensive" (or consider something as "offensive", if that's supposed to be the opposite), at least from my point of view. But if it is, my "defensive" position: I tried to articulate (and justify, technically) my concerns about the complexity of the traversal process and the conditions and state changes that it involves. I tried (but maybe failed) to not phrase it as a rant, as in "OMG this is all chaos". Reasons for the intricacies of the current state do not matter so much either, as long as there is some consensus that it could be (or should be, or has to be) reviewed. (And the reasons for me not understanding the process may as well be that I'm stupid and lazy and incompetent *shrugs*).

One difficulty for me is to be "constructive": On the one hand, the process is complicated, and I don't understand it. On the other hand, making specific suggestions (on a level where they would allow to derive actual action patterns (as in "code changes")) would require a deep understanding.

This has a strong connection to testing: Each change that I could suggest could change the behavior in an undesirable way, and I don't see a way to reliably make (or even suggest) specific changes, without risking to break things. For example: The _visitTile method originally had ~200 LOC. "Visting" apparently means rendering, loading, culling, refining, and/or coarsening/kicking descendants. The input involves the FrameState, the Tile, my beloved ancestorMeetsSse, and others. The method modified the Tileset, the given Tile (and possibly other tiles), the ViewUpdateResult, and the TraversalDetails. For me, this means that any "unit test" right now can only consist of setting up one configuraton, calling this method, observing the result, and carving this result into stone via some assertions. Nearly every change would break such a test (which could be called "overfitting"). I cannot imagine stating sensible pre- and postconditions, testing one aspect of the behavior, or making sure that a refactoring (even if it might change the behavior) does not "break" the desired behavior. But I would really like to be proven wrong about that point.

^{Details and examples - skip them if you like:}

I tried to be constructive, though: I read through the code and (locally) tried to move towards a state that I'd consider more comprehensible and testable. As an overly specific, low-level example, I looked at the method

bool Tileset::_meetsSse(const ViewState& viewState, const Tile& tile, double distance, bool culled) const {
    // Does this tile meet the screen-space error?
    double sse = viewState.computeScreenSpaceError(tile.getGeometricError(), distance);
    return culled ? 
           !this->_options.enforceCulledScreenSpaceError || sse < this->_options.culledScreenSpaceError :
           sse < this->_options.maximumScreenSpaceError;
}

and tried to imagine (or rather: actually write) a comment for this. It could have been

/*
 * Returns whether the given tile meets the default `maximumScreenSpaceError` that was
 * specified in the global options, or it was `culled` and the `enforceCulledScreenSpaceError` 
 * in the options is `false` or the tile meets the `culledScreenSpaceError` from the
 * global options. 
 */

But when a method is called meetsSse, the comment should IMHO say /* Tests whether the given tile meets the given SSE */. Not more. Not less. Changing this to

bool Tileset::_meetsSse(const ViewState& viewState, const Tile& tile, double distance, double maxSse) const {
    double sse = viewState.computeScreenSpaceError(tile.getGeometricError(), distance);
    return sse < maxSse;
}

and changing the call to

    if (culled && this->_options.enforceCulledScreenSpaceError) {
        meetsSse = _meetsSse(frameState.viewState, tile, distance, this->_options.culledScreenSpaceError);
    } else {
        meetsSse = _meetsSse(frameState.viewState, tile, distance, this->_options.maximumScreenSpaceError);
    }

made the call appear to be more complicated, unfortunately, but would give the method a clearer meaning and make it more testable (and avoid unnecessary calls to computeScreenSpaceError for the culled && !enforceCulledScreenSpaceError case, by the way).

I went on to the ancestorMeetsSse flag, and added something like the ancestorMeetsSseFlagToPassToChildren that you mentioned, but wasn't able to go far beyond that, because I failed to understand its meaning in a depth that would be required to confidently simplify anything there.

In view of the pseudocode that I posted, one part of my attempts also aimed at not enqueueing the tile-loading tasks during the traversal itself, but rather collect them in the traversalDetails and move them into the queues later. As part of that, I tried to understand the concept of "kicking" tiles, involving things like wasRenderedLastFrame and wasReallyRenderedLastFrame (?). My thoughts here could roughly be articulated as the question: "Why are tiles first enqueued or added somewhere, and then kicked? Wouldn't it be possible to just not enqueue them in the first place?". But ... it's complicated. Tiles are added to the load-queues, and removed again later. Does that mean that I could omit the enqueueing of the tiles? Maybe. Maybe not: The state of the tiles is set to RenderedAndKicked or RefinedAndKicked in this process. Based on a quick full-text search, this state does not seem to be checked explicitly anywhere (except for prior to setting it). I guess that this state has some implicit meaning, as in "The state is not Rendered (maybe RenderedAndKicked, or something else, but not Rendered"). Does that mean that I could combine these into a single Kicked state? Maybe. Maybe not...

Long stoy short: I had to "zoom" into branches of the code where I couldn't be sure whether they depend on a state that I don't know, or whether they modified a state in a way where I couldn't foresee the implications of changes. This process felt rather ineffective, and I realized that my attempts could not be compatible with the efforts of getting out the release (including, but not limited to the efforts that go into writing unit tests for the current state).

I hoped that this "discussion" issue could be a place to draft some thoughts in a "top down" style, and maybe identify some caveats that stand in the way of the naive, idealistic pseudocode sketch from the first comment.

A short note regarding ancestorMeetsSse: From my understanding now, it could be described as "Enforce a traversal (visting/loading/rendering/refining) of the children, because the tile that is actually supposed to be rendered is not completely loaded yet". Of course, there's the famous quote...

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

So we could talk about calling it enforceTraversalForUnrenderableParent or so. But the exact conditions for this flag being true or false are still not trivial.

Far more abstractly, I wondered whether the fact that the "ancestor meets the SSE" is relevant, or the fact that the children should be traversed, as in the difference between

if (car.speed > limit) tooFast = true;
...
if (tooFast) car.brake();

and

if (car.speed > limit) haveToBrake = true;
...
if (haveToBrake) car.brake();

and whether things might become clearer by tracking some int deepestAvailableLevel and checking

if (desiredTile.isRenderable()) {
    // Yay, it's there!
    render(tile); 
} else {
    // Oh dear! 
    if (state.deepestAvailableLevel > depth) {
        // Well, at least we can use the finer state as the fallback
        render(desiredTile.children); 
    }
}

or so, but that all this would require more thought...

kring commented 3 years ago

Hey Marco, I'm sorry but I really can't take the time to dive into this and respond in detail to your latest post right now. I've read it, and it's a mix of very useful insights, and also a few moments where you're missing a key piece of information that would change the story significantly. I want to respond to it, because if nothing else it would help develop our collective understanding of how things work. But I'm going to have to come back to it later.

I just want to be clear, though, that I don't think your difficulty understanding this in any way implies a failing on your part. It's complex. It represents at least 10 years of evolution, starting with Cozzi's thesis and the virtual globe book and then probably at least three major iterations in CesiumJS. Not to mention roots going back to STK. And it's certainly not as clearly-written as it could possibly be.

I probably reacted more harshly in my last responses than I should have. Some of your comments rubbed me the wrong way, but I think your intentions are good and your push to improve this (and everything else) is appreciated.

Finally, I do think it is quite possible to write sensible tests for this. Tests that check for desired behavior, not just regression tests that capture current behavior and make sure it doesn't change in the future. But, to be clear, I'm not asking you to do it! I've long assumed I would have to. Bao kindly volunteered to do so last week, but I'm fully expecting him to tell me before long that I'm crazy and to do it myself! :)

javagl commented 3 years ago

I know that the issues like this tend to not have the highest priority. Particularly not shortly before a release. You already mentioned that whatever might be done here, it will happen after the release. But starting to think about that after the release will likely be too late: Carving out such ideas takes time, likely some iterations (and maybe some failed attempts), and it requires knowledge about "why things are the way they are" to be spread among those who are supposed to contribute. As such, and as I mentioned, this issue is intended for exactly that: Identifying possible blockers for future extensions (like the ones in the bullet points of the opening post), caveats that may prevent certain (naive) simpler solutions, and sketching possible approaches for keeping the increasing complexity manageable.

(I've seen enough cases where not enough emphasis was put on abstractions, separation of concerns, and "keeping things simple" in general, with the result of technical debt being accumulated (aka: "that's legacy code"....), complex issues piling up, and the development coming to a grinding halt. As such, an implicit goal is achieving a state where someobody can, for example, implement a new culling criterion, and (idealistically) add it with cullingChecks.push_back(myNewCriterion), without having to understand each and every detail of each and every other culling/refinement/rendering/traversal strategy, their parameters, and their state transitions. Or to put it that way: People should spend more time with writing code than with reading code. Nobody likes reading code...)

kring commented 3 years ago

@javagl what do you think we should do with this issue? Here are some actions I can think of that we might derive from it:

Have more detailed discussions about how tileset traversal works and how its implementation can be improved. I think we might be near the limit of what we can achieve on GitHub, though, and maybe need a live conversation and a whiteboard to really take it further.
Refactor the tileset traversal for clarity, without changing how it works. This is safer now that @baothientran has implemented some unit tests for it.
Write some documentation of how the tileset traversal works. I'd probably be best placed to do this.
Start designing a next-gen, pluggable selection/loading algorithm. A lot of thought would need to be given to requirements, first. i.e., what problems are we actually trying to solve. And it would need to be closely aligned with the implicit tiling / VR tile selection work happening elsewhere in Cesium.

Any others? I think we can write actionable issues for 2 and 3. 1 could be worth doing but doesn't need an issue. I'm hesitant to embark on 4 at the moment, and I don't think we need an issue for it in the meantime, but might change my tune in the next few months.

javagl commented 3 years ago

As mentioned above:

On the one hand, the process is complicated, and I don't understand it. On the other hand, making specific suggestions (on a level where they would allow to derive actual action patterns (as in "code changes")) would require a deep understanding.

Or to put it that way: I don't think that these 4 points can be strictly separated. The next-gen traversal (4) might be the ultimate (and somewhat abstract) goal, but placing a bunch of engineers into a room with the task "Solve this problem, and report when you're done" is not gonna fly, for many obvious reasons. Regarding the documentation (3): There are different kinds of documentation (one could call them "levels"). "Low-level" documentation (// Comments and /* doxygen */) only make sense when they convey relevant knowledge. (I remember writing the API docs, and felt uncomfortable when I had to "document" functions like queueLoadOfChildrenRequiredForRefinement() with a comment like /* Queue load of children required for refinement */, without really saying anything...). Creating a "high-level" documentation (on the level or a markdown file) for existing code (in hindsight) is hard, and almost certainly not really useful: While one might have an idealized idea about the process on the level of nice diagrams with boxes and arrows, this rarely tells the truth. I mean, I could document the current process, and still wouldn't know how to fix a certain bug...

However, tl;dr, to have the chance to derive some actions:

I think that refactoring for clarity (2) is a necessity, if anything substantial is going to happen with the code at all.

This requires an understanding that could be achieved with (1), adding documentation (3), and keeping in mind the next-gen traversal (4) (because that's the ultimate goal that we should be moving towards - in steps, because there is not other option).

I tried to make some suggestions above, about "what the code could look like" after the refactoring. Other suggestions may be a bit hidden, in https://github.com/CesiumGS/cesium-native/pull/94#discussion_r547997889 , slack comments, and the branch that I mentioned in an earlier comment (and that's likely very out of sync in the meantime).

But to summarize a few points that I'd consider as crucial:

Shorter methods that do one thing only, with and fewer parameters (parameter objects where necessary)
Get rid of global state accesses and global state changes during traversal, i.e. the this->options and the three load queues....
Make the state clearer. When a Tile is in a certain Tile::LoadState (or a RasterOverlayTile is in a certain RasterOverlayTile::LoadState...), or a tile is in a certain TileSelectionState, it must be clear
- How did the tile get into this state?
- What does that state say about the properties/fields of the tile?
- What are the conditions for a tile going into another state?
- What is the (in the best case: single) method that changes that state?
- Which thread is causing the state change?
(This seemingly covers only the part that is stored as an explicit state variable, but also refers to "state-like functions", like isRenderable...)

If this is desired, some of that could be extracted into issues (like "Avoid accessing the options during the traversal"), but depending on the granularity and the actual task, this may not always be nice pairs of "An issue" and "A PR that fixes that issue".

kring commented 3 years ago

Creating a "high-level" documentation (on the level or a markdown file) for existing code (in hindsight) is hard, and almost certainly not really useful: While one might have an idealized idea about the process on the level of nice diagrams with boxes and arrows, this rarely tells the truth.

I don't think I agree here. I think the high-level documentation would be very useful and truthful enough, even if it misses some corner cases. Obviously it needs to be written by someone that already understands the algorithm pretty deeply.

Shorter methods that do one thing only, with and fewer parameters (parameter objects where necessary)

Shorter methods can absolutely be an aid to understanding, but not always. Certainly, when we can break out a bit of code into a short method with a clear name that will tell anyone reading it exactly what that method does, such that someone reading code that uses that method will immediately know what's happening there and won't have to go look at the new method, we're winning! We've just made our code easier to read.

On the other hand, if we break out a bit of code into a new method, but it's difficult to capture what that method does with a succinct name, we've actually hurt readability even though we've replaced one long method with two shorter ones. Anyone reading the code will need to jump around to understand what is happening, rather than reading a linear process in a linear fashion.

Similarly, parameter objects are great when we can capture a group of parameters together in a meaningful abstraction. But taking ten parameters and turning them into ten fields in a struct actually hurts readability rather than helping it, because someone reading the code will need to jump to the struct to see what is being passed.

My point is, this is a balance. More, shorter methods is not always better than than a smaller number of longer ones. The goal should be clarity not method brevity. I'm definitely not claiming the selection algorithm has this right everywhere, of course.

Get rid of global state accesses and global state changes during traversal, i.e. the this->options and the three load queues....

Why is it a problem to access options during the traversal? What would you do instead, pass the traversal-affecting options (which is most of them) through the recursive function calls?

Make the state clearer.

Sounds great. I've thought about having each state of a tile be represented by a separate object, e.g. in a std::variant or something. So that the current state is explicit, and the data associated with the tile in each state is clear as well.

Or perhaps take a more data-oriented design approach where a "tile" is just an ID, and there are separate slices of data associated with that ID based on its current state(s).

There are lots of details to work out, though.

I think that refactoring for clarity (2) is a necessity, if anything substantial is going to happen with the code at all.

This requires an understanding that could be achieved with (1), adding documentation (3), and keeping in mind the next-gen traversal (4) (because that's the ultimate goal that we should be moving towards - in steps, because there is not other option).

This sounds like a good plan. Some thoughts:

I should write the high-level documentation, and I think it would be unwise to start significant refactoring before this is in place.
I can't tell you exactly how the refactoring should proceed, but once you (or whoever) has done something I'm likely to have opinions on it, and it'll probably involve some rework and be really annoying (e.g. my rant about method length above). Sorry about that.
Beyond basic refactoring for clarity, I think we need to figure out some concrete, long term goals/requirements. For example, refactoring to allow different parts of the process to be pluggable/customizable could be really useful, or it could just be a bunch of architecture astronautics that makes that code more complicated and slower without any tangible benefits.
As always, we should consider the priority of this relative to everything else we have going on. My gut is that unless we're planning to make significant changes to the selection algorith in the short/medium term (see 3 above), or think we could get a significant performance improvement, or it would help fix significant bugs lurking beneath the surface (like the missing tile flicker for instance), it's hard to justify major changes to it. Our energy is far better invested delivering features that our users really need, rather than making the selection algorithm prettier.

javagl commented 3 years ago

if we break out a bit of code into a new method, but it's difficult to capture what that method does with a succinct name, we've actually hurt readability

Similarly, parameter objects are great when we can capture a group of parameters together in a meaningful abstraction. But taking ten parameters and turning them into ten fields in a struct actually hurts readability

If it's difficult to capture what the method does with a succinct name, or (as a special form of that) if it's difficult to "summarize" (and name) the data that a method operates on, then that's a clear indicator (for me) that something is wrong. And that's one of the reasons why writing doxgen docs is good, not only for the reader, but also for the writer: If a class/method cannot be described with few, clear, concise statemends, then that's the point where I tend to take a step back, squint my eyes, ask myself "So what's really going on there?", and take the refactoring to the next coarser of granularity.

But moving back to the actionable items:

I think the high-level documentation would be very useful and truthful enough,

It can be, for sure. I just mentioned that because 1. I've seen enough "diagrams", where people summarized complex applications (literally on the level of [Server] <--- communicates with ---> [Client]), and 2. the devil is in the detail.

The "detail" here refers to the states and state transitions. I'm not sure whether I understood the part of the std::variant correctly. It sounds like

class Tile {
    class TileBeforeLoading { Request request; }
    class TileAfterLoading { Response response; }

    std::variant<TileBeforeLoading, TileAfterLoading> state;
}

but that's probably not what you meant. (I still consider variant to be clumsy and UN-object-oriented in many ways...)

Why is it a problem to access options during the traversal? What would you do instead, pass the traversal-affecting options (which is most of them) through the recursive function calls?

I think we already talked quickly (but only in Slack, IIRC) about separating 1. the tileset and 2. the "traverser". I have a "strictly pragmatic" view on what a class should be, and what belongs into the class. And from that viewpoint, I'd say that something like bool preloadAncestors (and in fact, most other options) is simply not part of the Tileset (and its state). The tileset is the tileset. Whether or not ancestors should be preloaded is a question for the traversal.

And on a lower level: Whenever I see a bool that governs part of a behavior, then I strongly consider to model that via polymorphism. Using that specific example, in an overly suggestive form: I could not introduce a bool preloadAncestors without considering the alternative of having some preloaders.push_back(new AncestorPreloader()). I know, it sounds a bit controversial, but I can elaborate that if necessary. Taken too far, this can lead to "over-engineering". But the right amount of that makes the difference between "writing code that does something" and engineering.

The summary:

... probably involve some rework and be really annoying (e.g. my rant about method length above). Sorry about that.

When I have the choice between ...

a complex change/PR being merged after 10 minutes, with the comment "LGTM"
a profound, technical discussion about the pros and cons of the chosen approach, and possible alternatives

... then I'd not only prefer, but in some way actually expect the latter. Good software does not "happen automatically".

... Our energy is far better invested delivering features that our users really need, rather than making the selection algorithm prettier.

If a refactoring/simplification can lead to a state where the question of "Why does this tile flicker?" can be answered with "The reason can be A or B, and we can find the definite answer by setting a breakpoint", instead of spending several days with black-box-debugging, then that's a definite win. And if it brings us closer to a state where addressing one of the bullet points that I mentioned in the first post does not cause us to start sweating and think "Oh dear, we have to change something in the traversal!" (c.f. Lava Flow), but allows us to confidently ask: "Yeah, which additional feature do you want?", then that's also a win.

javagl commented 3 years ago

The fact that the ITileExcluder just randomly popped up in this commit directly to main, without any PR, planning or discussion, is very, very unfortunate.

First of all: Handling the list of excluders via the TilesetOptions is just not good, period. The TilesetOptions are an anti-object-oriented structure that should be abandoned as soon as possible. It is a "global state" that combines things that are completely unrelated. Having random accesses to different aspects of state, scattered through the whole traversal, makes the code far more complex than it should be, and severely reduces the maintainability and extensibility.

Where and how this excluder is queried is something that should be thought through more carefully. One could certainly make a point for having ITileExcluder instances that are queried during the traversal, and others that are only queried only once, when a tile is first encountered. (They could then store the result of the query, maybe in some sort of map<TileID, bool> (aka set<TileID>)). This could be done to avoid doing the (possibly complex) exclusion check repeatedly. And the polygon-based check is tremendously complex, so that should at least be considered.

^{Yes: What happens when the polygons change? Then the stored exclusion results will have to be invalidated. This has to be kept in mind, but should not be a blocker}

The ITileExcluder interface itself should have existed before even the first line of traversal code was written. The whole traversal is about "including" or "excluding" tiles, so even with zero knowledge about what has to be implemented there, one could define this interface and the code of the traversal based on that class. This could include what I'd call some "default infrastructure classes" (like a CompoundTileExcluder that combines several delegates, and returns true/false depending on the results of the delegates). And when examining that more closely, one might notice that the frustum culling is an ITileExcluder. And the fog is an ITileExcluder. And every other culling criterion is an ITileExcluder.

^{And sure, sometimes there may be additional information required, like the ViewState. Then introduce a IViewBasedTileExcluder and head on...}

If this was crafted more carefully, then the cullingChecks.push_back(myNewCriterion) that I sketched in a comment more than half a year ago could already be in reach. Now, we can do a tilesetOptions.excluders.push_back(rasterizedPolygonsTileExcluder), but it is offered in a form that will increase the efforts for everybody who wants to bring in further improvements or generalizations here, and make future refactorings more necessary and more complex at the same time.

kring commented 3 years ago

The fact that the ITileExcluder just randomly popped up in this commit directly to main

No, it was added in this PR: https://github.com/CesiumGS/cesium-native/pull/326

Needless to say I disagree with most of this, but thanks for your thoughts.

kring commented 2 years ago

Lots of interesting thoughts here, but since in its current form it's not something that can ever be actioned / resolved, I'm going to close this.

javagl commented 2 years ago

Quoting from another comment :

Everyone is encouraged to improve the quality of the code along these dimensions whenever they run into an opportunity to do so

When there is the chance to do a "small" improvement that opens rooms for further exploration (like replacing inlined code for one particular way of computing the SSE with an interface like SssCuller that could eventually allow the cullingChecks.push_back(myNewCriterion) that I mentioned earlier), then this ~~could~~ should be done.