Linkages re-computed for every scenario (including baseline) for non-WALK modes

abyrd commented 5 years ago

Currently, for non-walk modes, linkages and distance tables are recomputed separately for every scenario (including the baseline scenario).

This is because the baseline is represented in requests as an empty scenario (rather than as the true baseline of no scenario at all), so no linkages or tables are ever created and saved for the true baseline. This "false baseline" is not treated as the root or basis of any other scenario. If a true baseline linkage were created, non-empty scenarios could build upon it and reuse its tables. Such baseline linkages would be subject to eviction, though, and evicting one would save no memory unless every non-empty scenario linkage built on it were evicted along with it, since those scenario linkages would hold references to most of its data. It might therefore be good to insert these baseline linkages into the non-evictable map alongside the baseline walk linkage.

Note: the baseline linkage for the walk mode is currently handled as a special case, with its data saved in fields on the TransportNetwork and loaded into the linkage cache when the network is loaded. This should really be generalized to handle all egress modes.
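A minimal sketch of the proposed structure, assuming a cache keyed by (scenario, mode) in which a null scenario id stands for the true baseline. It also assumes, as an illustration of fixing the "false baseline", that a scenario with zero modifications is routed to the true baseline. `LinkageKey`, `Linkage`, `LinkageCache`, and the `modificationCount` parameter are hypothetical names, not R5's actual classes. Scenario linkages share the baseline's large table by reference; baseline linkages live in a pinned, non-evictable map, and only the small scenario linkages are evicted.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key: a null scenarioId means "no scenario", i.e. the true baseline.
final class LinkageKey {
    final String scenarioId;
    final String mode;

    LinkageKey(String scenarioId, String mode) {
        this.scenarioId = scenarioId;
        this.mode = mode;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LinkageKey)) return false;
        LinkageKey k = (LinkageKey) o;
        return Objects.equals(scenarioId, k.scenarioId) && mode.equals(k.mode);
    }

    @Override
    public int hashCode() {
        return Objects.hash(scenarioId, mode);
    }
}

// Hypothetical linkage: a large base table shared by reference, plus a small scenario-specific delta.
final class Linkage {
    final int[] baseTable;
    final int[] scenarioDelta;

    Linkage(int[] baseTable, int[] scenarioDelta) {
        this.baseTable = baseTable;
        this.scenarioDelta = scenarioDelta;
    }
}

final class LinkageCache {
    // Baseline linkages are pinned: evicting one frees nothing while scenario linkages still reference its table.
    private final Map<LinkageKey, Linkage> pinned = new HashMap<>();
    // Scenario linkages are small deltas, evicted in least-recently-used order.
    private final Map<LinkageKey, Linkage> evictable;
    int fullBuilds = 0;

    LinkageCache(int maxEvictable) {
        evictable = new LinkedHashMap<LinkageKey, Linkage>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<LinkageKey, Linkage> eldest) {
                return size() > maxEvictable;
            }
        };
    }

    // Assumption: a scenario with no modifications is routed to the true baseline instead of getting its own linkage.
    Linkage get(String scenarioId, int modificationCount, String mode) {
        if (scenarioId == null || modificationCount == 0) {
            return pinned.computeIfAbsent(new LinkageKey(null, mode), k -> buildBaseline());
        }
        Linkage base = get(null, 0, mode);
        return evictable.computeIfAbsent(new LinkageKey(scenarioId, mode),
                k -> new Linkage(base.baseTable, new int[16])); // placeholder for the scenario's own entries
    }

    // Stands in for an expensive street-network linkage and distance-table build.
    private Linkage buildBaseline() {
        fullBuilds++;
        return new Linkage(new int[1_000_000], new int[0]);
    }
}
```

Here any number of scenarios for one mode trigger a single full build, and evicting a scenario linkage never touches the pinned baseline it depends on.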

It might also be advisable to address this problem at its root by no longer building baseline linkages and distance tables lazily, and instead requiring the user to build them intentionally, on demand, before any analysis can be performed with that direct or egress mode.

When the user asks for car linkage and distance tables to be built for the baseline case, these would be stored in a separate file rather than in the network file itself, so that linkages can be added after the network is built.
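A minimal sketch of "build intentionally, never lazily", assuming each explicitly built baseline linkage is written to its own file next to (not inside) the network file, and that every analysis checks for it up front. `BaselineLinkageStore`, `requireBuilt`, and the `<network>.<mode>.linkage` naming scheme are hypothetical; the byte payload stands in for whatever serialized linkage and distance tables would actually contain.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical store: one file per (network, mode), kept outside the network file itself.
final class BaselineLinkageStore {
    private final Path directory;
    private final String networkId;

    BaselineLinkageStore(Path directory, String networkId) {
        this.directory = directory;
        this.networkId = networkId;
    }

    // Hypothetical naming scheme, e.g. "net1.car.linkage".
    Path fileFor(String mode) {
        return directory.resolve(networkId + "." + mode.toLowerCase(Locale.ROOT) + ".linkage");
    }

    // Explicit, user-requested build; the bytes stand in for serialized linkage and distance tables.
    void build(String mode, byte[] serializedTables) {
        try {
            Files.write(fileFor(mode), serializedTables);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called before any analysis using this mode: fail fast instead of building lazily.
    void requireBuilt(String mode) {
        if (!Files.exists(fileFor(mode))) {
            throw new IllegalStateException("Baseline " + mode + " linkage has not been built for network "
                    + networkId + "; build it before running analyses with this mode.");
        }
    }
}
```

Failing fast turns a silent, lengthy lazy build into an immediate error the user can act on, and the separate file means new modes can be added without rebuilding the network.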

Other issues related to or caused by this problem: #448, #302, #472, #503

abyrd commented 5 years ago

Some additional notes based on a discussion today:

Eviction from the linkage cache may not be an effective long-term strategy for controlling memory consumption.
Currently, eviction does cap memory usage, because car/bike linkages do not properly reuse table entries the way walk linkages do, so each car/bike linkage contains a lot of its own data.
In general, though (for car linkages, or for bike/walk linkages once they reuse tables correctly), the cache weigher is not accurate:
The cache weight includes objects that will not be GC'ed when the enclosing linkage is GC'ed.
Memory savings vary depending on whether other linkages are reusing the objects.
With proper reuse of base linkage objects by scenario linkages, scenario linkages use at most a few MB.
The net effect of eviction will be to force linkage/table rebuilds without saving any memory.
Linkages for the same mode should mostly reuse references to the base linkage objects.
Evicting a linkage for a particular scenario does not (or should not) evict its base linkage.
Current big problem: non-walk linkages don't reuse objects at all (due to "false baseline" scenarios and lazy linkage building).
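A toy calculation of the weigher problem, assuming a linkage can be modeled as the list of `int[]` tables it references: a naive weigher charges a scenario linkage for the shared base table, but evicting it frees only its own delta while the baseline is still cached. The `Weighing` class and this list-of-arrays model are hypothetical simplifications, and sizes count only 4 bytes per `int`, ignoring object headers.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model: a linkage is the list of int[] tables it references.
final class Weighing {
    // What a naive weigher reports: every table the linkage references, whether shared or not.
    static long weight(List<int[]> linkage) {
        long bytes = 0;
        for (int[] table : linkage) bytes += 4L * table.length;
        return bytes;
    }

    // What eviction actually frees: only tables that no still-cached linkage references.
    static long reclaimable(List<int[]> evicted, List<List<int[]>> stillCached) {
        Set<int[]> referenced = Collections.newSetFromMap(new IdentityHashMap<>());
        for (List<int[]> linkage : stillCached) referenced.addAll(linkage);
        long bytes = 0;
        for (int[] table : evicted) {
            if (!referenced.contains(table)) bytes += 4L * table.length;
        }
        return bytes;
    }
}
```

With a 1,000,000-entry shared table and a 1,000-entry delta, the scenario linkage weighs 4,004,000 bytes, yet evicting it while the baseline stays cached frees only 4,000 bytes.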

ansoncfit commented 4 years ago

A recent report from a user reminded me of this issue. His observation was that our system seemed to have "trouble in the afternoons."

After we reviewed usage patterns and discussed them with the user, it turned out he was changing street modes and scenarios. Repeated changes would trigger the need for new linkages, and a worker could eventually get bogged down recomputing linkages after they had been evicted. A contributing factor to his sense of "trouble" was likely the lack of percent-complete progress reporting for linkages; such updates are provided for egress cost table building, but not for linking.

So, in the short term, better progress updates could help. Longer term, the optimizations discussed above would help address the root cause.
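A minimal sketch of percent-complete reporting for a linking loop, assuming linking proceeds point by point over a known total. `ProgressListener` and `Linker.linkAll` are hypothetical names; updates are throttled so a listener is called only when the whole-number percentage changes, rather than once per point.

```java
// Hypothetical callback for percent-complete updates, e.g. forwarded to the UI's task reporting.
interface ProgressListener {
    void onProgress(String task, int percentComplete);
}

final class Linker {
    // Links each point (placeholder work) and reports progress only when the whole-number percentage changes.
    static void linkAll(int totalPoints, ProgressListener listener) {
        int lastReported = -1;
        for (int i = 0; i < totalPoints; i++) {
            // ... link point i to the street network here ...
            int percent = (int) ((i + 1) * 100L / totalPoints);
            if (percent != lastReported) {
                listener.onProgress("Linking points", percent);
                lastReported = percent;
            }
        }
    }
}
```

For 1,000 points this emits 101 updates (0% through 100%), so even a very large linking job produces at most about a hundred messages.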

ansoncfit commented 4 years ago

Could also be addressed with task reporting #461

abyrd commented 2 years ago

We have reports of more related problems when users switch between 4-5 scenarios and two egress modes, all on the same bundle/network. A single worker machine goes through repeated cycles of eviction and rebuilding. These are large networks and grids, so that process is slow, and as mentioned in #473, progress is not reported for these secondary rebuilds. This silent delay can be very long and is confusing to the end user, who may try other settings and kick off more requests, exacerbating the problem.
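This cycling is the classic worst case for least-recently-used eviction: if requests rotate through more (scenario, egress mode) combinations than the cache can hold, each new request evicts exactly the linkage that will be needed soonest, so every request triggers a rebuild. The toy simulation below assumes a plain LRU policy with a fixed entry count, which is a simplification and may differ from the real weight-based cache's policy; `LruThrashDemo` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruThrashDemo {
    // Counts misses (i.e. rebuilds) when requests cycle through `distinctKeys` (scenario, mode)
    // combinations against a least-recently-used cache holding at most `capacity` linkages.
    static int rebuilds(int capacity, int distinctKeys, int rounds) {
        Map<Integer, Boolean> cache = new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> eldest) {
                return size() > capacity;
            }
        };
        int misses = 0;
        for (int round = 0; round < rounds; round++) {
            for (int key = 0; key < distinctKeys; key++) {
                if (cache.get(key) == null) {
                    misses++; // linkage was evicted (or never built): rebuild it
                    cache.put(key, Boolean.TRUE);
                }
            }
        }
        return misses;
    }
}
```

With 5 scenarios × 2 egress modes = 10 combinations and room for only 9, all 30 requests over three rounds are misses; with room for 10, only the first 10 are.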

Eviction was originally conceived and treated as a safeguard that would rarely be triggered, but with increased use of egress modes, multiple scenarios, zoom levels, etc., it is going to be triggered increasingly often.

Our workers have quite a lot of memory. I propose that we create an experimental worker branch that disables eviction and somehow improves reporting of low-memory conditions, then let well-informed users opt in to this worker. That way we can collect data on whether memory exhaustion is actually a problem that needs strict guardrails, or whether we get better subjective results by simply warning about it in the rare instances when it arises.
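A minimal sketch of the "warn instead of evict" half of that experiment, assuming the worker checks JVM heap usage with the standard `Runtime.totalMemory()`, `freeMemory()`, and `maxMemory()` methods before caching a new linkage, and logs a warning rather than evicting anything. The `MemoryWatch` class and the 0.9 threshold in the usage below are hypothetical choices.

```java
import java.util.Locale;

// Hypothetical "warn instead of evict" check a worker could run before caching a new linkage.
final class MemoryWatch {
    // Pure check, separated from Runtime so it can be tested: returns a warning message, or null if below the threshold.
    static String check(long usedBytes, long maxBytes, double warnFraction, String what) {
        double fraction = (double) usedBytes / maxBytes;
        if (fraction < warnFraction) return null;
        return String.format(Locale.ROOT,
                "Low memory: heap is %.0f%% full before caching %s; nothing will be evicted.",
                fraction * 100, what);
    }

    // Reads current JVM heap usage; maxMemory() is the heap limit (e.g. set by -Xmx).
    static String checkCurrentHeap(double warnFraction, String what) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return check(used, rt.maxMemory(), warnFraction, what);
    }
}
```

Because `totalMemory() - freeMemory()` includes garbage that has not been collected yet, this overestimates live data. It is a coarse signal, better suited to warnings and to collecting usage data than to strict guardrails.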

trevorgerhardt commented 2 years ago

Couldn't we first track and report memory usage data on its own, to determine whether we need those guardrails before experimenting with removing eviction?

To me, the solution discussed on 2022-06-15 for #473 (linkage/egress table serialization) still seems the best place to start, as it could speed up a variety of situations (e.g., starting up regional workers).
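A minimal load-or-build sketch of table serialization, assuming a table can be represented as a flat `int[]` and written with the JDK's `DataOutputStream`. The file layout and the `TableFiles` name are hypothetical, not R5's actual format; real linkages hold several related structures and would likely need a format tied to the specific network and scenario.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

// Hypothetical on-disk format: entry count followed by the entries, as big-endian ints.
final class TableFiles {
    static void write(Path file, int[] table) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(table.length);
            for (int value : table) out.writeInt(value);
        }
    }

    static int[] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(Files.newInputStream(file)))) {
            int[] table = new int[in.readInt()];
            for (int i = 0; i < table.length; i++) table[i] = in.readInt();
            return table;
        }
    }

    // Loads a saved table if one exists; otherwise builds it once and saves it, so later workers skip the build.
    static int[] loadOrBuild(Path file, Supplier<int[]> build) {
        try {
            if (Files.exists(file)) return read(file);
            int[] table = build.get();
            write(file, table);
            return table;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A freshly started worker pointed at an existing file pays only the cost of reading it, which is the kind of speed-up suggested above for regional workers.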

Additional reporting of progress (and consolidation of how we report those tasks) seems like the next least risky move. Reporting more progress data is usually a UX improvement regardless of other solutions.

conveyal / r5 #521