catmaid / CATMAID

Collaborative Annotation Toolkit for Massive Amounts of Image Data
https://catmaid.org
GNU General Public License v3.0

The confidence-compartment-subgraph returns way too much data in the JSON #1249

Open acardona opened 8 years ago

acardona commented 8 years ago

The premise of the "basic_graph" was to return a minimal representation. Now it returns an array per entry, containing the distribution of the synapses in each edge across their confidence values.
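For illustration, the shape change might look like the following (a sketch only: the exact field layout and the IDs are assumptions, not the actual response format):

```python
# Illustrative sketch of the shape change; field order and IDs are invented.

# Before: one aggregate synapse count per edge.
edge_before = [16015738, 16255270, 12]  # [source_skeleton, target_skeleton, n_synapses]

# After: the same edge with the count split across confidence values 1..5.
edge_after = [16015738, 16255270, [0, 1, 2, 4, 5]]

# A script that assumed a plain integer in the last entry now breaks and
# must aggregate instead:
n_synapses = sum(edge_after[2])
```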

While I understand the need for homogeneity in the responses, this change broke a number of scripts that use the default basic graph for analysis. Changing the structure of the JSON returned by a server function is not a good idea; instead, add a new function. The CATMAID client is not the only client anymore.

Additionally, there is little point in a gigantic JSON response containing a 5-entry array for every single synapse in the neuron when, in almost all cases, none of these values are displayed or used in any way. It adds overhead to the server, and overhead to the client for both processing and memory.

The general rule, when additional information is necessary to construct the graph, is to fetch that additional data; for example, when splitting neurons into axon and dendrite in the graph. For the confidence filter, this would mean fetching only when the confidence is changed to a value other than the default, and then storing the extra data for further filtering to other values. It would also mean fetching ONLY what is needed, as opposed to the current situation where confidence values are fetched for ALL connectors related to the loaded neurons. These data would be reloaded when pushing "Refresh". In other words: pay the price only when necessary.

I advocate for returning the basic representation from a separate REST API entry point, and for slimming down the current one to that basic graph when no filtering is to occur.
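As a hypothetical sketch of that separation (paths, module, and view names are invented for illustration; they are not CATMAID's actual routes):

```python
# Hypothetical Django URL configuration; none of these names are real
# CATMAID routes or views.
from django.urls import path

from . import graph_views  # hypothetical module

urlpatterns = [
    # Minimal, stable representation for external scripts:
    path('skeletons/graph/basic', graph_views.basic_graph),
    # Richer variant carrying per-confidence synapse counts:
    path('skeletons/graph/confidence', graph_views.confidence_compartment_subgraph),
]
```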

acardona commented 8 years ago

To add to the above: the graph widget's updateGraph function operates (or at least used to operate) under a filtering paradigm: if specific information exists (e.g., a skeleton belongs to a group), the graph fetched from the server is edited to reflect the change. For confidence, all that would then require is editing the weight of each edge using the stored values for that edge. Fetching the data would be triggered the first time the pulldown menu selects a confidence larger than 1 (by checking whether the object holding that data exists), and adding neurons to the graph would refetch it.
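A minimal sketch of that lazy-fetch-and-filter idea (all names here are hypothetical, not actual CATMAID client code, and written in Python rather than the widget's JavaScript):

```python
confidence_cache = {}  # (source, target) -> [synapse counts for confidence 1..5]

def filtered_weight(source, target, threshold, basic_count, fetch_confidences):
    """Weight of an edge counting only synapses with confidence >= threshold."""
    if threshold <= 1:
        # Default threshold: the basic graph's aggregate count suffices,
        # so no extra data is ever fetched.
        return basic_count
    key = (source, target)
    if key not in confidence_cache:
        # Pay the price only the first time a non-default filter is used;
        # "Refresh" (or adding neurons) would clear the cache and refetch.
        confidence_cache[key] = fetch_confidences(source, target)
    # counts[i] holds the number of synapses with confidence i + 1.
    return sum(confidence_cache[key][threshold - 1:])
```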

aschampion commented 8 years ago

For the sake of ticket context, part of your email:

I would like to revert eda88b295a9399, which introduced the confidence filter in the graph widget. While I haven't tested yet whether this commit broke the filtering of edges by number of synapses when neurons are split, I object to the way this feature is implemented.

...

Then I would like to implement synapse filtering by confidence in the following way:

  • When fetching edges, return each edge split into an array of five synapse counts, one per possible confidence value.
  • Supplement the split-neuron data with confidence values.

What was the rationale for implementing the confidence filtering the way you did, Andrew?

Do you foresee any issues with the way I propose to implement it?

At the moment the Graph Widget is broken in ways that hinder my ability to make figures for papers, so I must solve this ASAP. Additionally, the graph widget's "basic graph" server-side function is used by scripts that are now broken. Breaking that simple API seems unjustified when a new one could have been added or the existing one could have been extended to provide the additional information without breaking backwards compatibility.

As requested, graph-widget-confidence-removal is a branch from c9b295a with 776f784 and eda88b2 reverted. I'm assuming neuroncean is near c9b295a, so you can merge this easily to fix the acute problems, xor merge the #1251 fix below.

#1255, #1253, and #1252 are reproducible with this branch, so they are likely unrelated to these changes. #1251 is a simple fix in e33a2d8 (which can also be cherry-picked back to c9b295a without conflicts).

It adds overhead to the server, and overhead to the client for both processing and memory.

No additional rows are fetched from the DB; none of the queries are index-only scans, so selecting a non-indexed column (confidence) doesn't change their execution. The only overhead is the response list construction in Django (which was negligible in benchmarking), the size on the wire (which is already small, and the confidence values are trivially compressible in most cases), and parsing on the client. The last could be a concern, but given that it's only a 5x difference on graphs on the order of <100K entries, parsing is still on the order of milliseconds even for giant graphs, dwarfed by processing and render time. I checked all of these at the time of implementation.
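As a rough, independent illustration of that scale argument (this is not the original benchmark; the numbers are only indicative):

```python
import json
import random
import time

# Build ~100K fake edge entries, each with a 5-way confidence split.
edges = [[random.randrange(10**8), random.randrange(10**8),
          [random.randrange(10) for _ in range(5)]]
         for _ in range(100_000)]
payload = json.dumps(edges)

start = time.perf_counter()
json.loads(payload)
elapsed = time.perf_counter() - start

# On typical hardware this parses in tens of milliseconds, i.e. small
# compared to graph layout and rendering time.
print(f"{len(payload) / 1e6:.1f} MB parsed in {elapsed * 1e3:.0f} ms")
```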

The general rule, when additional information is necessary to construct the graph, is to fetch that additional data; for example, when splitting neurons into axon and dendrite in the graph. For the confidence filter, this would mean fetching only when the confidence is changed to a value other than the default, and then storing the extra data for further filtering to other values. It would also mean fetching ONLY what is needed, as opposed to the current situation where confidence values are fetched for ALL connectors related to the loaded neurons. These data would be reloaded when pushing "Refresh". In other words: pay the price only when necessary.

This is at odds with the above statement about overhead. The price has always already been paid: the relevant rows have been fetched in confidence-compartment-subgraph. In the current solution there is only the marginal cost of serializing 5 values per entry instead of 1, which, given the size of the IDs, is on average less than a 30% increase in response size in bytes.

Your proposed solution would pay that price not only again, but would pay the much higher price of fetching, decoding, arbor-assigning, and aggregating each synapse individually through compact-arbor. However, this does seem more compatible with how GroupGraph does its other filtering.

Breaking that simple API seems unjustified when a new one could have been added or the existing one could have been extended to provide the additional information without breaking backwards compatibility.

This is a fair point. I updated all existing consumers/callers, but I was not aware this endpoint was used outside of the codebase, especially since much of graph.py/graph2.py is dead code. The confidence-binned counts could simply be moved to entry 4 of the response, leaving entry 3 alone, or be activated only by a flag in the request. It at least seemed within the purview of a confidence-based compartment graph to also include confidence for connector annotations.
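A hypothetical sketch of the backward-compatible variant (not the actual graph.py code; the function and flag names are invented): entry 3 keeps its original meaning as the aggregate count, and the confidence bins are appended only on request.

```python
def serialize_edge(source, target, confidence_counts, with_confidence=False):
    """Build one edge entry for the JSON response (illustrative only).

    confidence_counts: 5 synapse counts, one per confidence value 1..5.
    """
    entry = [source, target, sum(confidence_counts)]  # what old clients expect
    if with_confidence:
        entry.append(confidence_counts)  # opt-in entry 4 with the binned counts
    return entry

# Old-style consumers keep working:
assert serialize_edge(1, 2, [0, 1, 2, 4, 5])[2] == 12
# Opted-in consumers get the extra data in entry 4 (index 3):
assert serialize_edge(1, 2, [0, 1, 2, 4, 5], with_confidence=True)[3] == [0, 1, 2, 4, 5]
```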

Ultimately I have no skin in this game, and the graph widget is your expertise. I chose what seemed to be the least invasive approach, one that injected filtering at the same locations as the synapse count, since that was the most semantically similar existing operation. However, I do not often work in GroupGraph, so if this is at odds with the rest of its filtering paradigm (which, after poking around updateGraph more, I can recognize), I will happily revert eda88b2, 776f784, and e33a2d8 in dev, and can either put a reimplementation in my backlog or leave it to you.

acardona commented 8 years ago

Andrew,

Thanks for the fixes and for preparing a branch with the confidence edits reverted; you really went the extra mile, and I appreciate that. Given time constraints, your solution is likely to stay, and as you point out, the performance cost is negligible for our current uses. Besides, it's easier to change the scripts than to deal with updateGraph.

Regarding your comment "In the current solution there is only the marginal cost of serializing 5 values per entry instead of 1": this is at odds with the current situation where, upon adding a single neuron to the graph widget, all of its synaptic partners are fetched. Before, only edges between loaded neurons would be loaded. So it went from a one- or two-line reply when loading two neurons to a ~4 KB reply full of data that is likely never used.

More generally, the updateGraph function has been in need of refactoring for years. One way to go about it would be to make it more evident where the filtering occurs, by creating functions that are called in sequence, similar to, e.g., the existing _regroup function. This would reduce the line count, encapsulating big chunks into thematic edits of the graph received from the server. Eventually the technical debt will have to be addressed if updateGraph is to grow any further.
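A minimal sketch of that pipeline style (in Python rather than the widget's JavaScript; every name other than _regroup is hypothetical):

```python
def apply_filters(graph, state, steps):
    """Run each filtering step in order; each returns an edited graph."""
    for step in steps:
        graph = step(graph, state)
    return graph

def filter_by_confidence(graph, state):
    """One thematic step: drop synapses below the selected confidence."""
    threshold = state.get("confidence", 1)
    graph["edges"] = [
        [src, tgt, sum(counts[threshold - 1:])]
        for src, tgt, counts in graph["edges"]
        if sum(counts[threshold - 1:]) > 0
    ]
    return graph

# updateGraph would then read as a short sequence of thematic edits, e.g.:
#   graph = apply_filters(graph, state,
#                         [split_axon_dendrite, filter_by_confidence, _regroup])
```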

aschampion commented 8 years ago

Regarding your comment "In the current solution there is only the marginal cost of serializing 5 values per entry instead of 1": this is at odds with the current situation where, upon adding a single neuron to the graph widget, all of its synaptic partners are fetched. Before, only edges between loaded neurons would be loaded. So it went from a one- or two-line reply when loading two neurons to a ~4 KB reply full of data that is likely never used.

I wasn't aware of this; thanks for pointing it out. I can see now why this would be annoying. I'll look into it tomorrow.