dgorissen / pycel

A library for compiling excel spreadsheets to python code & visualizing them as a graph
GNU General Public License v3.0
573 stars 152 forks source link

Handle worksheet formulas with circular references #57

Closed GeoJosh closed 5 years ago

GeoJosh commented 5 years ago

Added safeguards to prevent recursion-based crashes for worksheets that contain circular references. A workbook has been included that exhibits this behavior and proper calculations are detected. Unpublished tests also pass for internal documents.

stephenrauch commented 5 years ago

Thanks for taking the time to make this PR. I will do a more detailed review later, but for now I would suggest this functionality should try not to put much overhead in the more "normal" paths. If my inspection so far is correct, this code visits all cells in the map even if evaluating only one cell. This seems expensive.

I would suggest that a better solution may be available by explicitly finding and calculating cycles.

GeoJosh commented 5 years ago

The intention of the PR is to only iterate an internal counter for each relevant cell in a function's execution path during its evaluation phase. The only time all the cells are touched is before the evaluation to ensure that all of the internal counters have been reset. While pre-evaluation of the network graph may provide some insight into more efficiently handling formulas with circular references I think the method for generating code would need to drastically altered to account for an iterative calculation methodology vs the current traverse-first code generated.

Unlike Excel's evaluation code the generated lambda code for the functions results in a traverse-first execution path before the cells are evaluated (vs eval-first traversal) and some testing had to be done to figure out how Excel initialized the values once the iteration limit is reached. This may result is a mismatch for more complicated nested functionality, but I don't have an example of that at this time.

Thank you for your review and any suggestions for further modification or implementation.

stephenrauch commented 5 years ago

I realize all of the cells are only visited before the evaluation, but they are all visited. I have sheets with 10,000's of cells. To visit all cells, to evaluate one cell, seems less than optimum. So, if we are going to put in extra work to deal with graph cycles, I think the work should be either be during the compile, or when explicitly evaluating cycles.

Excel does the same. It requires that we explicitly enable iterative evaluation. The performance challenges associated with iterative evaluation are not imposed on everyone.

stephenrauch commented 5 years ago

Thanks again for this submission. I have merged it, but I merged another change right after it which implements a more complete solution and lowers the overhead for the non-circular case. The change in #58 has a different API than one I merged here. Except for the API change it should be a super-set of the change here.