avoid processing vast empty areas of buggy / strange workbooks

lindsay-stevens commented 2 years ago

somehow it happens that spreadsheets sometimes think they have data for 100's or 1000's of rows/columns even though all cells are empty.
xls / xlsx processing refactored to stop processing after 20 adjacent empty columns (when getting the headers) or 20 adjacent empty rows (when reading rows). That is to say, the practice of inserting a couple of empty rows or columns for readability / formatting purposes won't ruin the form. It'd only be a problem if users insert patches of 20 or more empty columns / rows amongst the form design area.
added tests for xls / xlsx to show bad input processed as expected.
the change to not output empty rows seems to have broken the "flatxlsformtest" case which needs further investigation - seems to just be a question numbering issue and different itext lang order.

Closes #604 Might also address #611 but haven't checked yet.

Why is this the best possible solution? Were any other approaches considered?

It's not clear how / why these buggy spreadsheets come to be. From the associated tickets / forum threads it seems like re-saving with Excel may or may not work. It's possible there is a bug in xlrd or openpyxl to blame, or perhaps these libraries could handle these situations better, but in either case it may be hard to get an upstream fix considering the poor reproduce-ability. The most practical solution seemed to be to have pyxform only process the workbook sheets for as long as there seems to be data in the columns / rows.

What are the regression risks?

As mentioned above there's a broken test which seems at least in part to involve a mismatch from using the row number for a automatically generated question name. But if that's no good then we can still output the empty rows.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

It doesn't seem so at this stage.

Before submitting this PR, please make sure you have:

[x] included test cases for core behavior and edge cases in tests
[x] run nosetests and verified all tests pass
[x] run black pyxform tests to format code
[x] verified that any code or assets from external sources are properly credited in comments

lognaturel commented 2 years ago

Thanks, @lindsay-stevens! I think this is a reasonable approach. Did you also consider @yanokwa's suggestion at the bottom of https://github.com/XLSForm/pyxform/issues/604 to reset the dimensions above a certain reported column count?

lindsay-stevens commented 2 years ago

@lognaturel thanks for taking a look! The source for calculate_dimensions iterates through rows until a single break in truthiness, and uses the largest column index from cells indexed in those rows. A cell object can represent formatting info only (no value data). In the example UCL file the re-calculated column dimension at column AMJ (1024th) is from a break in the grey background colour formatting which then extends on forever from column AMK. The relevant XLSForm data goes to column P (16th) only.

So if a user inserts an empty row or column for formatting, but doesn't colour or style it, and the worksheet dimensions are still saved incorrectly somehow, then the recalc might cut off some data. In other words, it'd result in an extra iteration of the worksheet data (2nd time is to read values), and in doing so might iterate too far or not enough. The approach in this PR also guards against accidentally adding irrelevant data or styling way off in at the edges of the worksheet.

Maybe not as much of an issue for XLS (max cols 256 x rows 65536) but there doesn't seem to be an equivalent method in xlrd.

yanokwa commented 2 years ago

It might be nice to add a warning when do this? Just in case someone has some number of empty rows/columns?

lognaturel commented 2 years ago

Thanks for the additional explanation, @lindsay-stevens, the approach sounds good. I don't feel a strong need for a warning and I imagine it'd be a fair amount of plumbing so doesn't seem worth it.

I'm not sure about the logic and have commented inline. Otherwise the rest of the implementation and the tests look good.

I verified whether it might address #611 and unfortunately it does not. I verified that it does address all 3 forms from the thread in #604. 👍

If at all possible, it would be great to get this released by the end of the week!

lindsay-stevens commented 2 years ago

Thanks for the review @lognaturel. I've fixed up the algo so this could be merged. I'll have a look at #611 and open a docs PR for the blank col/row processing behaviour.

XLSForm / pyxform