XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
77 stars 134 forks source link

avoid processing vast empty areas of buggy / strange workbooks #612

Closed lindsay-stevens closed 2 years ago

lindsay-stevens commented 2 years ago

Closes #604 Might also address #611 but haven't checked yet.

Why is this the best possible solution? Were any other approaches considered?

It's not clear how / why these buggy spreadsheets come to be. From the associated tickets / forum threads it seems like re-saving with Excel may or may not work. It's possible there is a bug in xlrd or openpyxl to blame, or perhaps these libraries could handle these situations better, but in either case it may be hard to get an upstream fix considering the poor reproduce-ability. The most practical solution seemed to be to have pyxform only process the workbook sheets for as long as there seems to be data in the columns / rows.

What are the regression risks?

As mentioned above there's a broken test which seems at least in part to involve a mismatch from using the row number for a automatically generated question name. But if that's no good then we can still output the empty rows.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

It doesn't seem so at this stage.

Before submitting this PR, please make sure you have:

lognaturel commented 2 years ago

Thanks, @lindsay-stevens! I think this is a reasonable approach. Did you also consider @yanokwa's suggestion at the bottom of https://github.com/XLSForm/pyxform/issues/604 to reset the dimensions above a certain reported column count?

lindsay-stevens commented 2 years ago

@lognaturel thanks for taking a look! The source for calculate_dimensions iterates through rows until a single break in truthiness, and uses the largest column index from cells indexed in those rows. A cell object can represent formatting info only (no value data). In the example UCL file the re-calculated column dimension at column AMJ (1024th) is from a break in the grey background colour formatting which then extends on forever from column AMK. The relevant XLSForm data goes to column P (16th) only.

So if a user inserts an empty row or column for formatting, but doesn't colour or style it, and the worksheet dimensions are still saved incorrectly somehow, then the recalc might cut off some data. In other words, it'd result in an extra iteration of the worksheet data (2nd time is to read values), and in doing so might iterate too far or not enough. The approach in this PR also guards against accidentally adding irrelevant data or styling way off in at the edges of the worksheet.

Maybe not as much of an issue for XLS (max cols 256 x rows 65536) but there doesn't seem to be an equivalent method in xlrd.

yanokwa commented 2 years ago

It might be nice to add a warning when do this? Just in case someone has some number of empty rows/columns?

lognaturel commented 2 years ago

Thanks for the additional explanation, @lindsay-stevens, the approach sounds good. I don't feel a strong need for a warning and I imagine it'd be a fair amount of plumbing so doesn't seem worth it.

I'm not sure about the logic and have commented inline. Otherwise the rest of the implementation and the tests look good.

I verified whether it might address #611 and unfortunately it does not. I verified that it does address all 3 forms from the thread in #604. 👍

If at all possible, it would be great to get this released by the end of the week!

lindsay-stevens commented 2 years ago

Thanks for the review @lognaturel. I've fixed up the algo so this could be merged. I'll have a look at #611 and open a docs PR for the blank col/row processing behaviour.