XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
80 stars 136 forks source link

Use read-only openpyxl representation to reduce memory usage #596

Closed lognaturel closed 2 years ago

lognaturel commented 2 years ago

Closes #595

Why is this the best possible solution? Were any other approaches considered?

Even though #595 is not critical, this seems like a good change. We don't ever modify an Excel file so there's no point in opening it for writing.

There are slight differences between the writeable and read-only workbook representations I had to handle. The biggest decision I made was not to worry about the difference in dealing with trailing empty rows. That is, if there are empty rows in between some rows with contents, the two representations are the same. If there are empty rows after which there are no more rows with content, the read-only representation includes those empty rows whereas the writeable one does not. One of the equivalency tests between XLS and XLSX picked up on that difference. I tried dropping all empty rows but this would change the output for a rogue table-list feature that uses row number to name nodes (!). I verified that empty rows don't affect the output and decided to change the test XLSX instead of changing the code. That seems like the less risky option. I changed the XLSX and XLS files the same way for the group and specify_other forms. I deleted the rows with formatting but no contents.

What are the regression risks?

There could be some side effects to the different handling of empty rows and columns. I am not thinking of any but that would be good to get a second opinion on.

Before submitting this PR, please make sure you have:

lognaturel commented 2 years ago

I'm planning to do a pyxform release first thing tomorrow (GMT-7) and would like to include this if possible. @lindsay-stevens would be fantastic if you have time for a quick review. If you don't have time or think it's riskier than I believe, let me know and we can release without!

lognaturel commented 2 years ago

Sadly I don't understand why Windows tests are failing.

ERROR: Should find that XLSForm conversion produces itemsets.csv from external_choices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\pyxform\pyxform\tests\test_external_instances_for_selects.py", line 418, in test_itemset_csv_generated_from_external_choices
    self.assertEqual('"suburb","Footscray","vic","melbourne"\n', rows[-1])
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "D:\a\pyxform\pyxform\tests\utils.py", line 74, in get_temp_dir
    shutil.rmtree(temp_dir)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 516, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 400, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "C:\hostedtoolcache\windows\Python\3.7.9\x64\lib\shutil.py", line 398, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\tmp9ms014r1\\select_one_external.xlsx'
lindsay-stevens commented 2 years ago

Sadly I don't understand why Windows tests are failing.

I haven't tested but it could be that the workbook is still open, as mentioned here: https://openpyxl.readthedocs.io/en/latest/optimized.html

image

lognaturel commented 2 years ago

it could be that the workbook is still open

Thanks so much for helping me read the docs. That seemed like a likely culprit but either I'm closing in the wrong spot or something else is going on. Unfortunately I'm out of time for this tonight.

Having chatted with @yanokwa and Team Central, I think this fix should go in the pyxform release so I haven't released yet. We don't know how the extra columns get generated and it seems like the kind of thing that others would run into so best to patch it now. If you have a chance to review during your day, great. Ideas on the open files issue much appreciated! I'll come back to it tomorrow.

lindsay-stevens commented 2 years ago

I came up with a workaround for the Windows test failures, PR here to your branch. Otherwise, this PR seems good to go.

lognaturel commented 2 years ago

Nice working with you on this, @lindsay-stevens 🤝

lognaturel commented 2 years ago

From @lindsay-stevens on the PR to my fork here:

Just a note for posterity about my guess for antivirus. Main clue was the Windows "Resource Monitor" tab for Disk activity. This activity list shows active file paths and the PID and image name doing read/write on them. When running the tests, the file was being read by a process with image name "System" and description "NT Kernel & System". Not sure what else the system would be doing with files except to scan them - but whatever it's doing we can't exactly terminate the system process 💥

The same thing happened when I changed the temp root to pyxform/tests/test_output so it doesn't seem to be related to using the system/user temporary directory.