XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
77 stars 134 forks source link

Levenshtein distance check / warning for sheet name misspellings #544

Closed lindsay-stevens closed 2 years ago

lindsay-stevens commented 2 years ago

Closes #443 Closes #527

Why is this the best possible solution? Were any other approaches considered?

It implements the suggested heuristic for Levenshtein distance <= 2 to detect possible sheet name misspellings. It's possible that this will warn about sheet names for sheets that have auxilliary metadata that is not meant for processing by pyxform. For example the "osm" sheet name may be similar to many other initialisms. In that case, the warning prompts users to prefix such sheet names with an underscore to suppress the warning.

Other approaches could involve more complex spell check algorthims. The current Levenshtein distance algo is well documented and does a decent job at finding similar spellings. It seemed like overkill to add a module dependency just to have more advanced spell check.

What are the regression risks?

As mentioned above, some users may like to store extra metadata in similarly named sheets to those that are part of the SUPPORTED_SHEET_NAMES constant. In which case they would need to rename the sheets to suppress the warning, if the warning is a problem.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

I don't think a documentation change is needed.

Before submitting this PR, please make sure you have:

codecov-commenter commented 2 years ago

Codecov Report

Merging #544 (69c5299) into master (2e631b1) will decrease coverage by 0.54%. The diff coverage is 96.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #544      +/-   ##
==========================================
- Coverage   83.85%   83.31%   -0.55%     
==========================================
  Files          25       25              
  Lines        3747     3727      -20     
  Branches      875      871       -4     
==========================================
- Hits         3142     3105      -37     
- Misses        457      482      +25     
+ Partials      148      140       -8     
Impacted Files Coverage Δ
pyxform/constants.py 100.00% <ø> (ø)
pyxform/question.py 93.42% <ø> (ø)
pyxform/xls2json.py 75.97% <95.00%> (-2.59%) :arrow_down:
pyxform/utils.py 86.30% <100.00%> (+1.44%) :arrow_up:
pyxform/xls2json_backends.py 76.47% <100.00%> (-5.99%) :arrow_down:
pyxform/survey.py 93.49% <0.00%> (+0.76%) :arrow_up:
pyxform/xls2xform.py 81.81% <0.00%> (+2.59%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2e631b1...69c5299. Read the comment docs.

lindsay-stevens commented 2 years ago

Have you seen the messages when converting a real form? I am not seeing any of them.

Thanks for this review. It seems like I have put the spell check too far down the pipeline. As you mentioned, only supported sheet names would come through from the actual spreadsheet parsing step before it. I'll have another go and use end to end tests instead. This kind of issue is why I avoided using the PyxformTestCase - it's close to, but not fully, end to end.

lindsay-stevens commented 2 years ago

Hi @lognaturel - please see below change list. I think this addresses all of the above and what was discussed a week or so ago.

lognaturel commented 2 years ago

Apologies for the delay, here. Looks great. 🙏