lindsay-stevens commented 2 years ago

Closes #443 Closes #527

Why is this the best possible solution? Were any other approaches considered?

It implements the suggested heuristic for Levenshtein distance <= 2 to detect possible sheet name misspellings. It's possible that this will warn about sheet names for sheets that have auxilliary metadata that is not meant for processing by pyxform. For example the "osm" sheet name may be similar to many other initialisms. In that case, the warning prompts users to prefix such sheet names with an underscore to suppress the warning.

Other approaches could involve more complex spell check algorthims. The current Levenshtein distance algo is well documented and does a decent job at finding similar spellings. It seemed like overkill to add a module dependency just to have more advanced spell check.

What are the regression risks?

As mentioned above, some users may like to store extra metadata in similarly named sheets to those that are part of the SUPPORTED_SHEET_NAMES constant. In which case they would need to rename the sheets to suppress the warning, if the warning is a problem.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

I don't think a documentation change is needed.

Before submitting this PR, please make sure you have:

[x] included test cases for core behavior and edge cases in tests_v1
[x] run nosetests and verified all tests pass
[x] run black pyxform to format code
[x] verified that any code or assets from external sources are properly credited in comments

codecov-commenter commented 2 years ago

Codecov Report

Merging #544 (69c5299) into master (2e631b1) will decrease coverage by 0.54%. The diff coverage is 96.49%.

@@            Coverage Diff             @@
##           master     #544      +/-   ##
==========================================
- Coverage   83.85%   83.31%   -0.55%     
==========================================
  Files          25       25              
  Lines        3747     3727      -20     
  Branches      875      871       -4     
==========================================
- Hits         3142     3105      -37     
- Misses        457      482      +25     
+ Partials      148      140       -8

Impacted Files	Coverage Δ
pyxform/constants.py	`100.00% <ø> (ø)`
pyxform/question.py	`93.42% <ø> (ø)`
pyxform/xls2json.py	`75.97% <95.00%> (-2.59%)`	:arrow_down:
pyxform/utils.py	`86.30% <100.00%> (+1.44%)`	:arrow_up:
pyxform/xls2json_backends.py	`76.47% <100.00%> (-5.99%)`	:arrow_down:
pyxform/survey.py	`93.49% <0.00%> (+0.76%)`	:arrow_up:
pyxform/xls2xform.py	`81.81% <0.00%> (+2.59%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2e631b1...69c5299. Read the comment docs.

lindsay-stevens commented 2 years ago

Have you seen the messages when converting a real form? I am not seeing any of them.

Thanks for this review. It seems like I have put the spell check too far down the pipeline. As you mentioned, only supported sheet names would come through from the actual spreadsheet parsing step before it. I'll have another go and use end to end tests instead. This kind of issue is why I avoided using the PyxformTestCase - it's close to, but not fully, end to end.

lindsay-stevens commented 2 years ago

Hi @lognaturel - please see below change list. I think this addresses all of the above and what was discussed a week or so ago.

update test style from dict + workbook_to_json to assertPyxformXform
update spelling message to format candidates as quoted strings
add all sheet names to workbook_dict so they can be used for spell check but otherwise only normally process supported sheets
apply spell check rules:
- only run spell check for choices or external_choices if required
- always run spell check for survey (required) and settings (optional)
- don't run spell check for OSM sheet
remove code related to deprecated sheets cascading_choices and columns
add end to end tests for the warnings, and inclusion of all sheet names in initial workbook dict
add test cases for various scenarios with each sheet name with different assertions as per above rules

lognaturel commented 2 years ago

Apologies for the delay, here. Looks great. 🙏

XLSForm / pyxform

Levenshtein distance check / warning for sheet name misspellings #544

Why is this the best possible solution? Were any other approaches considered?

What are the regression risks?

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Before submitting this PR, please make sure you have:

Codecov Report