XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
77 stars 134 forks source link

Issue 157 warn about missing translations #564

Closed KeynesYouDigIt closed 2 years ago

KeynesYouDigIt commented 2 years ago

Closes #157

Why is this the best possible solution? Were any other approaches considered?

This approach ensures that the warning work has been done along side the work that puts in placeholders for missing translations. It is focused on making sure we never re-evaluate the form for missing languages but uses what is already implied in the form build.

What are the regression risks?

Last time I attempted to solve this issue, I assumed the existence of certain pieces of the path string and tried to build warning messages using those pieces. When the warnings function tried to parse paths without what I assumed to be present, it broke. I do not believe this risk is still present, but I do assume things like path and content type are not null in this function. This feels like a safe assumption, so I dont think the risks are great.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

No

Before submitting this PR, please make sure you have:

KeynesYouDigIt commented 2 years ago

@lindsay-stevens Thanks so much for that big PR fixing up the tests!!! it got CI to pass!

@lognaturel I was able to clean up and shorten this PR, Hoping it can get a look soon?

codecov-commenter commented 2 years ago

Codecov Report

Merging #564 (ba6329e) into master (6ca3484) will decrease coverage by 0.01%. The diff coverage is 90.32%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #564      +/-   ##
==========================================
- Coverage   84.20%   84.18%   -0.02%     
==========================================
  Files          25       25              
  Lines        3672     3687      +15     
  Branches      865      871       +6     
==========================================
+ Hits         3092     3104      +12     
- Misses        440      441       +1     
- Partials      140      142       +2     
Impacted Files Coverage Δ
pyxform/survey.py 93.40% <85.71%> (-0.39%) :arrow_down:
pyxform/survey_element.py 94.40% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 6ca3484...ba6329e. Read the comment docs.

lindsay-stevens commented 2 years ago

@KeynesYouDigIt If I understand the issue correctly, the goal is to warn users when there's one or more "missing" translation columns in the survey (or choices) sheet. A "missing" translation is one where that language appears for one or more of the translatable content types (label, media, etc) specified on the form, but not all translatable content types specified on the form. If no translations specify a certain type then e.g. it's no problem if there is no French image if no other languages have an image column.

I would expect to find this kind of warning near the top of the processing pipeline, at the level of xls2json.py, where the parsed workbook sheets are processed separately and numerous structural validations of this nature are applied. For example for the survey sheet here, and the choices sheet here. In particular the function dealias_and_group_headers is something along these lines for the survey sheet. The current approach appears to be much lower in the pipeline, (possibly?) reverse engineering the output XPath which seems likely to be error prone to implement and fragile to maintain. Could you please clarify the current approach and whether the above xls2json.py locations were considered?

About the refactoring in SurveyElement.get_translations. Before a refactor like this I would prefer to have unit tests on the block to establish that the before/after produce the same result and ensure regressions are not introduced. At the moment the refactor does not appear to be equivalent, mainly on the point that the old code for guidance_hint only wants "guidance_hint" in self.keys() (i.e. it may be blank) but the new code wants self.get("guidance_hint") to be truthy (i.e. it's not blank). It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?

About the introduction of the internal__warnings set. I'm not clear on what this is for, since the output does not appear to be used directly. It's added to the existing warnings list in the end, so why not use warnings only?

Lastly for me it would help with reviews if the commit messages for material changes were a bit more detailed, or included in-line comments. For example in the commit to list on to json I'm not clear why this change is needed - what it might be fixing or improving. Or in better is not dict check why type is considered better than isinstance - noting that some pyxform classes subclass dict, and isinstance considers subclasses while type doesn't.

KeynesYouDigIt commented 2 years ago

@lindsay-stevens

Thank you so much for the comment and feedback! I have been needing some guidance and collaboration badly here, as you can see :smile:

Sorry in advance for so much text, the TL:DR is:

Please feel free to @ me on slack if you would rather decline the PR and move me to a simpler or more urgent set of work (I'll defer to you since you use this code much more in your day-to-day work), or if you want to chat more about what I've done here. I mostly only have time for this on weekends, but could get away from work on Fridays, Thursdays, and maybe Tuesdays.

Longer version -

Could you please clarify the current approach and whether the above xls2json.py locations were considered?

I considered and even preferred the upstream approach here, and to be honest now that you bring it up again I think you might be correct. But let me explain why I went downstream:

What caused me to pick this downstream approach is that we actually handle the empty translations themselves downstream (def _add_empty_translations). Since the "state" of a missing translation is detected and handled at this point, it seems redundant to me to detect it and warn about it earlier. You are right that it forces us to reverse engineer the xpath, which I thought would be easier than it turned out to be :/

I am going to spend some time today reconsidering that upstream path again. What I would really like to do is move the translations upstream - when I first had that thought it sounded like too big of a change, but looking at it again I think it would at least be worth spending a few hours seeing what that would look like. Now, if we calculate missing translations upstream, but only create translations downstream, that feels redundant, but I suppose it might not be. (and redundancy might be better than reverse engineering?)

It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?

It is, at my company we follow the "boy scout rule" (leave it better than you found it) as a way of packing refactors into PRs. I think the tests cover the case you had mentioned, but I am happy to just back it out for simplicity's sake - the "boy scout rule" might doesn't make sense for this project? (small team, less frequent updates, testing that's more situation-ally driven and doesn't cover small units of work like this particular chunk of code)

so why not use warnings only?

I don't (think?) ther'es a way to access that warnings object in the "guts" of where I do this work (this is another hazard of the downstream approach I am using). I do think its a bit odd that the internal workings can't pass information back up to the user, so while the approach of adding another mutable/stateful attribute was hardly ideal, this was the quickest way I could make it work.

If I can make the more "upstream" approach work we can remove this for now. My guess is that we will need something that achieves the same later? but I could be wrong. This is a pretty mature codebase and it hasn't needed it before :shrug:

KeynesYouDigIt commented 2 years ago

@lindsay-stevens

Reviewing the sections you linked me to - dealias_and_group_headers is very agnostic as to what the "tokens" it works with actually are, which makes it hard to detect if the token is a language. It also works row-by-row, but translations are added column-by-column, so we would have to review the aggregated data again.

I could certainly add something like find_uneven_translations(out_dict_array) at the end of that function, but I am still worried I'd duplicating work with the stuff done in the Survey class (see _add_empty_translations and _setup_translations ).

I haven't done enough work to see if its possible, but what do you think of trying to move some or all of the work in _setup_translations and/or _add_empty_translations to this upstream point? Perhaps we could build JSON elements that are much easier to work with down stream. Before I spend time trying to code a proof of concept, is that what you were getting at with your earlier comment? Or are we strictly trying to detect unbalanced translations and move on? Again, I think that would require duplication of logic, but I am not sure.

I'll keep mulling this over in my mind, still not sure what the best way forward is, LMK if you have any thoughts or if I've missed something here.

lindsay-stevens commented 2 years ago

Hi @KeynesYouDigIt

The dealias_and_group_headers reference was meant as more of an example of examining the XLSForm data and processing/validating. To implement this I think a new function would be appropriate, particularly since dealias_and_group_headers is used by multiple sheets (Settings, Survey, Choices, External Choices, OSM) which don't all have translatable columns.

There are finite translatable column types, as noted in this comment. So after splitting column names on ::, any columns not in a 'translatable' list / mapping could be ignored. Even if it's not currently possible to determine with all certainty what all the translatable columns are, the referenced comment identifies what are likely the most commonly used ones, so the warning would still be valuable.

About the duplication concerns, I think these may overlap code-wise but the purpose is different. The goal for #157 is a warning to the user, which they may ignore at their peril. The existing _add_empty_translations seems to be about structural integrity in the XForm XML. For example if I comment out the body of _add_empty_translations, then in pyxform_test_case.py set run_odk_validate = kwargs.get("run_odk_validate", True), then run all the tests, there are 5 failures from errors from ODK Validate.

I would recommend aiming to address #157 as directly as possible. If it looks like a refactor is needed or potentially beneficial, then please create a separate issue/PR to describe/discuss/implement it. My concern is that there aren't many tests around translations behaviour, so embarking on that at the same time could stymie the PR or cause regressions.

lognaturel commented 2 years ago

Thanks @KeynesYouDigIt for sticking with this one and exploring from multiple angles. I'd like to propose @lindsay-stevens take this one over and put up a new PR. Then you can use what you've learned to do an initial review and evaluate his approach.

As @lindsay-stevens says, if there's an area you feel really needs some refactor, please file an issue for now.

lindsay-stevens commented 2 years ago

@lognaturel sounds good I'll take a look at this in a new PR.

KeynesYouDigIt commented 2 years ago

@lindsay-stevens / @lognaturel Sounds good to me as well, thanks for all the help!

lindsay-stevens commented 2 years ago

closing in favour of #571