Closed KeynesYouDigIt closed 2 years ago
@lindsay-stevens Thanks so much for that big PR fixing up the tests!!! it got CI to pass!
@lognaturel I was able to clean up and shorten this PR, Hoping it can get a look soon?
Merging #564 (ba6329e) into master (6ca3484) will decrease coverage by
0.01%
. The diff coverage is90.32%
.
@@ Coverage Diff @@
## master #564 +/- ##
==========================================
- Coverage 84.20% 84.18% -0.02%
==========================================
Files 25 25
Lines 3672 3687 +15
Branches 865 871 +6
==========================================
+ Hits 3092 3104 +12
- Misses 440 441 +1
- Partials 140 142 +2
Impacted Files | Coverage Δ | |
---|---|---|
pyxform/survey.py | 93.40% <85.71%> (-0.39%) |
:arrow_down: |
pyxform/survey_element.py | 94.40% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 6ca3484...ba6329e. Read the comment docs.
@KeynesYouDigIt If I understand the issue correctly, the goal is to warn users when there's one or more "missing" translation columns in the survey (or choices) sheet. A "missing" translation is one where that language appears for one or more of the translatable content types (label, media, etc) specified on the form, but not all translatable content types specified on the form. If no translations specify a certain type then e.g. it's no problem if there is no French image if no other languages have an image column.
I would expect to find this kind of warning near the top of the processing pipeline, at the level of xls2json.py
, where the parsed workbook sheets are processed separately and numerous structural validations of this nature are applied. For example for the survey sheet here, and the choices sheet here. In particular the function dealias_and_group_headers is something along these lines for the survey sheet. The current approach appears to be much lower in the pipeline, (possibly?) reverse engineering the output XPath which seems likely to be error prone to implement and fragile to maintain. Could you please clarify the current approach and whether the above xls2json.py
locations were considered?
About the refactoring in SurveyElement.get_translations. Before a refactor like this I would prefer to have unit tests on the block to establish that the before/after produce the same result and ensure regressions are not introduced. At the moment the refactor does not appear to be equivalent, mainly on the point that the old code for guidance_hint only wants "guidance_hint" in self.keys()
(i.e. it may be blank) but the new code wants self.get("guidance_hint")
to be truthy (i.e. it's not blank). It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?
About the introduction of the internal__warnings
set. I'm not clear on what this is for, since the output does not appear to be used directly. It's added to the existing warnings
list in the end, so why not use warnings
only?
Lastly for me it would help with reviews if the commit messages for material changes were a bit more detailed, or included in-line comments. For example in the commit to list on to json
I'm not clear why this change is needed - what it might be fixing or improving. Or in better is not dict check
why type
is considered better than isinstance
- noting that some pyxform classes subclass dict
, and isinstance
considers subclasses while type
doesn't.
@lindsay-stevens
Thank you so much for the comment and feedback! I have been needing some guidance and collaboration badly here, as you can see :smile:
Sorry in advance for so much text, the TL:DR is:
to list on to json
was added to support serializing the internalwarnings. Please feel free to @ me on slack if you would rather decline the PR and move me to a simpler or more urgent set of work (I'll defer to you since you use this code much more in your day-to-day work), or if you want to chat more about what I've done here. I mostly only have time for this on weekends, but could get away from work on Fridays, Thursdays, and maybe Tuesdays.
Longer version -
Could you please clarify the current approach and whether the above xls2json.py locations were considered?
I considered and even preferred the upstream approach here, and to be honest now that you bring it up again I think you might be correct. But let me explain why I went downstream:
What caused me to pick this downstream approach is that we actually handle the empty translations themselves downstream (def _add_empty_translations
). Since the "state" of a missing translation is detected and handled at this point, it seems redundant to me to detect it and warn about it earlier. You are right that it forces us to reverse engineer the xpath, which I thought would be easier than it turned out to be :/
I am going to spend some time today reconsidering that upstream path again. What I would really like to do is move the translations upstream - when I first had that thought it sounded like too big of a change, but looking at it again I think it would at least be worth spending a few hours seeing what that would look like. Now, if we calculate missing translations upstream, but only create translations downstream, that feels redundant, but I suppose it might not be. (and redundancy might be better than reverse engineering?)
It seems a bit peripheral to this PR anyway so maybe it should be moved to a new PR?
It is, at my company we follow the "boy scout rule" (leave it better than you found it) as a way of packing refactors into PRs. I think the tests cover the case you had mentioned, but I am happy to just back it out for simplicity's sake - the "boy scout rule" might doesn't make sense for this project? (small team, less frequent updates, testing that's more situation-ally driven and doesn't cover small units of work like this particular chunk of code)
so why not use warnings only?
I don't (think?) ther'es a way to access that warnings object in the "guts" of where I do this work (this is another hazard of the downstream approach I am using). I do think its a bit odd that the internal workings can't pass information back up to the user, so while the approach of adding another mutable/stateful attribute was hardly ideal, this was the quickest way I could make it work.
If I can make the more "upstream" approach work we can remove this for now. My guess is that we will need something that achieves the same later? but I could be wrong. This is a pretty mature codebase and it hasn't needed it before :shrug:
@lindsay-stevens
Reviewing the sections you linked me to - dealias_and_group_headers is very agnostic as to what the "tokens" it works with actually are, which makes it hard to detect if the token is a language. It also works row-by-row, but translations are added column-by-column, so we would have to review the aggregated data again.
I could certainly add something like find_uneven_translations(out_dict_array)
at the end of that function, but I am still worried I'd duplicating work with the stuff done in the Survey class (see _add_empty_translations
and _setup_translations
).
I haven't done enough work to see if its possible, but what do you think of trying to move some or all of the work in _setup_translations and/or _add_empty_translations to this upstream point? Perhaps we could build JSON elements that are much easier to work with down stream. Before I spend time trying to code a proof of concept, is that what you were getting at with your earlier comment? Or are we strictly trying to detect unbalanced translations and move on? Again, I think that would require duplication of logic, but I am not sure.
I'll keep mulling this over in my mind, still not sure what the best way forward is, LMK if you have any thoughts or if I've missed something here.
Hi @KeynesYouDigIt
The dealias_and_group_headers
reference was meant as more of an example of examining the XLSForm data and processing/validating. To implement this I think a new function would be appropriate, particularly since dealias_and_group_headers
is used by multiple sheets (Settings, Survey, Choices, External Choices, OSM) which don't all have translatable columns.
There are finite translatable column types, as noted in this comment. So after splitting column names on ::
, any columns not in a 'translatable' list / mapping could be ignored. Even if it's not currently possible to determine with all certainty what all the translatable columns are, the referenced comment identifies what are likely the most commonly used ones, so the warning would still be valuable.
About the duplication concerns, I think these may overlap code-wise but the purpose is different. The goal for #157 is a warning to the user, which they may ignore at their peril. The existing _add_empty_translations
seems to be about structural integrity in the XForm XML. For example if I comment out the body of _add_empty_translations
, then in pyxform_test_case.py
set run_odk_validate = kwargs.get("run_odk_validate", True)
, then run all the tests, there are 5 failures from errors from ODK Validate.
I would recommend aiming to address #157 as directly as possible. If it looks like a refactor is needed or potentially beneficial, then please create a separate issue/PR to describe/discuss/implement it. My concern is that there aren't many tests around translations behaviour, so embarking on that at the same time could stymie the PR or cause regressions.
Thanks @KeynesYouDigIt for sticking with this one and exploring from multiple angles. I'd like to propose @lindsay-stevens take this one over and put up a new PR. Then you can use what you've learned to do an initial review and evaluate his approach.
As @lindsay-stevens says, if there's an area you feel really needs some refactor, please file an issue for now.
@lognaturel sounds good I'll take a look at this in a new PR.
@lindsay-stevens / @lognaturel Sounds good to me as well, thanks for all the help!
closing in favour of #571
Closes #157
Why is this the best possible solution? Were any other approaches considered?
This approach ensures that the warning work has been done along side the work that puts in placeholders for missing translations. It is focused on making sure we never re-evaluate the form for missing languages but uses what is already implied in the form build.
What are the regression risks?
Last time I attempted to solve this issue, I assumed the existence of certain pieces of the
path
string and tried to build warning messages using those pieces. When the warnings function tried to parse paths without what I assumed to be present, it broke. I do not believe this risk is still present, but I do assume things like path and content type are not null in this function. This feels like a safe assumption, so I dont think the risks are great.Does this change require updates to documentation? If so, please file an issue here and include the link below.
No
Before submitting this PR, please make sure you have:
tests_v1
nosetests
and verified all tests passblack pyxform
to format code