XLSForm / pyxform

A Python package to create XForms for ODK Collect.
BSD 2-Clause "Simplified" License
77 stars 134 forks source link

Translation checks only run on columns used by first data row #637

Closed lognaturel closed 5 months ago

lognaturel commented 1 year ago

While taking a look at #601, I noticed the use of sheet_data[0] at

https://github.com/XLSForm/pyxform/blob/master/pyxform/validators/pyxform/missing_translations_check.py#L90

This means translation checks are only run on the first row of a form. Unfortunately, that will not always use all translation columns. For example, the following test fails because no warning is produced:

    def test_translation_checks__work_when_first_row_columns_not_all_populated(self):
        md = """
        | survey |          |          |            |            |
        |        | type     | name     | label::eng | hint       |
        |        | deviceid | deviceid |            |            |
        |        | note     | n1       | hello      | salutation |
        """
        warning = format_missing_translations_msg(
            _in={SURVEY: {"eng": ["hint"], "default": ["label"]}}
        )
        self.assertPyxformXform(
            name="test",
            md=md,
            warnings__contains=[warning],
        )

If the deviceid row of the form is omitted, the check works as intended.

I'm not entirely sure what to do about this. I think the ideal would be to check the actual column headers. I believe that would mean adding header validation to xlsx_to_dict/xls_to_dict/csv_to_dict or passing on header information from there. Those currently don't know about the different expected sheets (which seems like a good thing).

Another option would be to do the translation check while processing each row. That might not be great performance-wise but I'm not sure.

lindsay-stevens commented 1 year ago

translation checks are only run on the first row of a form

There was a performance concern (link) that lead to that. In that thread I noted having checked validating all items and it took "+/- a few ms at 5k questions".

check the actual column headers

I think I looked at getting the headers into workbook_to_json and it was one of those things that seems simple initially but kinda isn't (maybe in the context of the rest of #157).

lognaturel commented 1 year ago

Thanks, I had forgotten that conversation and that change.

it took "+/- a few ms at 5k questions".

That seems acceptable after all? Right now I expect most forms are missing out on this check. It's really common to get deviceid, starttime, etc at the start of a form and/or to start with a group or repeat. An alternative might be to check the first ~15 rows or something.

My current preference is to go back to checking every row. Maybe we could add the check to the main loop over all rows? I'm not entirely sure that would make a difference, even 5k is not a big number of list entries to loop over.

I think I looked at getting the headers into workbook_to_json and it was one of those things that seems simple initially but kinda isn't

I agree, that was what I set out to do with #601 and it would really require a significant restructuring.

lindsay-stevens commented 5 months ago

Since #666 the translations check uses the column headers, which addresses this bug. If it's desirable to warn users about any missing translation in any row, I think that's a different feature for a new ticket.