Error when retrieving HIX value when the page content has only a video

MizukiTemma commented 1 month ago

Describe the Bug

HIX value cannot be retrieved when a page has only a video in its content.

Steps to Reproduce

1. Go to Testumgebung in the test system.
2. Go to the page "Integreat in Gebärdensprache (Video)".
3. See the error "HIX value could not be calculated. Please try again later."

This is reproducible locally: copy the source code of the page and paste it into a page in the local system, and you'll see `HIX benchmark API call failed: <HTTPError 400: 'Bad Request'>`.

Expected Behavior

No error

Actual Behavior

Error appears

Additional Information

See also #2917. If some words are added to the content, the HIX value is retrieved and saved successfully.

Traceback

``` ```

MizukiTemma commented 1 month ago

Solution from the issue grooming: do not send the text to TextLab if the page does not contain real text.

PeterNerlich commented 1 month ago

I think we could combine two conditions to decide whether a page is empty, contains only non-textual content, or contains content that TextLab should evaluate:

1. Extract the plain text: is it empty? The lxml library features HtmlElement.text_content(), a method that traverses all child nodes to extract the plain text. This is an inexpensive operation. For robustness, we should trim any leading and trailing whitespace before comparing to "".
2. Do any non-text tags exist? We could walk the parsed tree ourselves and check whether it contains any tags that we do not associate with text. I think it makes more sense to check for anything not expected to contain text than to check for a fixed list of specific tags like img and video, because an exhaustive list of the latter is much harder to compile. This might not hold if we find that our normalization step already limits content to a very small subset of possible tags. Text tags that quickly come to mind include: div, p, br, span, i, b, em, strong, a, blockquote, ul, ol, li, mark, s, sub, sup, del, ins
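
To make the two conditions concrete, here is a minimal sketch of how they could be combined. It is not the project's implementation: the names `classify_content`, `TEXT_TAGS` and the three return values are hypothetical, and it uses the standard library's `html.parser` instead of lxml so that it runs without dependencies. With lxml, the text check would be the `HtmlElement.text_content()` call mentioned above.

```python
from html.parser import HTMLParser

# Assumption: allow-list of tags we associate with text, taken from the list above.
TEXT_TAGS = frozenset({
    "div", "p", "br", "span", "i", "b", "em", "strong", "a", "blockquote",
    "ul", "ol", "li", "mark", "s", "sub", "sup", "del", "ins",
})


class _ContentScanner(HTMLParser):
    """Collects all text nodes and notes whether any non-text tag occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.has_non_text_tag = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags also end up here.
        if tag not in TEXT_TAGS:
            self.has_non_text_tag = True

    def handle_data(self, data):
        self.text_parts.append(data)


def classify_content(html):
    """Return "text", "non_textual" or "empty" (hypothetical labels)."""
    scanner = _ContentScanner()
    scanner.feed(html)
    scanner.close()  # flush any buffered trailing text
    # Condition 1: is there any plain text left after trimming whitespace?
    if "".join(scanner.text_parts).strip():
        return "text"
    # Condition 2: no text, but does a tag outside the allow-list exist?
    if scanner.has_non_text_tag:
        return "non_textual"
    return "empty"
```

Under this sketch, only content classified as `"text"` would be sent to TextLab; a video-only page would come back as `"non_textual"` and could skip the request entirely. Note that fallback text placed inside a `video` element would count as text here, which may or may not be desirable.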

digitalfabrik / integreat-cms