Closed MizukiTemma closed 1 month ago
Solution from the issue grooming: do not send the text to TextLab if the page does not contain real text.
I think we could use a combination of two conditions to determine whether we regard a page as empty, as containing only non-textual content, or as content to be evaluated by TextLab:
- `HtmlElement.text_content()`, a method that traverses all child nodes to extract the plain text. This is an inexpensive operation. We should trim any leading and trailing whitespace before comparing to `""` for robustness.
- The presence of non-textual tags such as `img` and `video`, rather than the absence of textual tags, because I think an exhaustive list of the latter is much harder to make. This might not hold true if we find that we already limit our content to a very small subset of possible tags during our normalization step.
Tags coming to mind quickly include: `div`, `p`, `br`, `span`, `i`, `b`, `em`, `strong`, `a`, `blockquote`, `ul`, `ol`, `li`, `mark`, `s`, `sub`, `sup`, `del`, `ins`
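The two conditions could be combined roughly as in the sketch below. It is only an illustration: it uses the standard-library `html.parser` instead of lxml so that it runs without dependencies, and the names `classify_content`, `NON_TEXTUAL_TAGS` and the three return labels are hypothetical, not taken from the code base.

```python
from html.parser import HTMLParser

# Assumption: only the two tags named in this issue count as non-textual content.
NON_TEXTUAL_TAGS = {"img", "video"}


class _ContentScanner(HTMLParser):
    """Collects plain text and records whether a non-textual tag occurs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.has_media = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags also end up here.
        if tag in NON_TEXTUAL_TAGS:
            self.has_media = True

    def handle_data(self, data):
        self.text_parts.append(data)


def classify_content(html):
    """Return "text", "non_textual" or "empty" (hypothetical labels)."""
    scanner = _ContentScanner()
    scanner.feed(html)
    scanner.close()  # flush any buffered text
    # Condition 1: trim whitespace before comparing the plain text to "".
    if "".join(scanner.text_parts).strip() != "":
        return "text"
    # Condition 2: no text, so look for non-textual tags.
    return "non_textual" if scanner.has_media else "empty"
```

With a helper like this, only content classified as `"text"` would be sent to TextLab. In the real code, `HtmlElement.text_content()` would presumably replace the manual text collection.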
Describe the Bug
HIX value cannot be retrieved when a page has only a video in its content.
Steps to Reproduce
This is locally reproducible. Copy the source code of the page and paste it into a page in the local system; you'll see
HIX benchmark API call failed: <HTTPError 400: 'Bad Request'>
Expected Behavior
No error
Actual Behavior
Error appears
Additional Information
See also #2917. If some words are added to the content, the HIX value is retrieved and saved successfully.
Traceback
``` ```