dgtlmoon / changedetection.io

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification
https://changedetection.io
Apache License 2.0
15.85k stars 884 forks source link

XPath3.1: mimic handling of multiple root element nodes #2351

Open Constantin1489 opened 2 months ago

Constantin1489 commented 2 months ago

Obviously, some web server provides broken html. The lxml and libxml2 fix it. It's good and indeed great!!! (We have been happy for decades!)

But, at the point, the error I want to solve occurs, the elementpath describes the DOM structure. it's because sometimes lxml or libxml2 returns multiple root element nodes when using html parser. (This could be a trace? of the browser wars. I don't remember the article but there were four kinds of html parser rules because of four major browsers.)

See also, https://gitlab.gnome.org/GNOME/libxml2/-/issues/716

So I mimicked it. The test I included describes the point.

fixes #2318

dgtlmoon commented 2 months ago

As an idea, what about having this enabled by default as a config option?

Constantin1489 commented 1 month ago

Valid HTML DOM are all alike; every non-valid HTML DOM is not valid and unhappy in its own way.

Since the suggestion will add another synthetic root element by default, the point is the initial context item.

For that, three possible options exist.

  1. the first is that (fragment True and the context item is the new root element).

  2. the second is that (fragment True and, the context item is dynamic). The latter means that if only one root element exists, then that is the context item. when multiple root elements exist, the context item will be the new_root element. So the second is meaningless. it would be the same with current PR.

  3. the third one is that fragment True and don't care about the number of root element as a context item and select one of them. It will reduce the accessible scope of information for multiple root element cases. this is unacceptable.

One may say that I chose the wrong html parser and I should choose etree.HTML. But, in this case the "html" tag root element node can have html tag as a child. In this case "/html/html"

To be honest, all sorts of solutions seem intuitively unfair even with mine. But libxml2 will develop html5.

As I mentioned, selecting one of the html "root element nodes" as a context item is not an option.

So, in the first method, the context item is one depth deeper(?) for all cases. (https://github.com/dgtlmoon/changedetection.io/actions/runs/9057420996/job/24881401637) e.g. manager[@name = 'Godot'] -> branches_to_visit/manager[@name = 'Godot'] So, making the default new root element node for all cases makes sense when the context item is the "new_root" element.

dgtlmoon commented 1 month ago

do we need this one? it could help https://github.com/dgtlmoon/changedetection.io/pull/2175 , thoughts?

Constantin1489 commented 1 month ago

the minimum version of elementpath is 4.2.1 because of https://github.com/Constantin1489/changedetection.io/actions

dgtlmoon commented 1 month ago

Looks OK to me, i guess it doubles the CPU usage for checking a watch right?

Constantin1489 commented 1 month ago

Good point. Since the new function will inevitably increase the usage, I chose just another method to increase speed. in my test, 507427 function calls (502074 primitive calls) in 0.374 seconds becomes 12909 function calls (12291 primitive calls) in 0.007 seconds

In the 138 tests in CI, it was 0.39 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9107046053/job/25035224418#step:9:3365). now it has become 0.31 sec(https://github.com/dgtlmoon/changedetection.io/actions/runs/9126805653/job/25095800223?pr=2351#step:9:3367) like this function doesn't exist.(https://github.com/dgtlmoon/changedetection.io/actions/runs/9124493905/job/25088778195#step:9:3292)

Constantin1489 commented 1 month ago

BTW, I couldn't find any evidence that lxml parses the content again. Could you provide me an example of double CPU usage?

Constantin1489 commented 1 month ago

BTW, the reason why I don't do lazy import is because it is Python. I'm not an expert of php but a SOF user said this point (https://stackoverflow.com/a/10084940/20307768)

dgtlmoon commented 1 month ago

could you merge current master into this branch so we can test again? thanks!

dgtlmoon commented 1 month ago

So this is nearly always caused by a missing <html open tag right?

Constantin1489 commented 1 month ago

@dgtlmoon As I posted to https://gitlab.gnome.org/GNOME/libxml2/-/issues/716, minimal codes are

<!DOCTYPE HTML>
<html></html>
<link href="/example/uri" rel="stylesheet" type="text/css" />

OR

<!DOCTYPE HTML>
<html></html>
Some string

OR

<!DOCTYPE HTML>
<html></html>
<Some/>

In this case, libxml2, and lxml returns two html root element nodes.

cat <<EOF | xmllint --html - --output
<!DOCTYPE HTML>
<html></html>
Some string
EOF

or

cat <<EOF | xmllint --html - --output
<!DOCTYPE HTML>
<html></html>
<Some/>     
EOF
Constantin1489 commented 1 month ago

I mean this is an awesome algorithm! How much wealth has been generated over the decades?

dgtlmoon commented 1 week ago

please, could you update this with latest master ?