Open dgtlmoon opened 7 months ago
tried latest elementpath 4.4.0
same result
But we pinned elementpath==4.1.5 https://github.com/dgtlmoon/changedetection.io/blob/e110b3ee93c6421a5aa6b946f05c4f7d42788f53/requirements.txt#L58
the error comes from elementpath.. tried different versions, same outcome...
this is my custom 45.13 container's pip package version.
aniso8601 9.0.1
apprise 1.7.2
arrow 1.3.0
attrs 23.2.0
Babel 2.14.0
beautifulsoup4 4.12.3
blinker 1.7.0
Brotli 1.1.0
certifi 2024.2.2
cffi 1.16.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
cryptography 3.4.8
decorator 5.1.1
dnspython 2.5.0
elementpath 4.2.1
et-xmlfile 1.1.0
feedgen 0.9.0
Flask 2.3.3
flask-babel 4.0.0
Flask-Compress 1.14
flask-expects-json 1.7.0
Flask-Login 0.6.3
flask-paginate 2023.10.24
Flask-RESTful 0.3.10
Flask-WTF 1.2.1
gevent 23.9.1
greenlet 3.0.3
h11 0.14.0
idna 3.6
iniconfig 2.0.0
inscriptis 2.4.0.1
itsdangerous 2.1.2
Jinja2 3.1.3
jinja2-time 0.2.0
jq 1.6.0
jsonpath-ng 1.5.3
jsonschema 4.17.3
loguru 0.7.2
lxml 5.1.0
Markdown 3.5.2
MarkupSafe 2.1.5
memory-profiler 0.61.0
oauthlib 3.2.2
openpyxl 3.1.2
outcome 1.3.0.post0
packaging 23.2
paho-mqtt 2.0.0
pillow 10.2.0
pip 23.2.1
playwright 1.41.2
pluggy 1.4.0
ply 3.11
psutil 5.9.8
pycparser 2.21
pyee 11.0.1
pyrsistent 0.20.0
PySocks 1.7.1
pytest 7.4.4
pytest-flask 1.3.0
python-dateutil 2.8.2
pytz 2024.1
PyYAML 6.0.1
requests 2.31.0
requests-oauthlib 1.3.1
selenium 4.14.0
setuptools 69.0.3
six 1.16.0
sniffio 1.3.0
sortedcontainers 2.4.0
soupsieve 2.5
timeago 1.0.16
trio 0.24.0
trio-websocket 0.11.1
types-python-dateutil 2.8.19.20240106
typing_extensions 4.9.0
urllib3 2.2.0
validators 0.22.0
Werkzeug 3.0.1
wheel 0.41.2
wsproto 1.2.0
WTForms 3.1.2
zope.event 5.0
zope.interface 6.1
this is my custom 45.13 container's pip package version.
Are you saying you can't reproduce the issue?
I can reproduce the problem. But it is quite weird.
With "Playwright Chromium/Javascript via 'ws://127.0.0.1:3000/?stealth=1&--disable-web-security=true'", elementpath works.
With "Basic fast Plaintext/HTTP Client", it fails with: 'str' object has no attribute '__name__'
?????
Then you need to compare the HTML from both: the Chrome JS-rendered version and the one fetched with curl.
Hi, I believe the bug originates in libxml2. See also https://gitlab.gnome.org/GNOME/libxml2/-/issues/716
I found a solution, but I need time to verify it.
I took a look at this just to try and brush up on my pdb skills.
The issue here is that lxml believes the HTML from that site is invalid. elementpath.select() assumes it is operating on a non-empty tree and does not handle the empty case correctly (this is where the exception comes from). I think one improvement changedetection.io could make here is to check parser.error_log for errors, maybe only for empty trees, as I'm not sure how noisy that error_log is or how often it's non-empty.
Here's where I attached the pdb:
@ezalenski try with python -m pdb -c 'b elementpath/tree_builders.py:229' and then run p [e for e in elem.itersiblings()] in pdb. That is the problem. See also https://gitlab.gnome.org/GNOME/libxml2/-/issues/716
Also, please take a look at my test in the PR.
I encountered the same issue. I'm working around it for now by using XPath 1.0, prepending xpath1: to the XPath rule.
Hi @amirt01 If you provide the example URL, I would be thankful!
Certainly @Constantin1489! I use changedetection.io to monitor company job sites like those hosted on Lever. I ran into this issue when filtering for the posting names with this rule:
//*[contains(@data-qa, 'posting-name')]
I was able to remedy this by changing the filter to:
xpath1://*[contains(@data-qa, 'posting-name')]
Here is an arbitrary example using Kinsta: Here is a link to the broken watch config. Here is a link to the fixed* watch config.
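The workaround described above amounts to adding the xpath1: prefix to a bare XPath rule. A minimal sketch of that string change follows; the helper name force_xpath1 is hypothetical and not part of changedetection.io, and the sketch only handles rules that carry no prefix or already carry xpath1:.

```python
XPATH1_PREFIX = "xpath1:"


def force_xpath1(rule: str) -> str:
    """Return the filter rule with the xpath1: prefix added if it is missing.

    Hypothetical helper for illustration only; it does not inspect or
    validate the XPath expression itself.
    """
    stripped = rule.strip()
    if stripped.startswith(XPATH1_PREFIX):
        return stripped  # already marked as XPath 1.0
    return XPATH1_PREFIX + stripped


print(force_xpath1("//*[contains(@data-qa, 'posting-name')]"))
```

Applying it twice leaves the rule unchanged, so it is safe to run over filters that were already fixed by hand.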
@amirt01 Thank you! The case you reported will be fixed with the https://github.com/dgtlmoon/changedetection.io/pull/2351
I also came across this issue; it's reproducible on my machine. The changedetection.io version is v0.45.22.
The CSS/JSONPath/JQ/XPath filter is something like //*[@id="Foobar"]/div[1]
I'm working around it for now by using XPath 1.0, prepending xpath1: to the XPath rule, just as @amirt01 did.
So it becomes something like xpath1://*[@id="Foobar"]/div[1]
@leiless would you run the following code, substituting your own URL?
URL='https://jobs.lever.co/kinsta/'
curl $URL | xmllint --html - --debug 2> /dev/null | grep 'ELEMENT html'
@Constantin1489, there is the ELEMENT html line:
$ curl -fsSL $URL | xmllint --html - --debug 2> /dev/null | grep 'ELEMENT html' -C10
HTML DOCUMENT
encoding=utf-8
URL=-
standalone=true
DTD(html)
ELEMENT html
ATTRIBUTE xmlns
TEXT
content=http://www.w3.org/1999/xhtml
TEXT
content=
ELEMENT head
ELEMENT meta
ATTRIBUTE http-equiv
TEXT
content=Content-Type
Would you please run the code without -C10?
$ curl -fsSL $URL | xmllint --html - --debug 2> /dev/null | grep 'ELEMENT html'
ELEMENT html
Yes, that is the problem I solved with the PR. @leiless why did you edit your comment? That is exactly the bug.
Anyway, if there is something like an iframe, its child ELEMENT html line doesn't have the same indentation.
Yes, that is the problem I solved with the PR. @leiless why you edited? That is exactly the bug.
Sometimes, I got this
ELEMENT html
ELEMENT html
But usually it's:
ELEMENT html
It's weird?
@leiless can you include the URL?
I'm not an expert. This is just my explanation, and it may be wrong at some points.
This is what the XPath 1.0 spec says (https://www.w3.org/TR/1999/REC-xpath-19991116/#root-node):
The root node is the root of the tree. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node. The root node also has as children processing instruction and comment nodes for processing instructions and comments that occur in the prolog and after the end of the document element.
This is similar to how the DOM or the XDM looks.
The point is that the element node for the document element is a child of the root node (root node != root element node). If we take this definition literally, XPath 1 is not possible in our cases. (But, you know, everybody uses lxml and libxml2 and benefits from them.)
When we send a document to xmllint, we can expect the repaired HTML document to have exactly one html element (the root element node).
The DOM is also important here. (https://www.w3.org/2008/08/cleantheweb/libxml)
When reading the document on the Web (likely to be invalid) and creating the DOM tree, clients have to recover for syntax errors. HTML 5 Parsing algorithm describes precisely how to recover from erroneous syntax.
So when we think of HTML that needs no further fixing (my term; in other words, complete HTML, also my term), it will have the same structure as the DOM.
So when the HTML 4 spec says "the html element is optional", it means something like: "I know you will open the HTML document in a browser. I will make it lemon juice." And the browsers create an html element to fix the document.
XPath 1, 2, 3, and 3.1 all expect only one root element node. That is why, if the document you receive has more than one root element, people on SO say,
But the method I chose is just to create a new_root object, add the multiple element nodes as its children, and set a fragment option. It won't re-parse the document, because I don't have any legitimate basis for forcing it into a single html root element node in this case.
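As a rough illustration of the wrapper-root idea described above, the sketch below uses only the standard library's xml.etree.ElementTree. It is not the code from the PR and does not use lxml or elementpath; the wrapper tag name fragment-root and the helper wrap_top_level are hypothetical. It only shows that, once several top-level elements hang under one new root, a single query can reach all of them without re-parsing.

```python
import xml.etree.ElementTree as ET


def wrap_top_level(elements):
    """Attach several top-level elements to one new wrapper root.

    Illustration only: the tag name "fragment-root" is made up here.
    """
    new_root = ET.Element("fragment-root")
    new_root.extend(elements)  # existing elements are reused, not re-parsed
    return new_root


# Two separately parsed top-level elements, standing in for the html
# element and the stray element that follows it.
html = ET.fromstring("<html><body><p>main</p></body></html>")
stray = ET.fromstring('<a href="/twaf_abc/twaf_abc.html">robots</a>')

root = wrap_top_level([html, stray])
print([el.tag for el in root])   # ['html', 'a']
print(root.find(".//a").text)    # robots
```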
So why does this happen? (Again, this is just my explanation and may be wrong at some points, but it eases my pain.) libxml2 describes its parser as "HTMLparser - interface for an HTML 4.0 non-verifying parser" (https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html). If you follow the link, you can also see what exactly an HTML element is there: XML elements and XML nodes. lxml is the same. (EDIT ADD: https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html#htmlNodePtr)
That is why XPath 2 to 3.1 works when only one html root element exists. Internally, the nodes are XML nodes.
BTW, why does libxml2 have an HTML 4.0 non-verifying parser? Maybe because of the browser wars at that time: there were multiple parsing rules, so many web devs sent HTML with the wrong syntax. Maybe that led to the development of the XHTML and HTML5 specs. (But, you know, everybody uses lxml and libxml2 and benefits from them.)
I think it is what it is.
Also, I have already reported this issue, so you don't have to.
@leiless can you include the URL?
https://www.pdrcfw.com/OurNews.aspx
XPath Filters: //*[@id="ArticleList1"]/div[1]
I got the result but after I tried to investigate, I got blocked..?
BTW, my private changedetection build has an HTML source API, and the source ends with:
</body>
</html>
<a href="/twaf_abc/twaf_abc.html" style="display:none">robots</a>
This is the problem.
If you run my command on it, the result shows two html root elements.
ADD: this is a small HTML snippet that reproduces it.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1"><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta content="keywords" name="浦东人才服务网,本站动态 " /><title>
浦东人才服务网
</title><link href="html/css/base.css" rel="stylesheet" />
<script src="/html/js/jquery-14B.min.js"></script>
</head>
<body>
</body>
</html>
<a href="/twaf_abc/twaf_abc.html" style="display:none">robots</a>
Yeah, it's definitely because the HTML source code is malformed (code after the final </html>).
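One way to spot this kind of malformed page before applying an XPath filter is to look for tags that open after the closing </html>. The sketch below does that with the standard library's html.parser. It is only an assumption about how such a check could look, not what changedetection.io, lxml, or elementpath do, and the names TrailingMarkupDetector and tags_after_html_close are hypothetical.

```python
from __future__ import annotations

from html.parser import HTMLParser


class TrailingMarkupDetector(HTMLParser):
    """Record start tags that appear after the closing </html> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.html_closed = False
        self.trailing_tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Any tag opened after </html> is markup outside the document element.
        if self.html_closed:
            self.trailing_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "html":
            self.html_closed = True


def tags_after_html_close(source: str) -> list[str]:
    """Return the names of tags that open after </html> in the source."""
    detector = TrailingMarkupDetector()
    detector.feed(source)
    detector.close()
    return detector.trailing_tags


sample = (
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body></body></html>\n"
    '<a href="/twaf_abc/twaf_abc.html" style="display:none">robots</a>'
)
print(tags_after_html_close(sample))  # ['a']
```

A non-empty result suggests the page has extra top-level content, which is the situation where the xpath1: workaround from earlier in the thread was needed.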
@leiless I also tried again today. Now it shows:
"Your access has been identified as an attack and logged"
So this may be why you sometimes get only one html root element.
The non-blocked version (the HTML source in https://github.com/dgtlmoon/changedetection.io/issues/2318#issuecomment-2133461050 ) will show another root element among the root element's siblings.
@Constantin1489 Great analysis! I wonder whether this bug will be fixed in the [next] release?
I submitted my PR. However, the maintainer needs time to confirm that the PR is the right solution.
https://www.pdrcfw.com/OurNews.aspx has a correct <html opening tag, but then...
</body>
</html>
<a href="/twaf_abc/twaf_abc.html" style="display:none">robots</a>
All versions?
using this shared watch https://changedetection.io/share/QtZ-94DW41sa
'str' object has no attribute '__name__'
error.. I tried different lxml library versions but that made no difference. https://www.depinte.be/werken and
//div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/div[1]
seems to come from here
https://github.com/dgtlmoon/changedetection.io/blob/e110b3ee93c6421a5aa6b946f05c4f7d42788f53/changedetectionio/html_tools.py#L128
Likely it is elementpath related