Closed. domenic closed this issue 10 months ago.
Does nwmatcher mean nwsapi? I haven't seen the source of nwmatcher. Assuming that you are talking about nwsapi, I will give my personal opinion below.
I think it's fast because it doesn't generate an AST from the given CSS selector string, but directly applies the regular expression to the string.
However, as is typical of this kind of library, it relies on negated character classes like `[^xyz]`, which can potentially lead to ReDoS.
https://github.com/dperini/nwsapi/blob/master/src/nwsapi.js#L81
If you try linting with e.g. eslint-plugin-regexp, I suspect you will get a lot of warnings.
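As a toy illustration of what such a linter flags (this pattern is NOT taken from nwsapi), catastrophic backtracking typically comes from ambiguous nested quantifiers; a negated class on its own matches in linear time, but combined with such quantifiers it can feed the blow-up:

```javascript
// Toy example of a ReDoS-prone pattern: the nested quantifiers give the
// engine exponentially many ways to split a run of 'a's, so a long
// non-matching input triggers catastrophic backtracking.
const evil = /^(a+)+$/;

evil.test('aaaa'); // matches quickly
evil.test('aaa!'); // small inputs are fine, but long runs of 'a'
                   // followed by '!' take exponential time
```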
Another reason for the speed may be that the Function() itself is generated from the string. But it doesn't seem like the input strings are being sanitized, which is also a concern.
https://github.com/dperini/nwsapi/blob/master/src/nwsapi.js#L760
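For illustration, here is a minimal sketch of the compile-to-`Function()` technique; this is not nwsapi's actual generated code, and `compileClassMatcher` is a made-up helper. The `JSON.stringify` escaping shown is exactly the kind of sanitization the concern above is about:

```javascript
// Hypothetical sketch of compiling a matcher from a source string.
// JSON.stringify() escapes the class name before embedding it in the
// source; without such escaping, a crafted selector string could
// inject arbitrary code into the generated function.
function compileClassMatcher(className) {
  const src = `return el.classList.contains(${JSON.stringify(className)});`;
  return new Function('el', src);
}

// Stub object standing in for a DOM element:
const el = { classList: { contains: c => c === 'classA' } };
const matches = compileClassMatcher('classA');
```

The payoff of this technique is that the string-parsing cost is paid once per selector; the compiled function can then be cached and reused for every element tested.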
Sorry, yes, I think that nwmatcher was the old name.
> I think it's fast because it doesn't generate an AST from the given CSS selector string, but directly applies the regular expression to the string.
When looking at profiles, I don't think the AST building showed up as a cost? Or maybe I don't know which parts of dom-selector are the AST-building parts. The hot code was in `_collectNodes`, `_collectNthOfType`, `_matchPseudoClassSelector`, etc. The AST building is inside css-tree, right? Nothing in css-tree showed up in the profiles. And `parser.js` shows up as taking 0.1% of the total time.
> However, as is typical of this kind of library, it relies on negated character classes like `[^xyz]`, which can potentially lead to ReDoS.
I feel like ReDoS is a fake vulnerability made up to generate work for security-scanning companies and lint-plugin writers... for jsdom at least it is not in our threat model. Complaining that someone might be able to make your code slow via a regexp is not a good reason to use something slower than a regexp :)
> Another reason for the speed may be that the Function() itself is generated from the string. But it doesn't seem like the input strings are being sanitized, which is also a concern.
I think the input strings are being generated via nwsapi itself, so it should be fine.
Overall, looking at the profiles for nwsapi vs. dom-selector, I am just struck by how much more work dom-selector does and how different the call stacks look. I think it might be educational for you to focus on one case and just look at exactly how nwsapi does it and see if it's possible to adopt that approach to dom-selector.
nwsapi profile: [flamegraph image]
dom-selector profile: [flamegraph image]
It looks to me like dom-selector is crawling the DOM tree a lot more, maybe?
And the reason why dom-selector is slow is that:
- For example, if you give `div.classA#id[attr]:nth-child(2n+1)`, it will loop at most 5 times against the same element (tag, class, id, attr, pseudo-class). I'm investigating whether it can be done in a worker thread, but I haven't been successful yet.
- In `querySelectorAll()`, it relies on getElementsByXXX, but when converting the HTMLCollection returned by getElementsByXXX to an array, the conversion time increases with the number of elements. It's still better than traversing the whole DOM tree though.

> The AST building is inside css-tree, right? Nothing in css-tree showed up in the profiles. And parser.js shows up as taking 0.1% of the total time.

True, but testing the AST against each node takes time...
> For example, if you give `div.classA#id[attr]:nth-child(2n+1)`, it will loop at most 5 times against the same element (tag, class, id, attr, pseudo-class).
Oh, wow. There must be a better way to do this, right? Like, by parsing out the selector ahead of time, you should be able to figure out the 5 checks you need to apply. Then, you can loop through each node once, and apply all 5 checks at once.
I think this is by far the most promising option.
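A minimal sketch of that idea, with made-up names (this is not dom-selector's API): parse the compound selector once into an array of predicate functions, then test each candidate element in a single pass.

```javascript
// Hypothetical: `parts` is the pre-parsed result for a compound selector
// such as div.classA#id[attr]; each piece becomes one predicate, built once.
function compileCompound(parts) {
  const checks = [];
  if (parts.tag) checks.push(el => el.localName === parts.tag);
  if (parts.id) checks.push(el => el.id === parts.id);
  for (const cls of parts.classes ?? []) {
    checks.push(el => el.classList.contains(cls));
  }
  for (const attr of parts.attrs ?? []) {
    checks.push(el => el.hasAttribute(attr));
  }
  // :nth-child() and friends would be compiled into predicates the same way.
  // One pass per element; every() short-circuits on the first failing check.
  return el => checks.every(check => check(el));
}
```

Each element is then visited once, and the per-element cost is bounded by the number of simple selectors rather than by repeated traversals, while still failing fast.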
> when converting the HTML collection of getElementsByXXX to an array
Why do you need to convert it to an array? Can't you operate directly on the HTMLCollection? Sure, HTMLCollections are not as nice as arrays, but it doesn't seem worth paying the cost.
I'm converting to an array to simplify the iteration in the next step of processing, i.e. `for (const item of arr) {}`.
But:

```javascript
for (let i = 0; i < l; i++) {
  let item;
  if (Array.isArray(arr)) {
    item = arr[i];
  } else {
    item = arr.item(i);
  }
}
```
Maybe it would be better to do the above; I'll run a benchmark and compare. Thanks.
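One way to avoid the copy entirely, sketched here as a hypothetical helper (not dom-selector's actual code): a generator that walks either an Array or an `item()`/`length` collection, so the downstream code can keep using `for...of` unchanged.

```javascript
// Iterates an Array or any item()/length collection (like an HTMLCollection)
// lazily, without first converting it to an array.
function* iterate(listLike) {
  if (Array.isArray(listLike)) {
    yield* listLike;
  } else {
    const len = listLike.length;
    for (let i = 0; i < len; i++) {
      yield listLike.item(i);
    }
  }
}

// Stub collection standing in for an HTMLCollection:
const collection = { length: 2, item(i) { return ['a', 'b'][i]; } };
const seen = [...iterate(collection)]; // ['a', 'b']
```

This keeps the branch out of the hot loop body and skips the O(n) conversion, at the cost of generator overhead per element, which is worth benchmarking either way.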
Hmm... not a big difference.
Above: converting the HTMLCollection. Below: using the HTMLCollection directly.
How about using nwsapi only when the selectors are supported, and using dom-selector for the rest? Not tested, but maybe something like below.
```javascript
// Filter selectors that nwsapi does not support or cannot handle correctly:
// namespaced selector e.g. ns|E, pseudo-element e.g. ::slotted,
// :focus, :indeterminate, :placeholder-shown, :scope,
// :dir(), :lang(), :has(), :is(), :where(),
// :not(complex selector), :not(:pseudo-class), :not(attr),
// :nth-child(an+b of selector), :nth-last-child(an+b of selector),
// :host, :host(), :host-context(),
// [attr i], [attr s]
if (!/\||:(?::|focus|indeterminate|placeholder|scope|dir|lang|has|is|where|not\(\s*(?:\*|[\.#]?[\da-z-_]+(?:[\.#][\da-z-_]+)*)(?:\s*,\s*(?:\*|[\.#]?[\da-z-_]+(?:[\.#][\da-z-_]+)*))*\s*\)|nth-(?:last-)?child\(.+\sof.+\)|host)|\s[is]\s*\]/i.test(selector)) {
  // nwsapi
} else {
  // dom-selector
}
```
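For illustration, here is how such a dispatch test would route a few selectors; the regex below is deliberately abbreviated, not the full pattern above:

```javascript
// Abbreviated dispatch check: selectors matching this pattern would be
// routed to dom-selector, everything else to nwsapi.
const needsDomSelector = /\||:(?::|focus|scope|dir|lang|has|is|where|host)/i;

needsDomSelector.test('.foo > .bar');     // false → nwsapi
needsDomSelector.test('::slotted(span)'); // true  → dom-selector
needsDomSelector.test('div:has(> img)');  // true  → dom-selector
```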
I'm not very excited about that approach; it seems likely to be fragile, with developers ending up switching between the engines non-transparently. I'm hopeful we can have a selector engine in jsdom which uses a fast-by-design architecture.
Did you make any progress on the approach I suggested above? Or is it not good?
> Oh, wow. There must be a better way to do this, right? Like, by parsing out the selector ahead of time, you should be able to figure out the 5 checks you need to apply. Then, you can loop through each node once, and apply all 5 checks at once.
I do not think it's fragile. I have submitted a new pull request: https://github.com/jsdom/jsdom/pull/3678 Performance remains high for selectors that nwsapi supports. Frankly speaking, I don't think nwsapi can support Selectors level 4, so the scope of using nwsapi will not expand any further.
> Did you make any progress on the approach I suggested above? Or is it not good?
It's still running in a loop, but it's set up to fail fast, so as much optimization as possible has already been done.
It's possible nwsapi cannot support selectors level 4. But it's clear that it's possible to create fast software that does, since we have 3 browser engines all implementing the spec. So I am not willing to incorporate software with fundamentally slow architecture, when we know that fast is possible.
Got it. Unfortunately, I don't think dom-selector can be a help then.
I would like to leave a feedback.
The current issue that jsdom wants to solve is:

- Extended support for selector level 4

Right?
And what you can use now as tools are:
There is currently no tool that can support selector level 4 with high performance (at least, none has been found). If so, wouldn't it make more sense to use the tools currently available than to wait for what you don't have and leave the current issue unsolved? And once "a tool that supports selector level 4 with very fast performance" comes out, wouldn't it be enough to abandon nwsapi and dom-selector at that time?
I don't care whether dom-selector will be included or not. I just wish jsdom to be more convenient because I rely on it heavily.
> The current issue that jsdom wants to solve is:
> - Extended support for selector level 4
>
> Right?
This is not really accurate.
jsdom is a volunteer project, with many competing priorities. Given limited maintainer bandwidth, one of the main goals is minimizing maintenance burden, via solid architecture. A big part of this is minimizing dependencies.

Having two selector engines instead of one is problematic from this point of view. Having a selector engine with a bad-performance-by-design architecture, in particular, is going to be a significant burden going forward for us. (As we've seen from the time spent churning on jsdom 23.2.0 to 24.0.0.) This is true even if that engine only kicks in sometimes. Actually, that's especially true: it would mean the maintainers need to know exactly what causes the different engines to kick in, e.g. to triage whether a given bug report needs to be reported against selector engine 1 vs. 2, or maybe versus the logic that determines which selector engine to dispatch to.
We'd rather maintain tight control over our dependencies, ensuring that they're all designed to work well with jsdom, including from a performance perspective. Maintainability is more important than features.
That's why I opened this issue, to encourage you to adopt some of the techniques nwsapi (or browsers themselves) use. I suspect that if you look at exactly how nwsapi accomplishes its speed, e.g. by tracing through all code paths taken for the simpler selectors, you can find ways to incorporate that into dom-selector's architecture. This seems especially likely to me since per the performance flamegraphs discussed above, the parsing is not a bottleneck; it's all about faster DOM retrieval and traversal techniques.
I don't agree with some things you say, but I understand.
As an alternative, I decided to include nwsapi in dom-selector. Performance should be maintained as before in most cases, because nwsapi handles what it can handle. See Performance. However, please note that the performance issue is not completely resolved.
It's up to you whether you want to add dom-selector back to jsdom. If you give the go-ahead, I will send a PR.
nwsapi is not good at handling child and/or descendant combinators such as `.foo > .bar` and `.foo .bar` in querySelector() and querySelectorAll(). By handling that part with dom-selector so that the two complement each other, the performance improved.
https://github.com/asamuzaK/domSelector/actions/runs/7898913102/job/21557270583#step:5:76
https://github.com/asamuzaK/domSelector/actions/runs/7898913102/job/21557270583#step:5:100
It is also noticeable in the sizzle benchmark. https://github.com/asamuzaK/domSelector/actions/runs/7898913102/job/21557270583#step:6:10 https://github.com/asamuzaK/domSelector/actions/runs/7898913102/job/21557270583#step:6:16
However, the performance will be worse than before for selectors with nested logical combinators, such as `:not(:is(.foo, .bar))`.
nwsapi has bugs in :is(), :where() and :not().
In such a situation, it is almost impossible for me to find a way for nwsapi to handle it correctly using only regular expressions or other convenient methods.
Fixed above.
I have a small idea for performance work, which may be good or may be bad. I wanted to share it with you in case it was helpful.
Basically, if you take a benchmark where nwmatcher is fast, have you looked at how it does its work? And how that work compares to dom-selector? Can dom-selector switch to nwmatcher's strategy? Doing so might require ugly special cases or fast paths, but maybe it is worth it.
Feel free to close this if it is not helpful.