causal-agent / scraper

HTML parsing and querying with CSS selectors
https://docs.rs/scraper
ISC License

Support for `:has()` selector #169

Open 124C41p opened 5 months ago

124C41p commented 5 months ago

Hi, do you plan to support the :has() selector? To my understanding, this CSS pseudo-class is needed to select an element based on what it contains, for example to select the parent of another known element.

Consider the following example:

<div>
    <div id="foo">
        Hi There!
    </div>
</div>
<ul>
    <li>first</li>
    <li>second</li>
    <li>third</li>
</ul>

In order to select the second list item, I would like to use the following selector:

let selector = Selector::parse("div:has(div#foo) + ul > li:nth-child(2)").unwrap();

However, this line panics as of scraper version 0.18.1: Selector::parse returns an error, and unwrap() turns that error into a panic.
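To spell out what the selector above is meant to express, here is a minimal std-only sketch that walks a hand-built tree. Everything in it (`Node`, `has_div_with_id`, `second_item_after_foo_parent`, `example_document`) is a hypothetical name made up for illustration; none of it is scraper's API, and for brevity it only inspects the direct children of the root rather than the whole document.

```rust
/// Hypothetical minimal element tree; not scraper's `ElementRef`.
#[derive(Debug)]
struct Node {
    tag: &'static str,
    id: Option<&'static str>,
    text: &'static str,
    children: Vec<Node>,
}

impl Node {
    fn new(
        tag: &'static str,
        id: Option<&'static str>,
        text: &'static str,
        children: Vec<Node>,
    ) -> Self {
        Node { tag, id, text, children }
    }
}

/// `:has(div#foo)`: true if any descendant is a `div` with the given id.
fn has_div_with_id(node: &Node, id: &str) -> bool {
    node.children
        .iter()
        .any(|c| (c.tag == "div" && c.id == Some(id)) || has_div_with_id(c, id))
}

/// Emulates `div:has(div#foo) + ul > li:nth-child(2)` among the children of `root`.
fn second_item_after_foo_parent(root: &Node) -> Option<&Node> {
    root.children.windows(2).find_map(|pair| {
        let (prev, next) = (&pair[0], &pair[1]);
        // `div:has(div#foo) + ul`: a `ul` immediately preceded by a matching `div`.
        if prev.tag == "div" && has_div_with_id(prev, "foo") && next.tag == "ul" {
            // `> li:nth-child(2)`: the second child, provided it is an `li`.
            next.children.get(1).filter(|c| c.tag == "li")
        } else {
            None
        }
    })
}

/// Builds the example document from this issue under a synthetic root.
fn example_document() -> Node {
    let foo = Node::new("div", Some("foo"), "Hi There!", vec![]);
    let outer = Node::new("div", None, "", vec![foo]);
    let items: Vec<Node> = ["first", "second", "third"]
        .iter()
        .map(|&t| Node::new("li", None, t, vec![]))
        .collect();
    let list = Node::new("ul", None, "", items);
    Node::new("body", None, "", vec![outer, list])
}
```

Calling `second_item_after_foo_parent(&example_document())` yields the `li` whose text is "second", which is the element the selector is intended to pick.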

adamreichold commented 5 months ago

I think support for this is still missing in our upstream selectors dependency, at least in the version published on crates.io.

cyqsimon commented 4 months ago

+1. I'm trying to scrape Wikipedia, which has this sort of nesting. For example:

<h2>
  <span class="mw-headline" id="Registered_ports">Registered ports</span>
  <!-- ... -->
</h2>

The selector h2:has(#Registered_ports) ~ .wikitable.sortable would match the sortable wikitables that follow this h2 as siblings, and the first match is the table I want. That is a good way to locate the content in lieu of a distinctive id/class on the table itself.
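As a rough illustration of how the general sibling combinator behaves here, the following std-only sketch models one list of sibling elements. `Sibling` and `tables_after_heading` are hypothetical names invented for this example, and the `descendant_ids` field is a stand-in for what `:has(#id)` would check; this is not how scraper or selectors represent a document.

```rust
/// Hypothetical flattened view of sibling elements; not scraper's API.
struct Sibling {
    tag: &'static str,
    /// Ids of elements contained in this sibling (stand-in for `:has(#id)`).
    descendant_ids: Vec<&'static str>,
    classes: Vec<&'static str>,
}

/// Emulates `h2:has(#<heading_id>) ~ .wikitable.sortable` over one sibling
/// list, returning the index of every match in document order.
fn tables_after_heading(siblings: &[Sibling], heading_id: &str) -> Vec<usize> {
    // Locate the `h2` that contains the id; `~` only considers later siblings.
    let start = match siblings
        .iter()
        .position(|s| s.tag == "h2" && s.descendant_ids.iter().any(|id| *id == heading_id))
    {
        Some(i) => i,
        None => return Vec::new(),
    };
    siblings
        .iter()
        .enumerate()
        .skip(start + 1)
        .filter(|(_, s)| s.classes.contains(&"wikitable") && s.classes.contains(&"sortable"))
        .map(|(i, _)| i)
        .collect()
}
```

The combinator matches every qualifying later sibling, so taking only the first element of the result corresponds to "the first table after this h2".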

nicoburns commented 4 months ago

From what I can see, selectors 0.25 (published to crates.io) does have :has support; see https://docs.rs/selectors/latest/selectors/parser/enum.Component.html#variant.Has. There do seem to be performance improvements in more recent, unreleased commits, though.

nathaniel-daniel commented 4 months ago

https://github.com/servo/servo/issues/25133

jameshurst commented 1 week ago

I took a look at adding :is() support, and it seems that both :is() and :has() are already supported by selectors. The Parser impl needs to enable support by implementing parse_is_and_where and parse_has.

fn parse_is_and_where(&self) -> bool {
    true
}

fn parse_has(&self) -> bool {
    true
}

@causal-agent Would it be safe to enable support for these selectors? I can make a PR with these changes unless these selectors are disabled for a reason.
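For readers unfamiliar with the pattern, the opt-in described above can be sketched with a stand-in trait. `FeatureFlags`, `DefaultParser`, `OptInParser`, and `accepts_pseudo_class` are hypothetical names for illustration only; the two method names come from the comment above, and the assumption that their defaults return `false` reflects my understanding of the selectors crate rather than anything verified here.

```rust
/// Hypothetical stand-in for the parser trait in the `selectors` crate: each
/// optional feature is a method with a default body (assumed to return `false`).
trait FeatureFlags {
    fn parse_is_and_where(&self) -> bool {
        false
    }

    fn parse_has(&self) -> bool {
        false
    }
}

/// Relies on the defaults, so neither pseudo-class is accepted.
struct DefaultParser;

impl FeatureFlags for DefaultParser {}

/// Overrides both methods, which is the change proposed above.
struct OptInParser;

impl FeatureFlags for OptInParser {
    fn parse_is_and_where(&self) -> bool {
        true
    }

    fn parse_has(&self) -> bool {
        true
    }
}

/// Toy gate showing how a parser might consult the flags before accepting a
/// functional pseudo-class name; the real logic lives inside `selectors`.
fn accepts_pseudo_class<P: FeatureFlags>(parser: &P, name: &str) -> bool {
    match name {
        "is" | "where" => parser.parse_is_and_where(),
        "has" => parser.parse_has(),
        _ => false,
    }
}
```

With this shape, `accepts_pseudo_class(&DefaultParser, "has")` is false and `accepts_pseudo_class(&OptInParser, "has")` is true, so existing implementors keep their behaviour until they override the methods.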

adamreichold commented 6 days ago

> The Parser impl needs to enable support by implementing parse_is_and_where and parse_has.

Thank you for looking into this!

> Should it be safe to enable support for these selectors? I can make a PR with these changes unless these selectors are not enabled for a reason.

I think only tests will answer that. Please open a PR, ideally including a test case. I can then also give it a spin in a code base containing a fairly diverse set of scrapers and see whether anything breaks that the tests here do not catch.

cfvescovo commented 6 days ago

@jameshurst when your PR is ready, tag me. I will run some tests and review it ASAP.