facelessuser / soupsieve

A modern CSS selector implementation for BeautifulSoup
https://facelessuser.github.io/soupsieve/
MIT License

Add a custom :contains-regexp() pseudo class? #117

Open facelessuser opened 5 years ago

facelessuser commented 5 years ago

This is currently open as an exploratory idea: a custom pseudo-class that would allow regular expression searches of content. The idea would probably not be to include regular expressions directly in the selector, but most likely to reference compiled patterns by name:

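import re
import soupsieve as sv
# Proposed API only: the `regexp` argument and `:-regexp()` do not exist yet.
pattern = re.compile(r'some .*? pattern')
regexp = {'content_pattern': pattern}
sv.compile('p:-regexp(content_pattern)', regexp=regexp)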

Do we make this like :contains() and have it search all children of p for the pattern, or do we constrain it to the text of the target element p itself? Or do we have two variants, one that searches all children and one that searches only the target: :-regexp() and :-regexp-direct() (or some other name that gets the idea across)?
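
To make the two candidate semantics concrete, here is a minimal sketch using plain Beautiful Soup calls rather than any new selector. The helper names `matches_all_text` and `matches_direct_text` are hypothetical and only illustrate the difference between searching all nested text and searching the target element's own text nodes.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = '<div><p id="a">order <span>test-42</span></p><p id="b">test-7 here</p></div>'
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'test-\d+')


def matches_all_text(tag, pattern):
    """Hypothetical ':-regexp()' behavior: search text of the tag and all nested tags."""
    return pattern.search(tag.get_text()) is not None


def matches_direct_text(tag, pattern):
    """Hypothetical ':-regexp-direct()' behavior: search only the tag's own text nodes."""
    direct = ''.join(c for c in tag.children if isinstance(c, NavigableString))
    return pattern.search(direct) is not None


paragraphs = soup.find_all('p')
all_ids = [p['id'] for p in paragraphs if matches_all_text(p, pattern)]
direct_ids = [p['id'] for p in paragraphs if matches_direct_text(p, pattern)]
```

In this sketch the first paragraph only matches under the "all children" reading, because its matching text lives inside the nested span.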

Anyway, this is just an idea, but maybe in the future (if we flesh this out enough) we can implement it.

facelessuser commented 5 years ago

It's important to note that Beautiful Soup already provides regex support, so we don't strictly need this, but it might be nice to incorporate regex into selectors in some way as well. We just need to decide whether we are willing to commit to a solution, and what that solution should look like.
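
For reference, a short sketch of the regex support Beautiful Soup already offers through `find_all`, independent of selectors. The HTML and the patterns are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<article>
  <p data-item="150">test-abc</p>
  <p data-item="250">test-xyz</p>
  <p data-item="101">other</p>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

text_pattern = re.compile(r'test-[a-z\d]+', re.I)
attr_pattern = re.compile(r'1[0-9]{2}')

# Match <p> tags whose single text child matches the pattern.
by_text = soup.find_all('p', string=text_pattern)

# Match <p> tags whose data-item attribute matches the pattern.
by_attr = soup.find_all('p', attrs={'data-item': attr_pattern})

# Combine both filters in one call.
both = soup.find_all('p', string=text_pattern, attrs={'data-item': attr_pattern})
```

What this cannot do is combine the regex filters with selector features such as combinators, which is the gap the idea in this issue would fill.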

facelessuser commented 5 years ago

If we do this, a name like :contains-regexp() might be more descriptive and make more sense.

When defining regex keywords, should we require them to be in the form of custom CSS variables: --regex-key? As far as I know, we will never really have a need for regex variables in our scheme. Maybe we should require some other kind of variable prefix $key 🤷‍♂️ .

Or we could extend custom maybe? If you give a regex pattern instead of selector string, it searches a tag's content? Just some ideas.

facelessuser commented 5 years ago

Thinking about this more, we really could use custom selectors to do regex. Currently we take a string for a given custom pseudo-class, but we could accept a custom pseudo-class object as well. The object could take a selector and a text search value (a regex or a string). You could even extend it to allow attribute values as well.

So just thinking out loud here, assuming custom is a hashable object:

import soupsieve as sv
import re

custom = {
    ':--custom-pseudo': sv.CustomPseudo(
        'p.class',
        text=re.compile(r'test-[a-z\d]+', re.I),
        attr={'data-item': re.compile(r'1[0-9]{2}')}
    )
}

sv.compile('article div > :--custom-pseudo', custom=custom)
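
As a rough illustration of what such an object would have to do, the behavior can be approximated today by selecting first and filtering afterwards. This is only a sketch: `select_with_patterns` is a hypothetical helper, not part of Soup Sieve, and `sv.CustomPseudo` above remains a proposal.

```python
import re
from bs4 import BeautifulSoup


def select_with_patterns(soup, selector, text=None, attr=None):
    """Hypothetical helper: select by CSS, then filter by text and attribute regexes."""
    results = []
    for tag in soup.select(selector):
        # Text filter: search the tag's full text content.
        if text is not None and not text.search(tag.get_text()):
            continue
        # Attribute filter: every named attribute must be a string that matches.
        if attr and not all(
            isinstance(tag.get(name), str) and pattern.search(tag.get(name))
            for name, pattern in attr.items()
        ):
            continue
        results.append(tag)
    return results


html = """
<article><div>
  <p class="class" data-item="123">TEST-a1</p>
  <p class="class" data-item="223">test-b2</p>
  <p class="class" data-item="199">nothing</p>
  <p data-item="150">test-c3</p>
</div></article>
"""
soup = BeautifulSoup(html, 'html.parser')

matched = select_with_patterns(
    soup,
    'article div > p.class',
    text=re.compile(r'test-[a-z\d]+', re.I),
    attr={'data-item': re.compile(r'1[0-9]{2}')},
)
```

The drawback of filtering after the fact is that the regex conditions cannot take part in the selector itself, for example inside `:not()` or `:has()`.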

It may even be possible to allow a custom function, but I'm not sure yet. As long as things remain hashable and pickle-able, it would be doable, but I imagine passing in a function may not always behave properly, as the patterns get cached. Caching a pattern with a function does not guarantee you'd get the same behavior... I think I'd pass on functions for now.
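
A small standard-library sketch of that concern, assuming nothing about how Soup Sieve's cache is actually implemented: compiled patterns hash, compare, and pickle predictably, while functions either fail to pickle (lambdas) or keep the same hash even when their behavior changes. The names `can_pickle`, `threshold`, and `long_text` are illustrative only.

```python
import pickle
import re

pattern = re.compile(r'test-[a-z\d]+', re.I)

# Compiled patterns survive a pickle round trip and keep the same hash,
# so they are safe to use as part of a cache key.
restored = pickle.loads(pickle.dumps(pattern))
pattern_round_trips = restored == pattern and hash(restored) == hash(pattern)


def can_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Lambdas cannot be pickled by the standard pickle module.
lambda_pickles = can_pickle(lambda tag: True)

# A function is hashable, but its hash says nothing about its behavior:
# here the result changes while the cache key would stay the same.
threshold = {'min': 3}


def long_text(text):
    return len(text) >= threshold['min']


key_before = hash(long_text)
result_before = long_text('abcd')
threshold['min'] = 10
key_after = hash(long_text)
result_after = long_text('abcd')

same_key_different_behavior = key_before == key_after and result_before != result_after
```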

facelessuser commented 5 years ago

Another possibility is to extend :contains() and the attribute equality case to accept custom template variables: $var.

You would define regular expressions under custom variable names, each a valid identifier, and reference them in the selector with a $ prefix.

regexp = {
    'content-pattern': re.compile(r'test-[a-z\d]+', re.I),
    'attr-pattern': re.compile(r'1[0-9]{2}')
}

sv.compile('p:contains($content-pattern)[data-item=$attr-pattern]', regexp=regexp)

Maybe this is the most straightforward approach? If nothing else, it is another option. Custom patterns may still need a way to provide regex when defining them.
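
A minimal sketch of how `$var` references might be resolved against the `regexp` mapping, assuming variable names are identifiers that may contain hyphens. The `VARIABLE` pattern and `resolve_variables` helper are hypothetical and not part of Soup Sieve.

```python
import re

# Assumed syntax: '$' followed by an identifier that may contain hyphens.
VARIABLE = re.compile(r'\$([A-Za-z_][\w-]*)')


def resolve_variables(selector, regexp):
    """Map each $name used in the selector to its compiled pattern."""
    resolved = {}
    for name in VARIABLE.findall(selector):
        if name not in regexp:
            raise KeyError(f'undefined pattern variable: ${name}')
        resolved[name] = regexp[name]
    return resolved


regexp = {
    'content-pattern': re.compile(r'test-[a-z\d]+', re.I),
    'attr-pattern': re.compile(r'1[0-9]{2}'),
}

found = resolve_variables(
    'p:contains($content-pattern)[data-item=$attr-pattern]', regexp
)
```

Failing early on an undefined name keeps a typo in the selector from silently matching nothing.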

facelessuser commented 4 years ago

If we end up doing #175, this would not be needed.