Check if search index out of bounds

James-LG / Skyscraper

Rust library for scraping HTML using XPath expressions

MIT License

30 stars, 3 forks

Check if search index out of bounds #21

Closed masc-it closed 6 months ago

masc-it commented 7 months ago

Problem

Say I want to search for strong[1] under a parent //div, but some of the divs don't contain a strong element. As the search is currently implemented, the code panics because it does not handle out-of-bounds indexing.

Solution

Add a simple bounds check on the search index. I didn't fix it at a deeper level, in DocumentNodeSet, since raw indexing is used heavily in many places and fixing it there would take more effort.
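For illustration, a bounds-checked lookup along these lines could look like the sketch below. It is not the code from this PR: `nth_match` is a hypothetical helper, and the only assumption is that the matches are available as a slice. It relies on the standard `slice::get`, which returns `None` where `matches[i]` would panic.

```rust
/// Hypothetical helper, not part of Skyscraper's API.
/// Resolves a 1-based XPath position (the `1` in `strong[1]`) against a
/// slice of matches, returning `None` instead of panicking when the
/// position is out of range.
pub fn nth_match<T>(matches: &[T], xpath_index: usize) -> Option<&T> {
    // XPath positions start at 1, so position 0 never selects anything.
    let zero_based = xpath_index.checked_sub(1)?;
    // `slice::get` does the bounds check that `matches[zero_based]` lacks.
    matches.get(zero_based)
}
```

A caller can then skip a div that has no strong child by handling the `None` case, rather than crashing the whole search.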

masc-it commented 7 months ago

@James-LG that's awesome news, thanks for your effort!

Over the last week I did a deep dive into the current main branch and noticed a few things, which you may already be addressing in the new version:

- contains(@attribute, 'value') filter
  - I've implemented a working solution offline, by the way
  - As an alternative, one could implement a contains operator that is much easier to parse, CSS-style, with *= (it is not XPath compliant, but it may be worth it for a simpler implementation and a seamless transition in case of a CSS migration!)
- and / or in predicates
- the possibility to use @text as an attribute, just like you can in Chrome (e.g. //div[contains(@text, 'nice')])
- and yeah, indexing has a very weak implementation, but we already know that.
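To make the predicate ideas above concrete, here is a minimal sketch of how such filters might be evaluated once parsed. Everything in it is hypothetical: `Element`, `Predicate`, and `matches` are illustration-only names, not Skyscraper types, and treating `text` as a pseudo-attribute is this sketch's own assumption about how an @text extension could behave.

```rust
use std::collections::HashMap;

/// Hypothetical element model, for illustration only.
pub struct Element {
    pub attributes: HashMap<String, String>,
    pub text: String,
}

/// Hypothetical predicate tree covering the features listed above.
pub enum Predicate {
    /// `contains(@name, 'needle')`
    Contains { name: String, needle: String },
    /// `left and right`
    And(Box<Predicate>, Box<Predicate>),
    /// `left or right`
    Or(Box<Predicate>, Box<Predicate>),
}

impl Predicate {
    /// Returns true when `element` satisfies this predicate.
    pub fn matches(&self, element: &Element) -> bool {
        match self {
            Predicate::Contains { name, needle } => {
                // Assumption: `@text` is a pseudo-attribute that reads the
                // element's text instead of a real attribute.
                let haystack = if name.as_str() == "text" {
                    Some(element.text.as_str())
                } else {
                    element.attributes.get(name).map(String::as_str)
                };
                // A missing attribute never matches.
                haystack.map_or(false, |value| value.contains(needle.as_str()))
            }
            Predicate::And(left, right) => left.matches(element) && right.matches(element),
            Predicate::Or(left, right) => left.matches(element) || right.matches(element),
        }
    }
}
```

A CSS-style `*=` operator would only change the parsing side; it could produce the same `Contains` variant, so the evaluation code would stay as it is.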

BTW, I'll take a look at the new nom branch, happy to help if needed :)

James-LG commented 6 months ago

> BTW, I'll take a look at the new nom branch, happy to help if needed :)

I'd like to get the basic use cases working first, so the structure is a bit more settled than it currently is, but after that, help with things like the contains function would be great! XPath is pretty huge, so there's lots of work that can happen in parallel once the basics are in place.

masc-it commented 6 months ago

Clear! Do you have a roadmap in place? Or a Discord channel for posting updates?

James-LG commented 6 months ago

Since GitHub apparently doesn't have direct messaging, I created a brand new Discord channel: https://discord.gg/jWK42bWK

As for a roadmap, I don't have anything formal. Roughly, the plan is to get the basic steps / and // working first (including initial occurrences, which behave differently), then filtered expressions like /div[@class='hi'], and go from there.
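As a rough illustration of the difference between the two basic steps, the sketch below evaluates `/name` and `//name` relative to a context node on a toy tree. `Node`, `child_step`, and `descendant_step` are hypothetical names and do not reflect the nom branch's actual design. The special handling of a leading / or // is not modelled here.

```rust
/// Hypothetical minimal tree node, for illustration only.
pub struct Node {
    pub name: String,
    pub children: Vec<Node>,
}

/// `/name` relative to `context`: only direct children named `name`.
pub fn child_step<'a>(context: &'a Node, name: &str) -> Vec<&'a Node> {
    context
        .children
        .iter()
        .filter(|child| child.name == name)
        .collect()
}

/// `//name` relative to `context`: every descendant named `name`,
/// collected in document (pre-order) order.
pub fn descendant_step<'a>(context: &'a Node, name: &str) -> Vec<&'a Node> {
    let mut found = Vec::new();
    for child in &context.children {
        if child.name == name {
            found.push(child);
        }
        // Recurse so that matches at any depth are included.
        found.extend(descendant_step(child, name));
    }
    found
}
```

On a tree where a div contains another div, `child_step` from the root finds only the outer div, while `descendant_step` finds both.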