fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

nested selector gives redundant result #28

Closed debug-ito closed 8 years ago

debug-ito commented 8 years ago

With scalpel-0.3.0.1, I ran the following.

import Text.HTML.Scalpel (Scraper, attrs, (//), scrapeStringLike)

nestedDivs :: String
nestedDivs = "<div id=\"outer\"><div id=\"inner\">inner text</div></div>"

idScraper :: Scraper String [String]
idScraper = attrs "id" ("div" // "div")

main :: IO ()
main = do
  print $ scrapeStringLike nestedDivs idScraper

and got the result:

Just ["outer","inner","inner"]

but I had expected:

Just ["inner"]

I don't understand why I got that result. Is it a bug, or is it expected behavior?

fimad commented 8 years ago

Hrm, I think there are two bugs happening here.

Selectors work by keeping a cursor on a pair of matching opening and closing tags, starting at the root tag. When you apply a selector, it first looks at the current tag. If the current tag does not match, the cursor shrinks to the next opening/closing tag pair. This happens recursively until a match is found. If you run out of tags, it backtracks to siblings.

Right now the cursor only descends when a match fails, which is why "div" // "div" unintuitively matches the outer div. I think the fix here is that // should probably force a descent.

I think inner shows up twice because of the backtracking above. There are two ways that inner can satisfy the selector (outer // inner, and inner // inner), so it appears twice in the result. We should probably be de-duping the matches before surfacing them to the user.
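
To make the two effects concrete, here is a minimal sketch that models the behavior described above on a toy tree. It is not scalpel's implementation: the Node type and the selectLoose and selectStrict functions are hypothetical names made up for illustration, and a selector is simplified to a plain list of tag names. It depends only on base.

```haskell
import Data.List (nub)

-- Toy document tree (hypothetical; not scalpel's internal representation).
data Node = Node
  { tagName  :: String
  , nodeId   :: String
  , children :: [Node]
  } deriving (Eq, Show)

-- The document from the report:
-- <div id="outer"><div id="inner">inner text</div></div>
nestedDivs :: Node
nestedDivs = Node "div" "outer" [Node "div" "inner" []]

-- All strict descendants of a node, in document order.
descendants :: Node -> [Node]
descendants n = concatMap selfOrDescendants (children n)

-- The node itself followed by all of its descendants.
selfOrDescendants :: Node -> [Node]
selfOrDescendants n = n : descendants n

-- Models the reported behavior: after a step matches, the next step may
-- match the very same node, and every way of matching is reported.
selectLoose :: [String] -> Node -> [Node]
selectLoose []       n = [n]
selectLoose (t : ts) n =
  [ r | m <- selfOrDescendants n, tagName m == t, r <- selectLoose ts m ]

-- Models the proposed fix: each step after the first only considers strict
-- descendants of the previous match, and duplicate matches are dropped.
-- Assumption: nub compares nodes structurally, which is enough for this toy
-- tree; a real implementation would de-dupe by position in the document.
selectStrict :: [String] -> Node -> [Node]
selectStrict tags root = nub (go tags (selfOrDescendants root))
  where
    matching t = filter ((== t) . tagName)
    go []       _          = []
    go [t]      candidates = matching t candidates
    go (t : ts) candidates =
      go ts (concatMap descendants (matching t candidates))

main :: IO ()
main = do
  print (map nodeId (selectLoose  ["div", "div"] nestedDivs))
  print (map nodeId (selectStrict ["div", "div"] nestedDivs))
```

With the loose version the three matches are outer // outer, outer // inner and inner // inner, so main first prints ["outer","inner","inner"]. With the forced descent only outer // inner remains, so the second line is ["inner"]. The de-duping matters once the nesting is three levels deep, where the innermost div can be reached from two different ancestors.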

debug-ito commented 8 years ago

Thanks for explaining!

Personally, I'm not in a hurry for the fix.