fimad / scalpel

A high level web scraping library for Haskell.
Apache License 2.0

Allow selecting bare text nodes #70

Closed by fimad 5 years ago

fimad commented 5 years ago

Unfortunately, I don't think position would help with that example, since there is currently no way to select bare text nodes. One of the assumptions scalpel makes is that anything you'd want to select is between <tags>.

It's also not immediately clear how to expose bare text selection in a way that would be backwards compatible. My current thinking is to create an additional value for SelectNode for text nodes. That would let you do something like the following to grab the second text node under an <h2>:

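-- Needs guard from Control.Monad; textSelector is the proposed selector.
chroot "h2" $
  chroots textSelector $ do
    p <- position
    -- position is zero-based, so 1 is the second text node
    guard (p == 1)
    text textSelector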

With an API like the one proposed in #21, you could do something even snazzier, like text ("h2" /// textSelector), to grab just the text nodes that are direct children of the <h2>.
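For anyone landing here later, this is a rough sketch of how the two ideas above might look against a released scalpel. It assumes scalpel 0.6.0 or newer, where I believe textSelector and atDepth are exported from Text.HTML.Scalpel; the function names secondTextNode and directTextNodes are made up for illustration, and atDepth stands in for the /// operator sketched in #21.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (guard)
import Text.HTML.Scalpel

-- Hypothetical helper: keep only the text node at zero-based position 1
-- among all text nodes found under an <h2>.
secondTextNode :: String -> Maybe [String]
secondTextNode html =
  scrapeStringLike html $
    chroot "h2" $
      chroots textSelector $ do
        p <- position
        guard (p == 1)
        text textSelector

-- Hypothetical helper: text nodes exactly one level below an <h2>,
-- using atDepth (assumed available in scalpel >= 0.6.0).
directTextNodes :: String -> Maybe [String]
directTextNodes html =
  scrapeStringLike html $
    texts ("h2" // (textSelector `atDepth` 1))
```

Note that the first helper counts every text node under the <h2>, including ones nested inside child tags, while the second is restricted to direct children. Trying both on something like "<h2>first<b>bold</b>second</h2>" should make the difference visible.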

The potential issue here, though, is that allowing selection of bare text nodes would create a breaking change in the behavior of anySelector. For example, scrapeStringLike "<a>text</a>" $ texts anySelector currently returns Just ["text"], but if we treated each text node as selectable, it would return Just ["text", "text"].
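A quick way to check which behavior a given scalpel version has is to run the same input through a few selectors and compare. This is only a sketch: it assumes a scalpel version that exports textSelector (0.6.0 or newer, as far as I know), and the names anyTexts, bareTexts and tagTexts are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel

-- Everything anySelector matches, rendered as text.
anyTexts :: String -> Maybe [String]
anyTexts html = scrapeStringLike html (texts anySelector)

-- Only bare text nodes (assumes textSelector is available).
bareTexts :: String -> Maybe [String]
bareTexts html = scrapeStringLike html (texts textSelector)

-- Only <a> tags, which is unaffected by text node selection.
tagTexts :: String -> Maybe [String]
tagTexts html = scrapeStringLike html (texts "a")
```

Printing anyTexts "<a>text</a>" next to tagTexts "<a>text</a>" in GHCi shows whether anySelector also picks up the text node in the installed version.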

This might be an OK breaking change, though, since I think the main use of anySelector is to select the current root node in a chroot block, like the examples in the README.

Originally posted by @fimad in https://github.com/fimad/scalpel/issues/48#issuecomment-462009620

fimad commented 5 years ago

This has been fixed in release 0.6.0.