aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

parse $ on classes list? #14

Closed: liweijian closed this 7 years ago

liweijian commented 7 years ago

According to the README, we can get the text of an element with a given class by

"<p class='Hello'>World!</p>" |> parse $ ".Hello" |> R.leaf_text;;
- : string = "World!"

I was wondering how to get the text when the element has a list of classes. For example:

"<p class='Hello Hey'>World!</p>"

aantron commented 7 years ago

Hi,

There are at least two options, depending on what you want.

  1. attribute or R.attribute gives you the raw text of the class attribute:

    "<p class='Hello Hey'>World!</p>" |> parse $ "p" |> R.attribute "class";;
    - : string = "Hello Hey"

    The difference between the two is that attribute returns an option, so it will be Some "Hello Hey" above, and None if the attribute is absent; while R.attribute will throw an exception if the attribute is missing (R stands for "require").

  2. classes gives you the list of classes found in the class attribute:

    "<p class='Hello Hey'>World!</p>" |> parse $ "p" |> classes;;
    - : string list = ["Hello"; "Hey"]
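
A minimal sketch of how the two options might be wrapped for everyday use. `has_class` and `class_or_empty` are hypothetical helper names made up for illustration; only `attribute` and `classes` come from Lambda Soup.

```ocaml
open Soup

(* Hypothetical helper: true if the element's class list contains [cls].
   Uses classes, which splits the class attribute into a string list. *)
let has_class cls el = List.mem cls (classes el)

(* Hypothetical helper: the raw class attribute, or "" when it is absent.
   Uses attribute, which returns an option instead of raising. *)
let class_or_empty el =
  match attribute "class" el with
  | Some value -> value
  | None -> ""
```

Matching on the option keeps the "attribute is missing" case explicit, which is the main reason to prefer `attribute` over `R.attribute` when the markup is not under your control.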
liweijian commented 7 years ago

Thank you for the quick reply. What I actually want is to get the text of a p element by class in a large HTML document.

I eventually found the answer:

"<p class='Hello Hey'>World!</p>" |> parse $ ".Hello.Hey" |> R.leaf_text;;

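For a larger document, a rough sketch of the same compound selector applied to every match rather than only the first. The `sample` markup and the name `leaf_texts_with_both` are assumptions for illustration; `$$`, `to_list` and `R.leaf_text` are Lambda Soup functions.

```ocaml
open Soup

(* Made-up sample document with several <p> elements. *)
let sample = {|
<div>
  <p class="Hello">first</p>
  <p class="Hello Hey">second</p>
  <p class="Hey Hello extra">third</p>
</div>
|}

(* Leaf text of every element carrying both classes, in document order.
   ".Hello.Hey" requires both classes; their order in the attribute
   and any additional classes do not matter. *)
let leaf_texts_with_both soup =
  soup $$ ".Hello.Hey" |> to_list |> List.map R.leaf_text
```

Note that `R.leaf_text` raises if an element has no single leaf text, so this sketch assumes each matching `<p>` contains plain text only.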
aantron commented 7 years ago

Ah, yes, I see what you mean now :)

The only thing I would add is that if your <p> element can have child elements, you may want to do

(* ... *) $ ".Hello.Hey" |> texts |> String.concat ""

http://aantron.github.io/lambda-soup/#VALtexts
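
A small self-contained sketch of that suggestion, assuming a `<p>` with a nested `<b>` element. The function name `full_text` is hypothetical; `texts` is the library function linked above.

```ocaml
open Soup

(* Concatenate all text in the subtree of the first element matching
   [selector], in document order. Raises if nothing matches, because
   the $ operator requires a match. *)
let full_text selector soup =
  soup $ selector |> texts |> String.concat ""

(* Example input where the <p> has a child element. *)
let greeting =
  full_text ".Hello.Hey" (parse "<p class='Hello Hey'>Hello, <b>World</b>!</p>")
```

Here `greeting` should be `"Hello, World!"`, since `texts` collects the text nodes inside and around the `<b>` element.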

I'm actually not certain leaf_text is a good idea to even have in the API, but it's there...

liweijian commented 7 years ago

@aantron

Sorry to bother you again. How can I use Lambda Soup to do the equivalent of getElementById()?

<div id='one'> 11</div><div id='two'> 22 </div>

For example, how do I get the text of the element with a specific id?

aantron commented 7 years ago

No worries, it's no bother :)

You can do it like this:

let soup = parse "<div id='one'> 11</div><div id='two'> 22 </div>" in
soup $ "#two" |> R.leaf_text

(I've split the code into two lines, compared to before.) This gives

- : string = " 22 "
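
If the id might be missing, a possible getElementById-style wrapper could look like the sketch below. `text_by_id` is a hypothetical name, and trimming the result is an assumption about what is wanted; `$?` is the option-returning variant of `$` in Lambda Soup.

```ocaml
open Soup

(* Hypothetical helper: trimmed text of the element with id [elem_id],
   or None when no such element exists. Assumes the id contains no
   characters that would need escaping in a CSS selector. *)
let text_by_id elem_id soup =
  match soup $? ("#" ^ elem_id) with
  | None -> None
  | Some el -> Some (texts el |> String.concat "" |> String.trim)
```

With the two-div document above, `text_by_id "two"` should give `Some "22"`, and an unknown id gives `None` instead of raising.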

In general, you may want to refer to the list of CSS selectors, whether here, or in your favorite CSS tutorial :)