aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml
https://aantron.github.io/lambdasoup
MIT License

Should texts function include script tags? #38

Closed: mooreryan closed this issue 3 years ago

mooreryan commented 3 years ago

Hi aantron, I was parsing some HTML and got a result I found interesting: the contents of <script> tags are included in the output of the texts functions. I can see why, since the script is text after all, but I was wondering whether this is the intended behavior.

Just to make sure I didn't make any mistakes (and to show you what I mean), I wrote these small tests, which pass when added to the lambdasoup test suite:

    ( "texts-just-script-tags" >:: fun _ ->
      let soup = "<script>1 + 1</script>" |> parse in
      assert_equal (texts soup) [ "1 + 1" ] );

    ( "texts-script-tags" >:: fun _ ->
      let soup =
        "<article><div><p>hi</p></div><script>1 + 1</script></article>"
        |> parse
      in
      assert_equal (texts soup) [ "hi"; "1 + 1" ] );

Anyway, just wondering if this is the intended behavior. If so, I suppose the easiest way to exclude script contents would be to filter out <script> tags before using the texts functions? Thanks!!!

aantron commented 3 years ago

This is the intended behavior so far. Perhaps you can delete <script> tags before calling texts with:

    Soup.iter Soup.delete (soup $$ "script")

mooreryan commented 3 years ago

Sounds good, thanks! I will close the issue now.
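
For reference, the workaround suggested above can be wrapped in a small helper. This is only a minimal sketch: `texts_without_scripts` is a hypothetical name and not part of Lambda Soup, and it assumes the `lambdasoup` package is installed. It relies only on the calls already shown in this thread (`parse`, `texts`, `iter`, `delete`, `$$`). Note that `delete` mutates the parsed tree in place, so the `<script>` elements are gone from `soup` afterwards.

```ocaml
(* Assumes the lambdasoup library is available (e.g. via opam). *)
open Soup

(* Hypothetical helper, not part of Lambda Soup: removes every <script>
   element from the tree in place, then collects the remaining text nodes. *)
let texts_without_scripts soup =
  iter delete (soup $$ "script");
  texts soup

let () =
  let soup =
    parse "<article><div><p>hi</p></div><script>1 + 1</script></article>"
  in
  (* Print each remaining text node on its own line. *)
  texts_without_scripts soup |> List.iter print_endline
```

If the original tree has to stay intact, one option is to parse the HTML a second time and run the helper on that copy.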