egonSchiele / HandsomeSoup

Easy HTML parsing for Haskell
http://egonschiele.github.com/HandsomeSoup
BSD 3-Clause "New" or "Revised" License

The underscore in css class selector #8

Open nikita-volkov opened 11 years ago

nikita-volkov commented 11 years ago

Queries like css ".some_class" produce no results when in fact there are elements with that class.

nikita-volkov commented 11 years ago

Just noticed that it isn't consistently reproducible.

egonSchiele commented 11 years ago

Can you give me an example where this issue occurred? With my test file in tests/test.html, the selector works just fine.

nikita-volkov commented 11 years ago

Here's an example of multiple correct selectors which produce no results:

main = do 
  print $ runLA (hread >>> tags //> getText) $ html
  where
    tags = css "div .fl_r" <+> css ".info .audio_add_wrap" <+> css ".fl_l .fl_l" <+> css "div.title_wrap"
    html = "<div class=\"audio  fl_l\" id=\"audio126257070_150084772\" onmouseover=\"addClass(this, 'over');\" onmouseout=\"removeClass(this, 'over');\">  <a name=\"126257070_150084772\"></a>  <div class=\"area clear_fix\" onclick=\"if (cur.cancelClick){ cur.cancelClick = false; return false;} playAudioNew('126257070_150084772')\">    <div class=\"play_btn fl_l\">      <div class=\"play_btn_wrap\"><div class=\"play_new\" id=\"play126257070_150084772\"></div></div>      <input type=\"hidden\" id=\"audio_info126257070_150084772\" value=\"http://cs1-4.userapi.com/d33/859470469ef948.mp3,221\" />    </div>    <div class=\"info fl_l\">      <div class=\"title_wrap fl_l\" onmouseover=\"setTitle(this);\"><b><a href=\"/search?c[q]=Michel%20Tel%5C%F3&c[section]=audio\" onclick=\"if (checkEvent(event)) { event.cancelBubble = true; return}; Audio.selectPerformer(event, 'Michel Tel\&#243;'); return false\">Michel Tel&#243;</a></b> &ndash; <span class=\"title\"><a href=\"\" onclick=\"Audio.showLyrics('126257070_150084772',24208242,1); return cancelEvent(event);\">Bara Bar&#225; Bere Ber&#234;</a> </span><span class=\"user\" onclick=\"event.cancelBubble = true;\"></span></div>      <div class=\"actions\">        <div class=\"audio_add_wrap fl_r\" onmouseover=\"Audio.rowActive(this, 'Добавить в мои аудиозаписи', [9, 5, 0]);\" onmouseout=\"Audio.rowInactive(this);\" onclick=\"Audio.addShareAudio(this, 150084772, 126257070, 'a74352b70e39439b99', 0, 1); return cancelEvent(event);\">  <div class=\"audio_add\"></div></div>      </div>      <div class=\"duration fl_r\">3:41</div>    </div>  </div>  <div id=\"lyrics126257070_150084772\" class=\"lyrics\" nosorthandle=\"1\"></div></div>"

egonSchiele commented 11 years ago

Here's what I get:

["  ","3:41","  ","3:41","  ","3:41","  ","  ","      ","      ","    ","      ","Michel Tel"," "," ","Bara Bar"," Bere Ber"," ","      ","        ","  ","      ","      ","3:41","    ","Michel Tel"," "," ","Bara Bar"," Bere Ber"," ","Michel Tel"," "," ","Bara Bar"," Bere Ber"," ","Michel Tel"," "," ","Bara Bar"," Bere Ber"," "]

What versions of GHC and HXT do you have, and what platform are you on? Does upgrading ghc / hxt fix the issue?

nikita-volkov commented 11 years ago

Strange. It is consistently reproducible on both of my machines: OS X 10.8.2 with GHC 7.4.2 and HXT 9.3.0.1 (upgraded to 9.3.1.1), and Ubuntu 12.10 (64-bit, AMD processor) with GHC 7.4.2 and HXT 9.3.1.1.

Just in case, the import statements are:

import Text.XML.HXT.Core
import Text.HandsomeSoup

Could it be an issue related to locale, UTF encoding, or special symbols?

I must note that most other selectors work fine.

egonSchiele commented 11 years ago

Does it work for you with the special symbols removed? Or does the following work for you?

runX $ parseHtml html >>> multi (hasAttrValue "class" (elem "info" . words)) //> getText

That's the equivalent translation to pure HXT. And what happens when you add a multi in front of tags?

print $ runLA (hread >>> multi tags //> getText) $ html

I would expect this to give you duplicated text.

nikita-volkov commented 11 years ago

Adding multi in front of tags still produces an empty list.

Removing the ampersands from the HTML does not help.

Concerning your second question, I think you've presented a selector for css ".info", which works fine for me. The one not working is css ".info .audio_add_wrap". I tried the following HXT translation of it, and it worked fine:

runX $ parseHtml html >>> multi (hasAttrValue "class" (elem "info" . words)) >>> multi (hasAttrValue "class" (elem "audio_add_wrap" . words)) //> getText

So I guess the problem is still somewhere in HandsomeSoup.

pooya-raz commented 11 years ago

I think I'm having similar problems, but with hyphens. This doesn't work:

links <- runX $ doc >>> css ".item-result"

but this works:

links <- runX $ doc >>> css "div" >>> hasAttrValue "class" (=="item-result")

And this is the html source I'm parsing: http://pastebin.com/HNmRvFC6
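
The two class predicates used in this thread behave differently, which matters when comparing a workaround against the css selector. Below is a minimal sketch in plain Haskell (no HXT or HandsomeSoup required) of the predicates that are passed to hasAttrValue above. The names hasClassToken and hasClassExact are hypothetical and exist only for this illustration; the class strings are taken from the HTML posted earlier in the thread. It says nothing about how HandsomeSoup implements css internally.

```haskell
module Main where

-- Hypothetical helper: token match, i.e. the predicate (elem cls . words)
-- from the pure-HXT translation. True when cls is one of the
-- whitespace-separated names in the class attribute value.
hasClassToken :: String -> String -> Bool
hasClassToken cls = elem cls . words

-- Hypothetical helper: exact match, i.e. the predicate (== cls) from the
-- hyphen workaround. True only when the whole attribute value equals cls.
hasClassExact :: String -> String -> Bool
hasClassExact cls = (== cls)

main :: IO ()
main = do
  -- "audio  fl_l" has two spaces; words still splits it into two names.
  print (hasClassToken "fl_l" "audio  fl_l")                 -- True
  print (hasClassExact "fl_l" "audio  fl_l")                 -- False
  -- A single-class attribute satisfies both predicates.
  print (hasClassToken "item-result" "item-result")          -- True
  print (hasClassExact "item-result" "item-result")          -- True
  -- Token match compares whole names, not prefixes.
  print (hasClassToken "audio" "audio_add_wrap fl_r")        -- False
```

So a workaround built on (== "item-result") only finds elements whose class attribute is exactly that string, while the words-based predicate also finds elements carrying additional classes.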

egonSchiele commented 11 years ago

I'll take a look.

ibotty commented 11 years ago

@pooster: are you running version 0.3.2 or the latest Hackage release, 0.3.1? Version 0.3.2 fixed a few bugs that look similar to yours.

egonSchiele commented 11 years ago

@pooster your example works for me with version 0.3.2 (it doesn't work with version 0.3.1).

egonSchiele commented 11 years ago

@nikita-volkov it's been a while, but could you verify that you're still having this issue? I don't see it.

nikita-volkov commented 11 years ago

In a couple of days. Yes.

On 17.08.2013 at 7:56, user "Aditya Bhargava" wrote:

@nikita-volkov it's been a while but could you verify that you're still having this issue. I don't see it.