moees closed this issue 9 years ago
Hi,
It seems the Xerces parser doesn't respect XHTML namespaces. I created a patch to solve this problem. Also, since the parser insists that (according to the HTML 4 spec) element names must be in upper case, the correct XPath is this: //*[@id='home_right']/DIV[1]/DIV[1]/H2/A
Please update to the new version of the plugin (0.0.3) and test again.
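To illustrate the upper-case point above, here is a minimal sketch using only the JDK's built-in XPath API. It is not the plugin's code: the class name `XPathCaseDemo`, the helper `countMatches`, and the `SAMPLE` markup are hypothetical, and the markup is a hand-written stand-in for a DOM whose element names are reported in upper case, not the real page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCaseDemo {

    // Hypothetical markup (assumption): element names are upper case,
    // mimicking what the HTML parser is said to produce.
    static final String SAMPLE =
        "<HTML><BODY><DIV id=\"home_right\"><DIV><DIV>"
        + "<H2><A>News</A></H2>"
        + "</DIV></DIV></DIV></BODY></HTML>";

    // Returns how many nodes the XPath expression selects in the given XML.
    static int countMatches(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // XPath name tests are case-sensitive, so only the second one matches.
        String lower = "//*[@id='home_right']/div[1]/div[1]/h2/a";
        String upper = "//*[@id='home_right']/DIV[1]/DIV[1]/H2/A";
        System.out.println("lower-case matches: " + countMatches(SAMPLE, lower));
        System.out.println("upper-case matches: " + countMatches(SAMPLE, upper));
    }
}
```

Against this sample the lower-case expression selects nothing and the upper-case one selects the single A element. A real page parsed through Nutch may differ, so treat this only as a way to check the case-sensitivity behaviour in isolation.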
Hi Taha, I have tested again with the new changes. I still get no fields. Can you tell me if it worked for you?
Yes, it worked for me. You can even see my test config here:
Thank you for the answer. It worked using UrlTester.java, but it doesn't work with Nutch. I have tested with Nutch 1.7 and Nutch 1.9. Do you have any idea where I should be digging to make it work with Nutch?
Thanks
Would you please add this line to log4j.properties inside the conf directory of Nutch: log4j.logger.ir.co.bayan=DEBUG
Then rerun Nutch and see what is reported after parsing. The line in the log should begin with this: Parsed document: ....
The missing dependency was added.
@moees Did you manage to get it working? I am having the same problem (Nutch 1.9).
Hi ChanderG, sorry for answering late. I was trying to get it working in Nutch 2, but I couldn't because the class ParseResults is not included in Nutch 2, and I didn't want to change the core source code. I may try to get it working in Nutch 1 if I don't find a solution.
Hi. I cannot extract data with the XPath engine. I tried with several sites and several pages. Here's the example I'm using: url: http://gsas.harvard.edu/ expr value="//*[@id='home_right']/div[1]/div[1]/h2/a"
It works with the CSS engine: expr value=".header>h2>a"
Can you please tell me what I'm doing wrong? Thanks in advance.