Extraction with XPath engine

BayanGroup / nutch-custom-search


Extraction with XPath engine #9

Closed moees closed 9 years ago

moees commented 9 years ago

Hi. I cannot extract data with the XPath engine. I have tried several sites and several pages. Here is the example I'm using: url: http://gsas.harvard.edu/ expr value="//*[@id='home_right']/div[1]/div[1]/h2/a"

It works with the CSS engine: expr value=".header>h2>a"

Can you please tell me what I'm doing wrong? Thanks in advance.

tahagh commented 9 years ago

Hi,

It seems the Xerces parser doesn't respect XHTML namespaces. I created a patch to solve this problem. Also, since the parser insists that (according to the HTML 4 spec) element names must be in upper case, the correct XPath is this: //*[@id='home_right']/DIV[1]/DIV[1]/H2/A
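To illustrate why the case of the element names matters, here is a minimal sketch using only the JDK's built-in XML and XPath classes. It is not the plugin's code: the class name `XPathCaseDemo`, the `SAMPLE` markup, and the `evaluate` helper are made up for this example, and the hand-written fragment only stands in for a DOM whose element names were upper-cased by the HTML parser. XPath name tests are case-sensitive, so the lower-case expression matches nothing against such a DOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathCaseDemo {
    // Hypothetical markup: mimics a DOM in which every element name is upper case.
    static final String SAMPLE =
        "<HTML><BODY><DIV id='home_right'><DIV><DIV>"
        + "<H2><A href='#'>News</A></H2>"
        + "</DIV></DIV></DIV></BODY></HTML>";

    // Parses the markup and returns the string value of the first node
    // matched by the expression, or an empty string if nothing matches.
    static String evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String upper = "//*[@id='home_right']/DIV[1]/DIV[1]/H2/A";
        String lower = "//*[@id='home_right']/div[1]/div[1]/h2/a";
        System.out.println("upper: [" + evaluate(SAMPLE, upper) + "]"); // upper: [News]
        System.out.println("lower: [" + evaluate(SAMPLE, lower) + "]"); // lower: []
    }
}
```

The `//*[@id='home_right']` step matches in both expressions because `*` ignores the element name; only the named steps after it are affected by case.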

Please update to the new version of the plugin (0.0.3) and test again.

moees commented 9 years ago

Hi Taha, I have tested again with the new changes. I still get no fields. Can you tell me whether it worked for you?

tahagh commented 9 years ago

Yes, it worked for me. You can even see my test config here:

https://github.com/BayanGroup/nutch-custom-search/blob/master/zal.extractor/src/test/resources/teste.xml

https://github.com/BayanGroup/nutch-custom-search/blob/master/zal.extractor/src/main/java/ir/co/bayan/simorq/zal/extractor/util/UrlTester.java

moees commented 9 years ago

Thank you for the answer. It worked using UrlTester.java, but it doesn't work with Nutch. I have tested with Nutch 1.7 and Nutch 1.9. Do you have any idea where I should be digging to make it work with Nutch?

Thanks

tahagh commented 9 years ago

Would you please add this line to log4j.properties inside the conf directory of Nutch: log4j.logger.ir.co.bayan=DEBUG

Then rerun Nutch and check what is reported after parsing. The relevant log line should begin with: Parsed document: ....

tahagh commented 9 years ago

The missing dependency was added.

ChanderG commented 9 years ago

@moees Did you manage to get it working? I am having the same problem (Nutch 1.9).

moees commented 9 years ago

Hi ChanderG, sorry for the late answer. I was trying to get it working in Nutch 2, but I couldn't because the class ParseResults is not included in Nutch 2, and I didn't want to change the core source code. If I don't find a solution, I may try to get it working in Nutch 1 instead.