code4craft / webmagic

A scalable web crawler framework for Java.
http://webmagic.io/
Apache License 2.0

Configurable Spider #106

Open code4craft opened 10 years ago

code4craft commented 10 years ago

Write a spider using a config file or a script.

Choices:

1. XML

<spider>
    <site>
        <charset>utf-8</charset>
        <user-agent></user-agent>
        <cookies>
            <cookie domain="" path="" name="" value="">
            </cookie>
        </cookies>
        <heads>
            <head name="" value=""/>
        </heads>
    </site>

    <startUrls>
        <url></url>
    </startUrls>

    <extraction targetUrl="" helpUrl="">
        <field name="title">
            <extractor type="xpath" value="//div[@class='title']"/>
        </field>
        <field name="content">
            <extractor type="xpath" value="//div[@class='content']"/>
        </field>
    </extraction>

</spider>
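A minimal sketch of how a config like the one above might be read, using only the JDK's built-in XML and XPath APIs. The class name `SpiderConfigReader` and its methods are hypothetical and are not part of WebMagic. The element names (`spider`, `site`, `charset`, `extraction`, `field`, `extractor`) follow the proposal above. The sketch only collects each field's XPath expression and does not run a crawl.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a WebMagic class.
class SpiderConfigReader {

    // Parses the XML text into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of <spider><site><charset>, or "" if it is absent.
    static String readCharset(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("/spider/site/charset", parse(xml)).trim();
    }

    // Returns field name -> XPath expression, in document order.
    // Only extractors with type="xpath" are collected in this sketch.
    static Map<String, String> readFieldExtractors(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList fields = (NodeList) xpath.evaluate(
                "/spider/extraction/field", parse(xml), XPathConstants.NODESET);
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            Element extractor = (Element) xpath.evaluate(
                    "extractor[@type='xpath']", field, XPathConstants.NODE);
            if (extractor != null) {
                result.put(field.getAttribute("name"), extractor.getAttribute("value"));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<spider>"
              + "<site><charset>utf-8</charset></site>"
              + "<extraction targetUrl=\"\" helpUrl=\"\">"
              + "<field name=\"title\">"
              + "<extractor type=\"xpath\" value=\"//div[@class='title']\"/>"
              + "</field>"
              + "<field name=\"content\">"
              + "<extractor type=\"xpath\" value=\"//div[@class='content']\"/>"
              + "</field>"
              + "</extraction>"
              + "</spider>";
        System.out.println(readCharset(xml));
        System.out.println(readFieldExtractors(xml));
    }
}
```

The resulting map could then be handed to whatever component applies the expressions to a downloaded page.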

2. JSON

3. YAML

4. JavaScript

var name=xpath("//h1[@class='entry-title public']/strong/a/text()")
var readme=xpath("//div[@id='readme']/tidyText()")
var star=xpath("//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()")

5. JRuby

name = xpath "//h1[@class='entry-title public']/strong/a/text()"
readme = xpath "//div[@id='readme']/tidyText()"
star = xpath "//ul[@class='pagehead-actions']/li[1]//a[@class='social-count js-social-count']/text()"
fork = xpath "//ul[@class='pagehead-actions']/li[2]//a[@class='social-count']/text()"

6. Java

Just write a PageProcessor and load it dynamically…
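One way the "load it dynamically" idea could look is sketched below using plain reflection. The `SimpleProcessor` interface and `TitleProcessor` class are hypothetical stand-ins invented for this example. They are not WebMagic's actual `PageProcessor`, and the sketch assumes the processor class is already on the classpath and has a public no-argument constructor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical loader, not a WebMagic class.
class DynamicProcessorLoader {

    // Hypothetical stand-in for a page processor: raw HTML in, extracted fields out.
    public interface SimpleProcessor {
        Map<String, String> process(String html);
    }

    // Example processor that extracts the text of the first <title> element.
    public static class TitleProcessor implements SimpleProcessor {
        private static final Pattern TITLE = Pattern.compile(
                "<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        @Override
        public Map<String, String> process(String html) {
            Map<String, String> fields = new HashMap<>();
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                fields.put("title", m.group(1).trim());
            }
            return fields;
        }
    }

    // Loads a processor by class name at runtime and instantiates it.
    // Throws ClassCastException if the class does not implement SimpleProcessor.
    static SimpleProcessor load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        return cls.asSubclass(SimpleProcessor.class)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In a configurable spider the class name would come from a config file.
        String className = TitleProcessor.class.getName();
        SimpleProcessor processor = load(className);
        System.out.println(processor.process("<html><title>Hello</title></html>"));
    }
}
```

Loading classes compiled at runtime or from an external jar would need a separate class loader, which this sketch leaves out.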

7. Groovy

8. Scala

sebastien-ma commented 10 years ago

Could Groovy or Scala be considered for this?

linkerlin commented 10 years ago

Groovy would be the better choice.