Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and sending it to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

how to crawl wordpress pages? #227

Closed: m-gorn closed this issue 8 years ago

m-gorn commented 8 years ago

Hi,

Is there any way to collect WordPress pages? I used the minimum example XML config, but got no results.

THX

essiembre commented 8 years ago

Yes. For web crawls, the collector does not care whether the content is hosted on WordPress or elsewhere; to it, those are just web pages. Do you have any errors? What are the logs saying? Can you please attach your config so I can reproduce?
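For reference, crawling a WordPress site uses the same configuration as any other site. Below is a minimal sketch modeled on the shipped minimum example, with the start URL swapped for the target site; the element names follow the 2.x configuration documentation, and the URL and output paths are placeholders rather than values from this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal 2.x-style config sketch; the URL and paths below are placeholders. -->
<httpcollector id="WordPress Crawl">
  <progressDir>./output/progress</progressDir>
  <logsDir>./output/logs</logsDir>
  <crawlers>
    <crawler id="wordpress-site">
      <!-- Point the crawler at the WordPress site's home page. -->
      <startURLs>
        <url>http://www.example.com/</url>
      </startURLs>
      <workDir>./output</workDir>
      <maxDepth>3</maxDepth>
      <!-- Store crawled documents on disk; replace with a search-engine committer if needed. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./output/crawledFiles</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
```

It can then be launched the same way as the examples, e.g. `./collector-http.sh -a start -c my-config.xml`.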

m-gorn commented 8 years ago

Hi,

I used this XML (the tags got stripped when I pasted it; these are the values):

```
<?xml version="1.0" encoding="UTF-8"?>
./examples-output/minimum/progress
./examples-output/minimum/logs
http://www.ehl.de
./examples-output/minimum
3
title,keywords,description,document.reference
./examples-output/minimum/crawledFiles
```

And this is the output:

```
./collector-http.sh -a start -c examples/minimum/minimum-config.xml
INFO [HttpCrawlerConfig] Link extractor loaded: GenericLinkExtractor[contentTypes={text/html,application/xhtml+xml,vnd.wap.xhtml+xml,x-asp},maxURLLength=2048,ignoreNofollow=false,ignoreExternalLinks=true,keepReferrerData=false,tagAttribs=ObservableMap [map={iframe=[src], frame=[src], a=[href], img=[src], meta=[http-equiv]}]]
INFO [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO [AbstractCollector] Version: Norconex HTTP Collector 2.3.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Collector Core 1.3.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Importer 2.4.0-SNAPSHOT (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex JEF 4.0.6 (Norconex Inc.)
INFO [AbstractCollector] Version: Norconex Committer Core 2.0.2 (Norconex Inc.)
INFO [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [JobSuite] Running ehl: BEGIN (Thu Feb 04 07:52:36 CET 2016)
INFO [MapDBCrawlDataStore] Initializing reference store ./examples-output/minimum/crawlstore/mapdb/ehl/
INFO [MapDBCrawlDataStore] ./examples-output/minimum/crawlstore/mapdb/ehl/: Done initializing databases.
INFO [HttpCrawler] ehl: RobotsTxt support: true
INFO [HttpCrawler] ehl: RobotsMeta support: true
INFO [HttpCrawler] ehl: Sitemap support: false
INFO [HttpCrawler] ehl: Canonical links support: true
INFO [CrawlerEventManager] REJECTED_FILTER: http://www.ehl.de
INFO [CrawlerEventManager] REJECTED_ROBOTS_TXT: http://www.ehl.de
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] ehl: Crawling references...
INFO [AbstractCrawler] ehl: Re-processing orphan references (if any)...
INFO [AbstractCrawler] ehl: Reprocessed 0 orphan references...
INFO [AbstractCrawler] ehl: Crawler finishing: committing documents.
INFO [AbstractCrawler] ehl: 0 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] ehl: Crawler completed.
INFO [AbstractCrawler] ehl: Crawler executed in 3 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: ./examples-output/minimum/crawlstore/mapdb/ehl/
INFO [JobSuite] Running ehl: END (Thu Feb 04 07:52:36 CET 2016)
```

AHHH, I guess it is the `INFO [CrawlerEventManager] REJECTED_ROBOTS_TXT: http://www.ehl.de`. Here is the robots.txt:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

That's it... grrr
essiembre commented 8 years ago

Glad you found the cause. You can ignore robots.txt by using this:

```xml
<robotsTxt ignore="true" />
```
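A hedged sketch of where that element sits, for anyone adapting the minimum config: `<robotsTxt>` goes inside the `<crawler>` section (or in `<crawlerDefaults>` to apply to all crawlers). The excerpt below reuses the crawler id and start URL from this thread; the surrounding element names follow the 2.x documentation.

```xml
<!-- Excerpt of a crawler section; robots.txt directives will no longer be honored. -->
<crawler id="ehl">
  <startURLs>
    <url>http://www.ehl.de</url>
  </startURLs>
  <!-- Skips downloading/applying robots.txt, so the REJECTED_ROBOTS_TXT rejection goes away. -->
  <robotsTxt ignore="true" />
</crawler>
```

Keep in mind this bypasses the site's crawl directives, so only use it on sites you are allowed to crawl that way.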