Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers

no result when using example #134

Closed jnleec closed 9 years ago

jnleec commented 9 years ago

I used the example (http://www.norconex.com/how-to-crawl-facebook/) to crawl Facebook, but I get the error below. I am using norconex-collector-http 2.0.2, and the start URL is "https://graph.facebook.com/v2.4/disney/posts?fields=from,picture,type,link,created_time,description". If no fields are specified, only message, created_time, and id are returned. The API version is v2.4.

[non-job]: 2015-08-06 15:51:37 INFO - Starting execution.
Facebook Posts: 2015-08-06 15:51:37 INFO - Running Facebook Posts: BEGIN (Thu Aug 06 15:51:37 CST 2015)
Facebook Posts: 2015-08-06 15:51:37 INFO - Initializing reference store F:\facebook-crawler/crawlstore/mapdb/Facebook_32_Posts/
Facebook Posts: 2015-08-06 15:51:39 INFO - F:\facebook-crawler/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.
Facebook Posts: 2015-08-06 15:51:39 INFO - Facebook Posts: RobotsTxt support: false
Facebook Posts: 2015-08-06 15:51:39 INFO - Facebook Posts: RobotsMeta support: false
Facebook Posts: 2015-08-06 15:51:39 INFO - Facebook Posts: Sitemap support: false
Facebook Posts: 2015-08-06 15:51:40 INFO -           CRAWLER_STARTED (Subject: com.norconex.collector.http.crawler.HttpCrawler@23f5b5dc)
Facebook Posts: 2015-08-06 15:51:40 INFO - Facebook Posts: Crawling references...
Facebook Posts: 2015-08-06 15:51:43 INFO -          DOCUMENT_FETCHED: https://graph.facebook.com/v2.4/disney/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg (Subject: com.cike.facebook.crawler.FacebookDocumentFetcher@77612365)
Facebook Posts: 2015-08-06 15:51:43 INFO -            URLS_EXTRACTED: https://graph.facebook.com/v2.4/disney/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg (Subject: [https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&limit=25&until=1438113605&__paging_token=enc_AdD39VszzQNMNn3xocgRpuybolmJeaMZAUzzZAO0KBIXn2kAU6kia2kIRS8IhItB5UKAXz6sZC6vzsY5ImrNWdij5G7])
Facebook Posts: 2015-08-06 15:51:43 ERROR - Facebook Posts: Could not process document: https://graph.facebook.com/v2.4/disney/posts?fields=from,picture,type,link,created_time,description (null)
java.lang.NullPointerException
    at org.apache.commons.io.IOUtils.toInputStream(IOUtils.java:1231)
    at com.norconex.commons.lang.io.CachedStreamFactory.newInputStream(CachedStreamFactory.java:121)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.createImportDocument(FacebookDocumentSplitter.java:112)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.splitApplicableDocument(FacebookDocumentSplitter.java:64)
    at com.norconex.importer.handler.splitter.AbstractDocumentSplitter.splitDocument(AbstractDocumentSplitter.java:62)
    at com.norconex.importer.Importer.splitDocument(Importer.java:510)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:342)
    at com.norconex.importer.Importer.importDocument(Importer.java:297)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:259)
    at com.norconex.importer.Importer.importDocument(Importer.java:192)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:479)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:375)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:628)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Facebook Posts: 2015-08-06 15:51:45 INFO -          DOCUMENT_FETCHED: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&limit=25&until=1438113605&__paging_token=enc_AdD39VszzQNMNn3xocgRpuybolmJeaMZAUzzZAO0KBIXn2kAU6kia2kIRS8IhItB5UKAXz6sZC6vzsY5ImrNWdij5G7 (Subject: com.cike.facebook.crawler.FacebookDocumentFetcher@77612365)
Facebook Posts: 2015-08-06 15:51:45 INFO -            URLS_EXTRACTED: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&limit=25&until=1438113605&__paging_token=enc_AdD39VszzQNMNn3xocgRpuybolmJeaMZAUzzZAO0KBIXn2kAU6kia2kIRS8IhItB5UKAXz6sZC6vzsY5ImrNWdij5G7 (Subject: [https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdDa3oeDZAUZCuMUk75KfGZCtnyJBQUWN2cD6d9KPGaScji7yp0gh9GB4ptzzUbe9VoIlIGdNZCqK1xJ6JoO60dnxGOH&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1437496105])
Facebook Posts: 2015-08-06 15:51:45 ERROR - Facebook Posts: Could not process document: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&limit=25&until=1438113605&__paging_token=enc_AdD39VszzQNMNn3xocgRpuybolmJeaMZAUzzZAO0KBIXn2kAU6kia2kIRS8IhItB5UKAXz6sZC6vzsY5ImrNWdij5G7 (null)
java.lang.NullPointerException
    at org.apache.commons.io.IOUtils.toInputStream(IOUtils.java:1231)
    at com.norconex.commons.lang.io.CachedStreamFactory.newInputStream(CachedStreamFactory.java:121)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.createImportDocument(FacebookDocumentSplitter.java:112)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.splitApplicableDocument(FacebookDocumentSplitter.java:64)
    at com.norconex.importer.handler.splitter.AbstractDocumentSplitter.splitDocument(AbstractDocumentSplitter.java:62)
    at com.norconex.importer.Importer.splitDocument(Importer.java:510)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:342)
    at com.norconex.importer.Importer.importDocument(Importer.java:297)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:259)
    at com.norconex.importer.Importer.importDocument(Importer.java:192)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:479)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:375)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:628)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Facebook Posts: 2015-08-06 15:51:45 INFO - Facebook Posts: 100% completed (2 processed/2 total)
Facebook Posts: 2015-08-06 15:51:47 INFO -          DOCUMENT_FETCHED: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdDa3oeDZAUZCuMUk75KfGZCtnyJBQUWN2cD6d9KPGaScji7yp0gh9GB4ptzzUbe9VoIlIGdNZCqK1xJ6JoO60dnxGOH&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1437496105 (Subject: com.cike.facebook.crawler.FacebookDocumentFetcher@77612365)
Facebook Posts: 2015-08-06 15:51:47 INFO -         REJECTED_TOO_DEEP: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdBmYC5xRUKqVXezOzZAY5P9QXflIr3lcd1wxfsOpmkEZACARYLHBGZCWnEx1xhrZCjEolkuYVxrwMGszh4t85U8Jcif&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1436902304 (Subject: 3)
Facebook Posts: 2015-08-06 15:51:47 INFO -            URLS_EXTRACTED: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdDa3oeDZAUZCuMUk75KfGZCtnyJBQUWN2cD6d9KPGaScji7yp0gh9GB4ptzzUbe9VoIlIGdNZCqK1xJ6JoO60dnxGOH&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1437496105 (Subject: [https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdBmYC5xRUKqVXezOzZAY5P9QXflIr3lcd1wxfsOpmkEZACARYLHBGZCWnEx1xhrZCjEolkuYVxrwMGszh4t85U8Jcif&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1436902304])
Facebook Posts: 2015-08-06 15:51:47 ERROR - Facebook Posts: Could not process document: https://graph.facebook.com/v2.4/11784025953/posts?fields=from,picture,type,link,created_time,description&limit=25&__paging_token=enc_AdDa3oeDZAUZCuMUk75KfGZCtnyJBQUWN2cD6d9KPGaScji7yp0gh9GB4ptzzUbe9VoIlIGdNZCqK1xJ6JoO60dnxGOH&access_token=1475518649431407%7CZUKVgCda4t8WobIFU2tYByKUDGg&until=1437496105 (null)
java.lang.NullPointerException
    at org.apache.commons.io.IOUtils.toInputStream(IOUtils.java:1231)
    at com.norconex.commons.lang.io.CachedStreamFactory.newInputStream(CachedStreamFactory.java:121)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.createImportDocument(FacebookDocumentSplitter.java:112)
    at com.cike.facebook.crawler.FacebookDocumentSplitter.splitApplicableDocument(FacebookDocumentSplitter.java:64)
    at com.norconex.importer.handler.splitter.AbstractDocumentSplitter.splitDocument(AbstractDocumentSplitter.java:62)
    at com.norconex.importer.Importer.splitDocument(Importer.java:510)
    at com.norconex.importer.Importer.executeHandlers(Importer.java:342)
    at com.norconex.importer.Importer.importDocument(Importer.java:297)
    at com.norconex.importer.Importer.doImportDocument(Importer.java:259)
    at com.norconex.importer.Importer.importDocument(Importer.java:192)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:35)
    at com.norconex.collector.core.pipeline.importer.ImportModuleStage.execute(ImportModuleStage.java:26)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:479)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:375)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:628)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: Deleting orphan references (if any)...
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: Deleted 0 orphan URLs...
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: Crawler finishing: committing documents.
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: 3 reference(s) processed.
Facebook Posts: 2015-08-06 15:51:47 INFO -          CRAWLER_FINISHED (Subject: com.norconex.collector.http.crawler.HttpCrawler@23f5b5dc)
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: Crawler completed.
Facebook Posts: 2015-08-06 15:51:47 INFO - Facebook Posts: Crawler executed in 10 seconds.
Facebook Posts: 2015-08-06 15:51:47 INFO - Closing reference store: F:\facebook-crawler/crawlstore/mapdb/Facebook_32_Posts/
Facebook Posts: 2015-08-06 15:51:48 INFO - Running Facebook Posts: END (Thu Aug 06 15:51:37 CST 2015)

essiembre commented 9 years ago

I just tried that same example from the blog with the same version of HTTP Collector (2.0.2), and I also tried with the latest snapshot. Both worked just fine with the code and configuration taken as is (with valid API keys specified). The blog example and the code that goes with it use version 2.2 of the Facebook Graph API (the latest at the time). Can you try with that version of the Graph API? If that also fails, can you share your config (stripping your secret values)?

If you need to use the 2.4 version of the API and it does not work for you, you may have to update the code given in the blog accordingly (or contact Norconex professional services for help). The JSON format returned may have changed.

jnleec commented 9 years ago

Thanks for your response. I have tried v2.2 of the API; however, the result is the same as with v2.4: it returned only three fields when I ran the example code with "https://graph.facebook.com/v2.2/disney/posts" as the start URL.

Here is my config file, thank you!

```
<?xml version="1.0" encoding="UTF-8"?>

#set($facebook = "com.cike.facebook.crawler")
#set($httpcollector = "com.norconex.collector.http")
#set($importer = "com.norconex.importer")
#set($importFilter = "${importer}.handler.filter.impl")
#set($importTagger = "${importer}.handler.tagger.impl")
#set($committerCore = "com.norconex.committer.core.impl")
#set($workdir = "F:\facebook-crawler")

${workdir}/progress
${workdir}/logs
2
2
10
$workdir
DELETE
false
https://graph.facebook.com/v2.4/disney/posts?fields=from,picture,type,link,created_time,description
***
***
^https://graph.facebook.com/v2.4/.*?/posts\W.*
${workdir}/crawledFiles
```

essiembre commented 9 years ago

The code that comes with the blog is not meant to be an all-purpose Facebook crawler without modifications. You have to adapt it to your needs. You get this error because the FacebookDocumentSplitter class expects the "message" field to be retrieved. If you add it to your list of fields, it runs without errors when I try it.
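As a minimal sketch of that change, the start URL from the posted config could list `message` alongside the other fields. This assumes the `<startURLs>`/`<url>` layout used by HTTP Collector 2.x configurations; the position of `message` in the list is arbitrary:

```xml
<!-- Sketch only: same start URL as in the posted config, with "message" added
     to the fields list so FacebookDocumentSplitter finds the field it expects. -->
<startURLs>
  <url>https://graph.facebook.com/v2.4/disney/posts?fields=message,from,picture,type,link,created_time,description</url>
</startURLs>
```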

Keep in mind that if you add Facebook fields not expected by the sample code, it won't do anything with them and they will be ignored. Also, document.reference is a field added by the collector, not a Facebook field. You can use the Facebook Graph API Explorer to find out what all the Facebook fields are.

jnleec commented 9 years ago

Thank you very much!

jnleec commented 9 years ago

Hi, how many documents can you get when using the Facebook crawler sample code? I only get a few; however, I have checked the "next" URL, and it works well.

INFO [AbstractCrawler] Facebook Posts: Maximum documents reached: 10
INFO [AbstractCrawler] Facebook Posts: Maximum documents reached: 10
INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)...
INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan URLs...
INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
INFO [AbstractCrawler] Facebook Posts: 52 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED (Subject: com.norconex.collector.http.crawler.HttpCrawler@7296c1fc)
INFO [AbstractCrawler] Facebook Posts: Crawler completed.
INFO [AbstractCrawler] Facebook Posts: Crawler executed in 4 seconds.

essiembre commented 9 years ago

I see the following limitations in your config:

  <maxDepth>2</maxDepth>
  <maxDocuments>10</maxDocuments>

Taking those off (or setting them to -1) should give you more.
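
For example, a sketch of the same two settings with the limits lifted (relying on the statement above that -1 disables each limit):

```xml
<!-- -1 lifts the cap; per the comment above, removing the elements
     entirely should have the same effect. -->
<maxDepth>-1</maxDepth>
<maxDocuments>-1</maxDocuments>
```

With no depth limit, the crawler should keep following the Graph API paging links (the ones logged as REJECTED_TOO_DEEP earlier) until they run out, so a page with a long history may take a while to crawl.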

jnleec commented 9 years ago

wow, thank you!