Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

How do I open mapdb files? #117

Closed bagheri471 closed 9 years ago

bagheri471 commented 9 years ago

How do I open mapdb files?

bagheri471 commented 9 years ago

How do I read from mapdb files?

essiembre commented 9 years ago

You would have to use the MapDB API to do so, available from: http://www.mapdb.org/
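For example, here is a minimal sketch of dumping a crawl store with the MapDB 1.x API that ships with the crawler. The file path and the "processedValid" map name are assumptions about the store's internal layout; inspect your own crawlstore folder and the names returned by db.getAll() to find the real ones.

```java
import java.io.File;
import java.util.Map;

import org.mapdb.DB;
import org.mapdb.DBMaker;

public class CrawlStoreDump {
    public static void main(String[] args) {
        // Open read-only so a crawl in progress cannot be corrupted.
        // Path is a placeholder -- point it at your crawlstore files.
        DB db = DBMaker
                .newFileDB(new File("/path/to/crawlstore/mapdb/MyCrawler/mapdb"))
                .readOnly()
                .make();
        // "processedValid" is assumed -- list db.getAll().keySet()
        // to see the actual named maps in the store.
        Map<String, Object> processed = db.getHashMap("processedValid");
        for (Map.Entry<String, Object> en : processed.entrySet()) {
            System.out.println(en.getKey() + " -> " + en.getValue());
        }
        db.close();
    }
}
```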

MapDB is meant to be accessed programmatically, though. If that is not what you are after, you can use a different storage system for URLs and related data with this tag in your config:

<crawlDataStoreFactory class="..." />

If you look at the bottom of the configuration documentation page, you will see other implementations available: MVStore, JDBC, and MongoDB.

So if, for instance, you would like to run queries or reports on crawled URLs, JDBC may be best. Keep in mind, though, that the JDBC implementation won't be nearly as fast, and it can even slow your crawler down considerably once you reach several thousand URLs (MapDB's performance is constant).

Based on what you would like to do, you can also consider implementing crawler event listeners to log the activities you are after with precision.
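For instance, a bare-bones listener that logs every event with the URL it concerns might look like the sketch below. It assumes the collector-core 1.x ICrawlerEventListener interface and event accessors; verify the exact names against the Javadoc of your version.

```java
import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

// Sketch: log every crawler event along with the reference (URL) it concerns.
public class UrlLoggingListener implements ICrawlerEventListener {
    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        // Some events (e.g., crawler start/stop) carry no crawl data.
        if (event.getCrawlData() != null) {
            System.out.println(event.getEventType()
                    + " -> " + event.getCrawlData().getReference());
        }
    }
}
```

You would then register the class under the crawler's listeners section of your XML configuration.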

bagheri471 commented 9 years ago

Dear Pascal Essiembre, thanks. Can I save my friends' profiles with this crawler?

essiembre commented 9 years ago

From a technical standpoint, you can do anything you like with the data. Legally, I would check the terms and conditions of the sites you are crawling. Can you please elaborate: what friends' profiles? Do you mean Facebook profiles? If so, maybe this article can help: http://www.norconex.com/how-to-crawl-facebook/

bagheri471 commented 9 years ago

I want to crawl the profiles and posts of users (or my friends) on Facebook. Can you help me?


essiembre commented 9 years ago

The link I gave you earlier is a tutorial that shows you how to do exactly that. Did you try it? Here it is again: http://www.norconex.com/how-to-crawl-facebook/

bagheri471 commented 9 years ago

Thanks. I used it and the results are in "C:\temp\facebook-crawler-example", but I don't know how to use the mapdb files at that location, or what to do with the crawled data.


essiembre commented 9 years ago

What you do with the crawl data is the job of a "Committer". You have to download one that suits your needs or create one yourself. See the list of existing ones here.

If you do not install one, one is already available: the FileSystemCommitter. It simply saves the collected information to flat files. You can read those yourself and do something with them, but it may be just as easy to write your own Committer.
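To give an idea, here is a rough sketch of reading that output back. The "-add.cntnt" content-file suffix and the directory are assumptions about the default layout, so check your own output folder before relying on them.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: print the extracted text of every document the
// FileSystemCommitter saved as a "-add.cntnt" flat file.
public class CommitterOutputDump {
    public static void main(String[] args) throws IOException {
        // Placeholder path -- point it at your committer output directory.
        Path dir = Paths.get("C:/temp/facebook-crawler-example");
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(p -> p.toString().endsWith("-add.cntnt"))
                 .forEach(p -> {
                     try {
                         System.out.println("=== " + p + " ===");
                         System.out.println(new String(
                                 Files.readAllBytes(p),
                                 StandardCharsets.UTF_8));
                     } catch (IOException e) {
                         e.printStackTrace();
                     }
                 });
        }
    }
}
```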

bagheri471 commented 9 years ago

Dear Pascal,

Thanks.

I ran this command:

collector-http.bat -a start -c facebook-crawler-2015-02-04/facebook-crawler/facebook-config.xml

Result:

```
WARN [KeepOnlyTagger] Configuring fields to keep via the "fields" attribute is now deprecated. Now use the <fields> element instead.
INFO [HttpCrawlerConfig] Link extractor loaded: com.norconex.blog.facebook.crawler.FacebookLinkExtractor@1ae2fd0
INFO [AbstractCollectorConfig] Configuration loaded: id=Facebook Collector; logsDir=/temp/facebook-crawler-example/logs; progressDir=/temp/facebook-crawler-example/progress
INFO [JobSuite] JEF work directory is: \temp\facebook-crawler-example\progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [JobSuite] Running Facebook Posts: BEGIN (Sun Jun 14 06:11:47 IRDT 2015)
INFO [MapDBCrawlDataStore] Initializing reference store \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
INFO [MapDBCrawlDataStore] \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.
INFO [HttpCrawler] Facebook Posts: RobotsTxt support: false
INFO [HttpCrawler] Facebook Posts: RobotsMeta support: false
INFO [HttpCrawler] Facebook Posts: Sitemap support: false
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Facebook Posts: Crawling references...
ERROR [AbstractCrawler] Facebook Posts: Could not process document: https://graph.facebook.com/v2.2/disney/posts?limit=10 (Could not stream URL: https://graph.facebook.com/oauth/access_token?client_id=4...5&client_secret=c...3&grant_type=client_credentials)
com.norconex.commons.lang.url.URLException: Could not stream URL: https://graph.facebook.com/oauth/access_token?client_id=4...5&client_secret=c...3&grant_type=client_credentials
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:174)
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:107)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:223)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:319)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:347)
    at com.norconex.blog.facebook.crawler.FacebookDocumentFetcher.ensureAccessToken(FacebookDocumentFetcher.java:86)
    at com.norconex.blog.facebook.crawler.FacebookDocumentFetcher.fetchDocument(FacebookDocumentFetcher.java:59)
    at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Permission denied: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at sun.security.ssl.SSLSocketImpl.connect(Unknown Source)
    at sun.security.ssl.BaseSSLSocketImpl.connect(Unknown Source)
    at sun.net.NetworkClient.doConnect(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.<init>(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.New(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
    at com.norconex.commons.lang.url.URLStreamer.responseInputStream(URLStreamer.java:370)
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:172)
    ... 17 more
INFO [AbstractCrawler] Facebook Posts: 100% completed (1 processed/1 total)
INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)...
INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan URLs...
INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Facebook Posts: Crawler completed.
INFO [AbstractCrawler] Facebook Posts: Crawler executed in 22 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
INFO [JobSuite] Running Facebook Posts: END (Sun Jun 14 06:11:47 IRDT 2015)
```

I get an error, and there are only 3 folders in "\temp\facebook-crawler-example": crawlstore, logs, progress.

essiembre commented 9 years ago

The URL being invoked on the Facebook graph API to get the access token does not seem to work properly. Have you tested your client id and client secret manually against the Facebook Graph API? Make sure you can get an access token with those when connecting manually before trying in the connector.
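For example, a quick manual test from the command line, substituting your real id and secret into the same endpoint your log shows:

curl "https://graph.facebook.com/oauth/access_token?client_id=YOUR_ID&client_secret=YOUR_SECRET&grant_type=client_credentials"

If that does not return an access token, the problem is with the app credentials or with network permissions rather than with the crawler (the "Permission denied: connect" in your stack trace often points to a proxy or firewall blocking outbound HTTPS).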

You can also try changing the log level to DEBUG in the classes/log4j.properties file in case it gives you more information.

bagheri471 commented 9 years ago

Thanks, the problem has been solved. I now have a file "1434350114406000000-add.cntnt" containing crawled data. How can I use it? I want to query it.

essiembre commented 9 years ago

What you do with the data is yours to decide. The HTTP Collector is a web crawler, not a search engine or query platform. To do something meaningful with the content, you normally would use a Committer to push the content to the repository of your choice (database, search engine, etc). For instance, a common scenario is to use the Solr Committer to push the data to a Solr instance and make queries on that search engine. You can also create your own Committer.
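For scale, creating one is a small amount of code. Here is a bare-bones sketch, assuming the committer-core ICommitter interface of that era (check its Javadoc for the exact signatures):

```java
import java.io.InputStream;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

// Sketch: a committer that just prints what it receives. A real one
// would batch the additions/deletions and push them to a database
// or search engine when commit() is called.
public class ConsoleCommitter implements ICommitter {
    @Override
    public void add(String reference, InputStream content,
            Properties metadata) {
        System.out.println("ADD: " + reference);
    }
    @Override
    public void remove(String reference, Properties metadata) {
        System.out.println("REMOVE: " + reference);
    }
    @Override
    public void commit() {
        // Flush any batched operations here.
    }
}
```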

bagheri471 commented 9 years ago

Thanks. I'm a PhD student working on the dissemination of information in social networks. I need to get posts and profile information from users on a social network and query it.


essiembre commented 9 years ago

Please create a new issue if you have additional questions.