How do I read from MapDB files?
You would have to use the MapDB API to do so, available from: http://www.mapdb.org/
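If you just want to peek at the crawl store yourself, here is a minimal sketch using the MapDB 1.x API. The file path is an assumption (point it at the main MapDB file under your crawlstore folder), and the store's internal collection names are not a public contract, so the sketch only lists what it finds:

import java.io.File;
import org.mapdb.DB;
import org.mapdb.DBMaker;

public class MapDbPeek {
    public static void main(String[] args) {
        // Assumed location; adjust to the actual MapDB file in your crawlstore folder.
        File dbFile = new File(
                "C:/temp/facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/mapdb");
        DB db = DBMaker.newFileDB(dbFile)
                .readOnly()   // open read-only so the crawler's data cannot be corrupted
                .make();
        // Print the named collections stored in the file.
        for (String name : db.getAll().keySet()) {
            System.out.println("Collection: " + name);
        }
        db.close();
    }
}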
MapDB is meant to be used programmatically, though. If that is not what you want, you can use a different storage system for URLs and related data with this tag in your config:
<crawlDataStoreFactory class="..." />
If you look at the bottom of the configuration documentation page, you will see other implementations available: MVStore, JDBC, and MongoDB.
So if, for instance, you would like to run queries/reports on crawled URLs, JDBC may be best. Keep in mind, though, that the JDBC implementation won't be nearly as fast, and can even slow your crawler down considerably once it reaches several thousand URLs (MapDB's performance stays constant).
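For illustration, switching stores is a one-line configuration change. The factory class below is my best recollection of the MVStore implementation in the 2.x line; verify the exact class name on the configuration documentation page for your version:

<crawlDataStoreFactory class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" />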
Based on what you would like to do, you can also consider implementing crawler event listeners to log the activities you are after with precision.
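As a sketch of that last option, a listener can be as small as the class below. The interface and method names are the collector-core 2.x API as I recall it; verify them against the Javadoc for your version, then register the class under <crawlerListeners> in your crawler configuration:

import com.norconex.collector.core.crawler.ICrawler;
import com.norconex.collector.core.crawler.event.CrawlerEvent;
import com.norconex.collector.core.crawler.event.ICrawlerEventListener;

// Prints every crawler event along with the reference (URL) it relates to.
public class LoggingEventListener implements ICrawlerEventListener {
    @Override
    public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
        Object subject = event.getCrawlData() == null
                ? "(no reference)" : event.getCrawlData().getReference();
        System.out.println(event.getEventType() + " -> " + subject);
    }
}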
Dear Pascal Essiembre, thanks. Can I save my friends' profiles with this crawler?
From a technical standpoint, you can do anything you like with the data. Legally, I would check the terms and conditions of the sites you are crawling. Can you please elaborate? Which friends' profiles? Do you mean Facebook profiles? If so, maybe this article can help.
I want to crawl profiles and posts of users (or my friends) on Facebook. Can you help me?
The link I gave you earlier is a tutorial that will show you how to do that. Did you try it? Here it is again: http://www.norconex.com/how-to-crawl-facebook/
Thanks. I used it and the results are in "C:\temp\facebook-crawler-example", but I don't know how to use the MapDB files in that directory, or how to use the crawled data.
What you do with the crawl data is the job of a "Committer". You have to download one that suits your needs or create one yourself. See the list of existing ones here.
If you do not install one, one is already available: the FileSystemCommitter. It simply saves the collected information to flat files. You can read those yourself and do something with them, but it may be just as easy to write your own Committer.
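To give an idea of the effort involved, a bare-bones custom Committer could look like the sketch below. It assumes the committer-core 2.x ICommitter interface; check the method signatures against your version:

import java.io.InputStream;
import com.norconex.commons.lang.map.Properties;
import com.norconex.committer.core.ICommitter;

// A trivial Committer that just prints what it receives; replace the
// print statements with calls to your database, search engine, etc.
public class ConsoleCommitter implements ICommitter {
    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        System.out.println("ADD: " + reference);
    }
    @Override
    public void remove(String reference, Properties metadata) {
        System.out.println("REMOVE: " + reference);
    }
    @Override
    public void commit() {
        System.out.println("COMMIT");
    }
}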
Dear Pascal,
Thanks.
I ran this command:
collector-http.bat -a start -c facebook-crawler-2015-02-04/facebook-crawler/facebook-config.xml
Result:
WARN [KeepOnlyTagger] Configuring fields to keep via the "fields" attribute is now deprecated. Now use the
INFO [HttpCrawlerConfig] Link extractor loaded: com.norconex.blog.facebook.crawler.FacebookLinkExtractor@1ae2fd0
INFO [AbstractCollectorConfig] Configuration loaded: id=Facebook Collector; logsDir=/temp/facebook-crawler-example/logs; progressDir=/temp/facebook-crawler-example/progress
INFO [JobSuite] JEF work directory is: \temp\facebook-crawler-example\progress
INFO [JobSuite] JEF log manager is : FileLogManager
INFO [JobSuite] JEF job status store is : FileJobStatusStore
INFO [AbstractCollector] Suite of 1 crawler jobs created.
INFO [JobSuite] Initialization...
INFO [JobSuite] Previous execution detected.
INFO [JobSuite] Backing up previous execution status and log files.
INFO [JobSuite] Starting execution.
INFO [JobSuite] Running Facebook Posts: BEGIN (Sun Jun 14 06:11:47 IRDT 2015)
INFO [MapDBCrawlDataStore] Initializing reference store \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
INFO [MapDBCrawlDataStore] \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/: Done initializing databases.
INFO [HttpCrawler] Facebook Posts: RobotsTxt support: false
INFO [HttpCrawler] Facebook Posts: RobotsMeta support: false
INFO [HttpCrawler] Facebook Posts: Sitemap support: false
INFO [CrawlerEventManager] CRAWLER_STARTED
INFO [AbstractCrawler] Facebook Posts: Crawling references...
ERROR [AbstractCrawler] Facebook Posts: Could not process document: https://graph.facebook.com/v2.2/disney/posts?limit=10 (Could not stream URL: https://graph.facebook.com/oauth/access_token?client_id=4...5&client_secret=c...3&grant_type=client_credentials)
com.norconex.commons.lang.url.URLException: Could not stream URL: https://graph.facebook.com/oauth/access_token?client_id=4...5&client_secret=c...3&grant_type=client_credentials
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:174)
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:107)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:223)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:319)
    at com.norconex.commons.lang.url.URLStreamer.streamToString(URLStreamer.java:347)
    at com.norconex.blog.facebook.crawler.FacebookDocumentFetcher.ensureAccessToken(FacebookDocumentFetcher.java:86)
    at com.norconex.blog.facebook.crawler.FacebookDocumentFetcher.fetchDocument(FacebookDocumentFetcher.java:59)
    at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DocumentFetcherStage.executeStage(HttpImporterPipeline.java:147)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Permission denied: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at sun.security.ssl.SSLSocketImpl.connect(Unknown Source)
    at sun.security.ssl.BaseSSLSocketImpl.connect(Unknown Source)
    at sun.net.NetworkClient.doConnect(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.<init>(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.New(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
    at com.norconex.commons.lang.url.URLStreamer.responseInputStream(URLStreamer.java:370)
    at com.norconex.commons.lang.url.URLStreamer.stream(URLStreamer.java:172)
    ... 17 more
INFO [AbstractCrawler] Facebook Posts: 100% completed (1 processed/1 total)
INFO [AbstractCrawler] Facebook Posts: Deleting orphan references (if any)...
INFO [AbstractCrawler] Facebook Posts: Deleted 0 orphan URLs...
INFO [AbstractCrawler] Facebook Posts: Crawler finishing: committing documents.
INFO [AbstractCrawler] Facebook Posts: 1 reference(s) processed.
INFO [CrawlerEventManager] CRAWLER_FINISHED
INFO [AbstractCrawler] Facebook Posts: Crawler completed.
INFO [AbstractCrawler] Facebook Posts: Crawler executed in 22 seconds.
INFO [MapDBCrawlDataStore] Closing reference store: \temp\facebook-crawler-example/crawlstore/mapdb/Facebook_32_Posts/
INFO [JobSuite] Running Facebook Posts: END (Sun Jun 14 06:11:47 IRDT 2015)
I got an error, and there are only 3 folders in "\temp\facebook-crawler-example": crawlstore, logs, progress.
The URL being invoked on the Facebook graph API to get the access token does not seem to work properly. Have you tested your client id and client secret manually against the Facebook Graph API? Make sure you can get an access token with those when connecting manually before trying in the connector.
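The "Caused by: java.net.SocketException: Permission denied: connect" in your trace usually means something on the machine (firewall, proxy, antivirus) is blocking the outbound connection rather than the credentials being wrong, so it is worth testing outside the crawler. A minimal standalone check, with YOUR_ID and YOUR_SECRET as placeholders for you to fill in, might be:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetches the token endpoint directly, bypassing the crawler entirely.
public class TokenCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://graph.facebook.com/oauth/access_token"
                + "?client_id=YOUR_ID&client_secret=YOUR_SECRET"
                + "&grant_type=client_credentials");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("HTTP " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            System.out.println(in.readLine()); // expect: access_token=...
        }
    }
}

If this fails the same way, a commonly reported workaround on Windows is to run Java with -Djava.net.preferIPv4Stack=true.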
You can also try to change the log level to DEBUG in the classes/log4j.properties file in case you get more information.
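For example, in classes/log4j.properties you would change the root level from INFO to DEBUG, keeping whatever appenders your copy of the file already lists (appender names vary by distribution):

log4j.rootLogger=DEBUG, CONSOLE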
Thanks, the problem has been solved. I have a file "1434350114406000000-add.cntnt" containing crawled data. How can I use it? I want to query it.
What you do with the data is yours to decide. The HTTP Collector is a web crawler, not a search engine or query platform. To do something meaningful with the content, you normally would use a Committer to push the content to the repository of your choice (database, search engine, etc). For instance, a common scenario is to use the Solr Committer to push the data to a Solr instance and make queries on that search engine. You can also create your own Committer.
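As an illustration of that Solr scenario, the committer section of the crawler configuration might look like this. The URL is a placeholder for your own Solr core, and the exact tag names should be checked against the Solr Committer documentation for your version:

<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8983/solr/mycollection</solrURL>
</committer>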
Thanks. I'm a PhD student working on the dissemination of information in social networks. I need to get posts and profile information from users in a social network and query it.
Please create a new issue if you have additional questions.
How do I open MapDB files?