Closed antrikss closed 9 years ago
@ahmadia any ideas?
@antrikss - Thanks for the report!
What version of Nutch are you running against? What's the output of the server? What happens when you run:
nt = Nutch()
Which tests from py.test pass/fail?
Hi @ahmadia, thanks for the prompt reply!
I'm using Nutch version 1.11-SNAPSHOT.
I'm sorry, I don't understand what you mean by the output of the server.
And when I run
nt = Nutch()
I get the following response
nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}
nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}
And I could work with the above nutch initialization, but my main aim was to give a different crawlId for my crawl.
And py.test passes all tests.
================================================================================ test session starts ================================================================================
platform darwin -- Python 2.7.10, pytest-2.8.2, py-1.4.30, pluggy-0.3.1
rootdir: /Users/Antrromet/Documents/USC/Fall2015/IR/nutch-python, inifile:
collected 15 items
test_nutch.py ...............
============================================================================ 15 passed in 21.84 seconds =============================================================================
Okay, cool. I haven't tried to do what you're doing before, so I'll need to take a look.
Just a note:
nt = Nutch()
calls GET /config/default twice, as you can see from the logs above, where this line appears two times:
nutch.py: GET Endpoint: /config/default
I'm not sure whether that is intended, but I wanted you to know that it's not a copy-paste error.
@antrromet - our documentation is out of date and needs to be updated. Refer to the test_nutch.py file for a complete tour of functionality. For now:
In [6]: Nutch?
Init signature: Nutch(self, confId='default', serverEndpoint='http://localhost:8081', raiseErrors=True, **args)
Docstring: <no docstring>
Init docstring:
Nutch client for interacting with a Nutch instance over its REST API.
Constructor:
nt = Nutch()
Optional arguments:
confID - The name of the default configuration file to use, by default: nutch.DefaultConfig
serverEndpoint - The location of the Nutch server, by default: nutch.DefaultServerEndpoint
raiseErrors - raise exceptions if server response is not 200
Provides functions:
server - getServerStatus, stopServer
config - get and set parameters for this configuration
job - get list of running jobs, get job metadata, stop/abort a job by id, and create a new job
To start a crawl job, use:
Crawl() - or use the methods inject, generate, fetch, parse, updatedb in that order.
To run a crawl in one method, use:
-- nt = Nutch()
-- response, status = nt.crawl()
To override a confId, you'd need to create a configuration first. To use default:
nt = Nutch('default')
To use a custom configuration:
nt = Nutch() # a little wonky, we assume a configuration for interacting with Nutch (default here)
nt.Configs().create('custom_conf', {override_param: here})
nt = Nutch('custom_conf')
Got it. Yes, this will work. But can you tell me a way to change the crawlId?
So, I tried the REST APIs given here for creating a job, where you can specify a parameter like
"crawlId":"crawl01"
These APIs seem to work perfectly fine when I try them in a REST client.
But can this be done using nutch-python?
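For reference, the kind of request body described above can be sketched as follows. This is only an illustration: the "crawlId" key comes from this thread, while the "type", "confId", and "args" keys and the helper name build_job_request are assumptions about the job-creation schema, so check them against the Nutch REST API documentation before relying on them.

```python
import json


def build_job_request(job_type, crawl_id, conf_id="default", args=None):
    """Assemble a JSON-serializable body for a job-creation request.

    Only "crawlId" is taken from the thread; "type", "confId", and "args"
    are assumed field names and may differ in your Nutch version.
    """
    return {
        "type": job_type,
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": dict(args or {}),
    }


# Build a request that tags the job with a custom crawl id.
payload = build_job_request("GENERATE", "crawl01")
print(json.dumps(payload, sort_keys=True))
```

Keeping the crawl id as an explicit argument makes it hard to forget when several jobs are meant to belong to the same crawl.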
@antrromet - Our documentation is seriously lacking here, but you can create a custom JobClient with:
nt = Nutch()
jc = nt.Jobs('your_crawl_id')
You can then use the job client to submit jobs, and use it with the CrawlClient as well. There are a few examples of using the job client in test.python.
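Putting the two lines above into a small helper, a minimal sketch might look like this. The import path nutch.nutch and the helper name job_client_for are assumptions; the serverEndpoint keyword and its default come from the Init signature quoted earlier in this thread, and the call needs a reachable Nutch server, since constructing Nutch() issues requests against it.

```python
def job_client_for(crawl_id, server_endpoint="http://localhost:8081"):
    """Return a job client bound to the given crawl id.

    The import is done lazily so this module loads even where
    nutch-python is not installed; the import path is an assumption.
    """
    from nutch.nutch import Nutch  # assumed import path for nutch-python

    nt = Nutch(serverEndpoint=server_endpoint)
    # Nutch.Jobs(crawl_id) is the call shown in the comment above.
    return nt.Jobs(crawl_id)
```

With a server running, jc = job_client_for('crawl01') would then give a client whose jobs carry that crawl id.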
@ahmadia Got it Aron, thanks a lot! Really appreciate your help.
No problem. We need to fix our documentation :(
@antrikss can you update our docs for this? Would you be willing to submit a PR?
@antrromet
@chrismattmann Sure thing. I'll look into it.
@ayberk if you have time, would appreciate a PR
See the wiki; I think this takes care of it.
I used the following command to initialize the Nutch object.
But it gave me the following error
Ideally, the above should have worked, because it should have used the default configuration and been able to find it. Unfortunately, it doesn't, and throws a KeyError. I even tried explicitly giving the default config (although it shouldn't matter, because it's the default param), but in vain.
The above gave me the following error.