chrismattmann / nutch-python

Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.
Apache License 2.0

Unable to initialize the Nutch object #4

Closed antrikss closed 9 years ago

antrikss commented 9 years ago

I used the following command to initialize the Nutch object.

nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')

But it gave me the following error:

nutch.py: GET Endpoint: /config/crawlTest
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 10:26:13 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {}
Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', urlDir='urls/', serverEndpoint='http://localhost:8081')
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 609, in __init__
  File "build/bdist.macosx-10.11-intel/egg/nutch/nutch.py", line 302, in __getitem__
KeyError

Ideally, the above should have worked: it should have fallen back to the default configuration and found it. Instead, it throws a KeyError.

I even tried passing the default config explicitly (although that shouldn't matter, since it's the default parameter value), but in vain.

nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')

The above gave me the following error.

Traceback (most recent call last):
  File "/Users/Antrromet/Documents/workspace/Nutch/test_nutch_python.py", line 3, in <module>
    nt = Nutch('crawlTest', confId='default', urlDir='urls/', serverEndpoint='http://localhost:8081')
TypeError: __init__() got multiple values for keyword argument 'confId'

chrismattmann commented 9 years ago

@ahmadia any ideas?

ahmadia commented 9 years ago

@antrikss - Thanks for the report!

What version of Nutch are you running against? What's the output of the server? What happens when you run:

nt = Nutch()

Which tests from py.test pass/fail?

antrikss commented 9 years ago

Hi @ahmadia, thanks for the prompt reply! I'm using Nutch version 1.11-SNAPSHOT. I'm sorry, but I don't understand what you mean by the output of the server. When I run nt = Nutch(), I get the following response:

nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}
nutch.py: GET Endpoint: /config/default
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'date': 'Fri, 09 Oct 2015 19:22:42 GMT', 'transfer-encoding': 'chunked', 'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {u'store.http.headers': u'false', u'fetcher.max.crawl.delay': u'30', u'anchorIndexingFilter.deduplicate': u'false', u'ha.health-monitor.sleep-after-disconnect.ms': u'1000', u'http.verbose': u'false', u'parser.character.encoding.default': u'windows-1252', u'file.crawl.parent': u'true', u'fs.client.resolve.remote.symlinks': u'true', u'tika.uppercase.element.names': u'true', u'hadoop.user.group.static.mapping.overrides': u'dr.who=;', u's3.blocksize': u'67108864', u'ftp.timeout': u'60000', u'headings': u'h1,h2', u'fetcher.threads.timeout.divisor': u'2', u'http.agent.version': u'Nutch-1.11-SNAPSHOT', u'fetcher.threads.per.queue': u'1', u'generate.min.interval': u'-1', u'fetcher.timelimit.mins': u'-1', u'ftp.stream-buffer-size': u'4096', u'hadoop.http.authentication.token.validity': u'36000', u'indexer.score.power': u'0.5', u'fetcher.queue.depth.multiplier': u'50', u'ftp.replication': u'3', u'urlfilter.prefix.file': u'prefix-urlfilter.txt', u'fetcher.bandwidth.target': u'-1', u'http.accept': u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', u'ha.health-monitor.check-interval.ms': u'1000', u'ipc.client.idlethreshold': u'4000', u'io.file.buffer.size': u'4096', u'ipc.server.tcpnodelay': u'false', u'hadoop.security.group.mapping.ldap.ssl': u'false', u's3native.bytes-per-checksum': u'512', u'io.mapfile.bloom.size': u'1048576', u'ftp.keep.connection': u'false', u'hadoop.security.authentication': u'simple', u'db.fetch.schedule.adaptive.sync_delta_rate': u'0.3', u'hadoop.http.authentication.kerberos.principal': u'HTTP/_HOST@LOCALHOST', u'hadoop.security.groups.cache.secs': u'300', u'db.fetch.schedule.adaptive.sync_delta': u'true', u'link.ignore.internal.domain': u'true', u'io.seqfile.sorter.recordlimit': u'1000000', u'hadoop.ssl.enabled': u'false', u'fetcher.server.min.delay': u'0.0', u'hadoop.security.group.mapping.ldap.search.filter.user': u'(&(objectClass=user)(sAMAccountName={0}))', u'fetcher.throughput.threshold.retries': u'5', 
u'parse.filter.urls': u'true', u'fs.s3n.multipart.uploads.block.size': u'67108864', u'selenium.take.screenshot': u'false', u'link.analyze.damping.factor': u'0.85f', u'fs.trash.checkpoint.interval': u'0', u's3.replication': u'3', u'encodingdetector.charset.min.confidence': u'-1', u'db.url.normalizers': u'false', u'generate.count.mode': u'host', u'metatags.names': u'description,keywords', u'db.ignore.external.links': u'false', u'solr.commit.size': u'250', u'hadoop.security.group.mapping.ldap.search.attr.member': u'member', u'parser.html.form.use_action': u'false', u's3native.stream-buffer-size': u'4096', u'mime.type.magic': u'true', u'selenium.grid.driver': u'firefox', u'indexer.skip.notmodified': u'false', u'hadoop.http.authentication.simple.anonymous.allowed': u'true', u'db.signature.class': u'org.apache.nutch.crawl.MD5Signature', u'hadoop.security.groups.cache.warn.after.ms': u'5000', u'file.stream-buffer-size': u'4096', u'crawl.gen.delay': u'604800000', u'hadoop.security.group.mapping.ldap.directory.search.timeout': u'10000', u'link.ignore.limit.domain': u'true', u'hadoop.security.group.mapping.ldap.search.attr.group.name': u'cn', u'db.update.additions.allowed': u'true', u'fs.ftp.host': u'0.0.0.0', u'net.topology.impl': u'org.apache.hadoop.net.NetworkTopology', u'hadoop.rpc.socket.factory.class.default': u'org.apache.hadoop.net.StandardSocketFactory', u'fetcher.max.exceptions.per.queue': u'-1', u'generate.update.crawldb': u'false', u's3.client-write-packet-size': u'65536', u'fs.s3.maxRetries': u'4', u'lang.extraction.policy': u'detect,identify', u'subcollection.default.fieldname': u'subcollection', u'solr.server.type': u'http', u'parse.normalize.urls': u'true', u'hadoop.util.hash.type': u'murmur', u'solr.mapping.file': u'solrindex-mapping.xml', u'db.injector.overwrite': u'false', u'generate.max.count': u'-1', u'db.fetch.schedule.adaptive.dec_rate': u'0.2', u'file.replication': u'1', u'fetcher.maxNum.threads': u'25', u'link.score.updater.clear.score': u'0.0f', 
u'file.content.ignored': u'true', u'io.seqfile.local.dir': u'${hadoop.tmp.dir}/io/local', u'hadoop.tmp.dir': u'/tmp/hadoop-${user.name}', u'hadoop.ssl.hostname.verifier': u'DEFAULT', u'link.delete.gone': u'false', u'selenium.hub.host': u'localhost', u'generate.min.score': u'0', u'io.skip.checksum.errors': u'false', u'ha.failover-controller.cli-check.rpc-timeout.ms': u'20000', u'fs.s3n.multipart.copy.block.size': u'5368709120', u'ipc.client.connect.timeout': u'20000', u'hadoop.security.authorization': u'false', u'fetcher.store.content': u'true', u'io.map.index.skip': u'0', u'ipc.client.tcpnodelay': u'false', u'fs.s3n.multipart.uploads.enabled': u'false', u'db.ignore.internal.links': u'true', u'urlfilter.automaton.file': u'automaton-urlfilter.txt', u'hadoop.security.group.mapping.ldap.search.filter.group': u'(objectClass=group)', u'hadoop.rpc.protection': u'authentication', u'fs.AbstractFileSystem.viewfs.impl': u'org.apache.hadoop.fs.viewfs.ViewFs', u'ftp.blocksize': u'67108864', u'hadoop.security.group.mapping': u'org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback', u'http.robot.rules.whitelist': 
u'4chan.org/k/,academy.com,accurateshooter.com,advanced-armanent.com,americanlisted.com,arguntrader.com,armslist.com,backpage.com,budsgunshop.com,buyusedguns.net,cabelas.com,cheaperthandirt.com,davidsonsinc.com,firearmlist.com,firearmslist.com,freeclassifieds.com,freegunclassifieds.com,freegunclaXssifieds.com,gandermountain.com,gunauction.com,gunbroker.com,gundeals.org,gunlistings.org,gunsamerica.com,gunsinternational.com,guntrader.com,hipointfirearmsforums.com,impactguns.com,iwanna.com,lionseek.com,midwestguntrader.com,nationalguntrader.com,nextechclassifieds.com/categories/sporting-goods/firearms,oodle.com,recycler.com,shooterswap.com,shooting.org,slickguns.com,wantaddigest.com,wikiarms.com/guns,abqjournal.com,alaskaslist.com,billingsthriftynickel.com,carolinabargaintrader.net,clasificadosphoenix.univision.com,classifiednc.com,classifieds.al.com,cologunmarket.com,comprayventadearms.com,dallasguns.com,elpasoguntrader.com,fhclassifieds.com,floridagunclassifieds.com,floridaguntrader.com,gowilkes.com,gunidaho.com,hawaiiguntrader.com,idahogunsforsale.com,iguntrade.com,jasonsguns.com,ksl.com,kyclassifieds.com,midutahradio.com/tradio,midwestgtrader.com,montanagunclassifieds.com,montanagunsforsale.com,mountaintrader.com,msguntrader.com,ncgunads.com,newmexicoguntrader.com,nextechclassifieds.com,sanjoseguntrader.com,tell-n-sell.com,tennesseegunexchange.com,theoutdoorstrader.com,tradesnsales.com,upstateguntrader.com,vci-classifieds.com,zidaho.com', u'urlnormalizer.loop.count': u'1', u'fetcher.throughput.threshold.pages': u'-1', u'http.store.responsetime': u'true', u'moreIndexingFilter.mapMimeTypes': u'false', u'db.signature.text_profile.min_token_len': u'2', u'db.score.link.external': u'1.0', u'rpc.metrics.quantile.enable': u'false', u'link.ignore.internal.host': u'true', u'ha.failover-controller.graceful-fence.rpc-timeout.ms': u'5000', u'fs.defaultFS': u'file:///', u'io.mapfile.bloom.error.rate': u'0.005', u'http.agent.rotate': u'true', u'http.agent.rotate.file': 
u'agent.names.txt', u'file.crawl.redirect_noncanonical': u'true', u'hadoop.http.staticuser.user': u'dr.who', u'fetcher.throughput.threshold.check.after': u'5', u'ha.zookeeper.acl': u'world:anyone:rwcda', u'mapreduce.fileoutputcommitter.marksuccessfuljobs': u'false', u'mimetype.filter.file': u'mimetype-filter.txt', u'index.static.fieldsep': u',', u'io.native.lib.available': u'true', u'fs.df.interval': u'60000', u'parser.skip.truncated': u'true', u'fs.AbstractFileSystem.file.impl': u'org.apache.hadoop.fs.local.LocalFs', u'db.max.outlinks.per.page': u'-1', u'urlfilter.domain.file': u'domain-urlfilter.txt', u'interactiveselenium.handlers': u'DefaultHandler', u's3native.client-write-packet-size': u'65536', u'partition.url.mode': u'byHost', u'libselenium.page.load.delay': u'3', u'selenium.driver': u'firefox', u'tfile.fs.input.buffer.size': u'262144', u'ha.failover-controller.new-active.rpc-timeout.ms': u'60000', u'db.max.inlinks': u'10000', u'parser.timeout': u'30', u'db.fetch.schedule.adaptive.inc_rate': u'0.4', u'db.max.anchor.length': u'100', u'solr.auth': u'false', u'scoring.depth.max': u'1000', u'tfile.fs.output.buffer.size': u'262144', u'headings.multivalued': u'false', u'ftp.follow.talk': u'false', u'urlnormalizer.order': u'org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer', u'db.fetch.interval.max': u'7776000', u'ipc.server.listen.queue.size': u'128', u's3.bytes-per-checksum': u'512', u'hadoop.ssl.server.conf': u'ssl-server.xml', u'link.analyze.num.iterations': u'10', u's3.stream-buffer-size': u'4096', u'elastic.max.bulk.size': u'2500500', u'parser.html.impl': u'neko', u'ipc.client.connect.max.retries.on.timeouts': u'45', u'fs.trash.interval': u'0', u'index.static.keysep': u':', u'solr.server.url': u'http://127.0.0.1:8983/solr/', u'db.signature.text_profile.quant_rate': u'0.01', u'indexer.add.domain': u'false', u'fs.AbstractFileSystem.hdfs.impl': u'org.apache.hadoop.fs.Hdfs', 
u'hadoop.common.configuration.version': u'0.23.0', u'fetcher.parse': u'false', u'http.timeout': u'10000', u'plugin.folders': u'plugins', u'http.accept.language': u'en-us,en-gb,en;q=0.7,*;q=0.3', u'fetcher.follow.outlinks.depth': u'-1', u'index.static.valuesep': u' ', u'ftp.bytes-per-checksum': u'512', u'ftp.username': u'anonymous', u'io.bytes.per.checksum': u'512', u'ipc.client.kill.max': u'10', u'index.parse.md': u'metatag.description,metatag.keywords', u'file.client-write-packet-size': u'65536', u'http.content.limit': u'10485760', u'ftp.password': u'anonymous@example.com', u'hadoop.job.history.user.location': u'${hadoop.log.dir}/history/user', u'indexer.max.content.length': u'-1', u'fetcher.server.delay': u'5.0', u'ha.zookeeper.parent-znode': u'/hadoop-ha', u'parse.plugin.file': u'parse-plugins.xml', u'link.ignore.limit.page': u'true', u'urlfilter.suffix.file': u'suffix-urlfilter.txt', u'hadoop.http.authentication.kerberos.keytab': u'${user.home}/hadoop.keytab', u'selenium.hub.path': u'/wd/hub', u'store.http.request': u'false', u'ipc.client.connect.max.retries': u'10', u'db.preserve.backup': u'true', u's3native.blocksize': u'67108864', u'http.max.delays': u'100', u'dfs.ha.fencing.ssh.connect-timeout': u'30000', u'lang.identification.only.certain': u'false', u'elastic.index': u'nutch', u'http.useHttp11': u'false', u'ha.health-monitor.connect-retry-interval.ms': u'1000', u'io.seqfile.compress.blocksize': u'1000000', u's3native.replication': u'3', u'io.compression.codec.bzip2.library': u'system-native', u'hadoop.ssl.keystores.factory.class': u'org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory', u'parser.caching.forbidden.policy': u'content', u'ftp.server.timeout': u'100000', u'hadoop.kerberos.kinit.command': u'kinit', u'net.topology.node.switch.mapping.impl': u'org.apache.hadoop.net.ScriptBasedMapping', u'moreIndexingFilter.indexMimeTypeParts': u'true', u'store.ip.address': u'false', u'io.map.index.interval': u'128', u'urlfilter.regex.file': 
u'regex-urlfilter.txt', u'hadoop.ssl.client.conf': u'ssl-client.xml', u'hadoop.security.instrumentation.requires.admin': u'false', u'db.fetch.schedule.adaptive.max_interval': u'31536000.0', u'ha.failover-controller.graceful-fence.connection.retries': u'1', u'link.analyze.initial.score': u'1.0f', u'nfs3.mountd.port': u'4242', u'fetcher.follow.outlinks.ignore.external': u'true', u'solr.commit.index': u'true', u'parsefilter.naivebayes.wordlist': u'naivebayes-wordlist.txt', u'hadoop.http.authentication.type': u'simple', u'hadoop.jetty.logs.serve.aliases': u'true', u'lang.analyze.max.length': u'2048', u'db.fetch.schedule.adaptive.min_interval': u'60.0', u'link.loops.depth': u'2', u'db.url.filters': u'false', u'selenium.hub.port': u'4444', u'db.update.max.inlinks': u'10000', u'hadoop.security.uid.cache.secs': u'14400', u'fetcher.follow.outlinks.depth.divisor': u'2', u'db.score.injected': u'1.0', u'file.content.limit': u'65536', u'db.update.purge.404': u'false', u'db.fetch.schedule.mime.file': u'adaptive-mimetypes.txt', u'urlnormalizer.regex.file': u'regex-normalize.xml', u'fetcher.verbose': u'false', u'nutch.conf.uuid': u'a85e84c6-30b7-4bc9-bb06-d530da475247', u'elastic.port': u'9300', u'fs.s3.block.size': u'67108864', u'fetcher.bandwidth.target.check.everyNSecs': u'30', u'fs.s3n.block.size': u'67108864', u'fs.s3.sleepTimeSeconds': u'10', u'net.topology.script.number.args': u'100', u'ha.health-monitor.rpc-timeout.ms': u'45000', u'elastic.max.bulk.docs': u'250', u'file.blocksize': u'67108864', u'db.injector.update': u'false', u'fs.permissions.umask-mode': u'022', u'io.serializations': u'org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization', u'http.agent.name': u'Team 24 Spider', u'tfile.io.chunk.size': u'1048576', u'ipc.client.connect.retry.interval': u'1000', u'hadoop.work.around.non.threadsafe.getpwuid': u'false', u'hadoop.http.filter.initializers': u'org.apache.hadoop.http.lib.StaticUserWebFilter', 
u'file.bytes-per-checksum': u'512', u'http.robots.403.allow': u'true', u'fetcher.follow.outlinks.num.links': u'4', u'fetcher.queue.mode': u'byHost', u'db.fetch.interval.default': u'2592000', u'db.fetch.retry.max': u'3', u'db.score.link.internal': u'1.0', u'io.seqfile.lazydecompress': u'true', u'http.auth.file': u'httpclient-auth.xml', u'http.redirect.max': u'0', u'plugin.auto-activation': u'true', u'fs.ftp.host.port': u'21', u'parsefilter.naivebayes.trainfile': u'naivebayes-train.txt', u'fs.swift.impl': u'org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem', u'ipc.client.fallback-to-simple-auth-allowed': u'false', u'http.enable.if.modified.since.header': u'true', u'fetcher.threads.fetch': u'10', u'hadoop.http.authentication.signature.secret.file': u'${user.home}/hadoop-http-auth-signature-secret', u'fs.automatic.close': u'true', u'fs.du.interval': u'600000', u'db.fetch.schedule.class': u'org.apache.nutch.crawl.DefaultFetchSchedule', u'ftp.client-write-packet-size': u'65536', u'selenium.hub.protocol': u'http', u'indexer.max.title.length': u'100', u'db.score.count.filtered': u'false', u'fs.s3.buffer.dir': u'${hadoop.tmp.dir}/s3', u'ftp.content.limit': u'65536', u'plugin.includes': u'protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)', u'ha.zookeeper.session-timeout.ms': u'5000', u'nfs3.server.port': u'2049', u'index.geoip.usage': u'insightsService', u'ipc.client.connection.maxidletime': u'10000', u'hadoop.ssl.require.client.cert': u'false'}

I could work with the Nutch initialization above, but my main aim was to give my crawl a different crawlId. py.test also passes all of its tests:

================================================================================ test session starts ================================================================================
platform darwin -- Python 2.7.10, pytest-2.8.2, py-1.4.30, pluggy-0.3.1
rootdir: /Users/Antrromet/Documents/USC/Fall2015/IR/nutch-python, inifile: 
collected 15 items 

test_nutch.py ...............

============================================================================ 15 passed in 21.84 seconds =============================================================================

ahmadia commented 9 years ago

Okay, cool. I haven't tried to do what you're doing before, so I'll need to take a look.

antrromet commented 9 years ago

Just a note: nt = Nutch() calls GET /config/default twice, as you can see from the logs above (nutch.py: GET Endpoint: /config/default). I'm not sure whether that is intended, but I wanted you to know that it's not a copy-paste error.

ahmadia commented 9 years ago

@antrromet - our documentation is out of date and needs to be updated. Refer to the test_nutch.py file for a complete tour of functionality. For now:

In [6]: Nutch?
Init signature: Nutch(self, confId='default', serverEndpoint='http://localhost:8081', raiseErrors=True, **args)
Docstring:      <no docstring>
Init docstring:
Nutch client for interacting with a Nutch instance over its REST API.

Constructor:

nt = Nutch()

Optional arguments:

confID - The name of the default configuration file to use, by default: nutch.DefaultConfig
serverEndpoint - The location of the Nutch server, by default: nutch.DefaultServerEndpoint
raiseErrors - raise exceptions if server response is not 200

Provides functions:
    server - getServerStatus, stopServer
    config - get and set parameters for this configuration
    job - get list of running jobs, get job metadata, stop/abort a job by id, and create a new job

To start a crawl job, use:
    Crawl() - or use the methods inject, generate, fetch, parse, updatedb in that order.

To run a crawl in one method, use:
-- nt = Nutch()
-- response, status = nt.crawl()

To override a confId, you'd need to create a configuration first. To use default:

nt = Nutch('default')

To use a custom configuration:

nt = Nutch() # a little wonky, we assume a configuration for interacting with Nutch (default here)
nt.Configs().create('custom_conf', {'override_param': 'here'})  # placeholder key/value; substitute the Nutch property you want to override
nt = Nutch('custom_conf')

antrromet commented 9 years ago

Got it. Yes, this will work. But can you tell me a way to change the crawlId?

I tried the REST APIs given here for creating a job, where you can specify a parameter like "crawlId":"crawl01". Those APIs seem to work perfectly fine when I call them from a REST client. Can this be done using nutch-python?
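
For reference, the kind of REST call described above can be sketched in plain Python using only the standard library. This is a rough illustration, not what nutch-python does internally: the /job/create path, the "type", "confId" and "args" fields, and the "url_dir" argument are assumptions about the Nutch REST job API, and build_job_request is a hypothetical helper name. Only "crawlId":"crawl01" and the server address come from this thread.

```python
import json
import urllib.request


def build_job_request(server_endpoint, crawl_id, job_type, conf_id="default", args=None):
    """Build (but do not send) a POST request that asks Nutch to create a job.

    Hypothetical helper: the endpoint path and every payload field except
    "crawlId" are assumptions, so check them against your Nutch version.
    """
    payload = {
        "crawlId": crawl_id,   # the custom crawl ID discussed in this thread
        "type": job_type,      # assumed job type name, e.g. "INJECT"
        "confId": conf_id,     # configuration to run the job with
        "args": args or {},    # job-specific arguments (assumed shape)
    }
    return urllib.request.Request(
        url=server_endpoint.rstrip("/") + "/job/create",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_job_request(
    "http://localhost:8081", "crawl01", "INJECT", args={"url_dir": "urls/"}
)
# To actually submit the job against a running Nutch server:
# response = urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the JSON body before anything reaches the server.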

ahmadia commented 9 years ago

@antrromet - Our documentation is seriously lacking here, but you can create a custom JobClient with:

nt = Nutch()
jc = nt.Jobs('your_crawl_id')

You can then use the job client to submit jobs, and use it with the CrawlClient as well. There are a few examples of using the job client in test.python.

antrromet commented 9 years ago

@ahmadia Got it Aron, thanks a lot! Really appreciate your help.

ahmadia commented 9 years ago

No problem. We need to fix our documentation :(

chrismattmann commented 9 years ago

@antrikss can you update our docs for this? Would you be willing to submit a PR?

chrismattmann commented 9 years ago

@antrromet

antrromet commented 9 years ago

@chrismattmann Sure thing. I'll look into it.

chrismattmann commented 9 years ago

@ayberk if you have time, would appreciate a PR

chrismattmann commented 9 years ago

See the wiki; I think this takes care of it.