CalculatedContent / cloud-crawler

Distributed Ruby Web Crawler, backed up by Redis
Could not run the test crawl #21

Open madankb opened 10 years ago

madankb commented 10 years ago

1) Started redis-server

2) bundle exec ./test/test_crawl.rb -u http://calculatedcontent.com gives the error below:

/cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser/scanner.rb:19:in `process': Sourcify::NoMatchingProcError (Sourcify::NoMatchingProcError)
  from cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:40:in `extracted_source'
  from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:22:in `sexp'
  from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:17:in `source'
  from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/methods/to_source.rb:39:in `to_source'
  from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:234:in `crawl'
  from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:49:in `standalone_crawl'
  from ./test/test_crawl.rb:27:in '

I am using ruby version 2.1.1.

charlesmartin14 commented 10 years ago

it looks like sourcify is not working properly under ruby 2.1.1

we need to check that this works properly

https://github.com/CalculatedContent/sourcify

or see if we need to migrate to a newer version

https://github.com/ngty/sourcify

the basic design pattern for the crawler is described here

http://charlesmartin14.wordpress.com/2013/08/10/a-ruby-dsl-design-pattern-for-distributed-computing/

charlesmartin14 commented 10 years ago

the first step to do is write some small tests and verify that sourcify is working
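A small check along these lines might look like the sketch below. It assumes the gem exposes `Proc#to_source` (the call that appears in the stack trace above); `check_to_source` and `SOURCIFY_AVAILABLE` are hypothetical names made up for this example, not part of sourcify or cloud-crawler. It should be run from a file rather than with `ruby -e`, since sourcify has to re-read the source of the block.

```ruby
# Load sourcify if it is installed and loads cleanly; otherwise remember that it is missing.
begin
  require 'sourcify'
  SOURCIFY_AVAILABLE = true
rescue LoadError, StandardError
  SOURCIFY_AVAILABLE = false
end

# Hypothetical helper: turn a block into source text and eval it back,
# reporting whether the round trip produced a Proc again.
def check_to_source(&blk)
  return { :status => :sourcify_missing } unless SOURCIFY_AVAILABLE
  src = blk.to_source
  rebuilt = eval(src)
  if rebuilt.is_a?(Proc)
    { :status => :ok, :source => src }
  else
    { :status => :failed, :detail => "eval returned #{rebuilt.class}" }
  end
rescue Exception => e
  # Rescued broadly on purpose: I am not certain which base class
  # the sourcify error classes inherit from.
  { :status => :failed, :detail => "#{e.class}: #{e.message}" }
end

result = check_to_source { |x| x + 1 }
puts result.inspect
```

If this reports `:ok` for simple blocks but the crawl still fails, the problem is more likely in how the block reaches `to_source` inside the driver than in sourcify as a whole.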

madankb commented 10 years ago

I updated my sourcify gem from version 0.5 to 0.6. Then I ran the two test programs below.

1:-

require 'sourcify'

def block_to_s(&blk)
  blk.to_source(:strip_enclosure => true)
end

puts block_to_s {
  str = "Hello"
  str.reverse!
  print str
}

Output:-

str = "Hello"
str.reverse!
print(str)

2:-

require 'rubygems'
require 'bundler/setup'
require 'cloud-crawler'
require 'trollop'

opts = Trollop::options do
  opt :urls, "urls to crawl", :short => "-u", :multi => true, :default => "http://www.ehow.com"
end

urls = ["http://www.crossfit.com"]
CloudCrawler::crawl(urls, opts) do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk|
      text_for(lnk) =~ /Level 1/i
    end
  end
  cc.on_every_page do |page|
    puts page.url.to_s
  end
end

Output :-

/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
I, [2014-05-12T22:26:56.313418 #3636] INFO -- : crawl ["http://www.crossfit.com"] with proc do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if { |lnk| text_for(lnk) =~ /Level 1/i }
  end
  cc.on_every_page { |page| puts(page.url.to_s) }
end
I, [2014-05-12T22:26:56.319176 #3636] INFO -- : initialzing driver for cc
I, [2014-05-12T22:26:56.319305 #3636] INFO -- : loading crawl job = {:url=>"http://www.crossfit.com"}
I, [2014-05-12T22:26:56.327747 #3636] INFO -- : keys on ccmq ["dsl_blocks:2", "auto_dsl_id", "dsl_blocks:1"]
I, [2014-05-12T22:26:56.327813 #3636] INFO -- : submitting CloudCrawler::CrawlJob single (non recurring) job

Previously, with sourcify version 0.5, I was getting an error. I am still getting the same error with test_crawl.rb.

charlesmartin14 commented 10 years ago

The sourcify gems probably don't work. We use our own forked version of sourcify because of this, although it might not be working properly in ruby 2.1.

I'll see if I can reproduce the error.

charlesmartin14 commented 10 years ago

this is the forked version with the bug fixes

https://github.com/CalculatedContent/sourcify

this should be what bundler installs

madankb commented 10 years ago

I tried sourcify from both https://github.com/CalculatedContent/sourcify and https://github.com/ngty/sourcify (Changing the Gemfile). But I am getting the same error. I may need to try installing ruby 1.9.3.

charlesmartin14 commented 10 years ago

that said, it is necessary to move to ruby 2.1, so it is useful to look carefully at what is working and what is not

we need to isolate where the bug is. is the bug in sourcify itself?
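One way to start isolating it, as a rough sketch: as far as I understand, sourcify on ruby 1.9+ finds a block through `Proc#source_location` and then re-parses that file, so a `NoMatchingProcError` suggests it could not find a matching block at the reported file and line. The snippet below only uses core ruby to show what location a block reports. `describe_block` is a hypothetical helper name, not something from sourcify or cloud-crawler.

```ruby
# Hypothetical helper: report where a block was defined and, if the file
# can be read, the text of the line the block starts on.
def describe_block(&blk)
  file, line = blk.source_location
  text = nil
  if file && line && File.exist?(file)
    # source_location lines are 1-based, array indexes are 0-based
    text = File.readlines(file)[line - 1].to_s.strip
  end
  { :file => file, :line => line, :text => text }
end

info = describe_block { |page| puts page }
puts "defined at #{info[:file]}:#{info[:line]}"
puts "source line: #{info[:text].inspect}"
```

Running the same check on the block that driver.rb passes to `to_source` would show whether the reported location actually contains the block. If it does not, the bug may be in how the block gets there rather than in sourcify itself.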

charlesmartin14 commented 10 years ago

but generally yes...the requirements are ruby 1.9.7

charlesmartin14 commented 10 years ago

to install 1.9.7, i suggest using rvm. this makes it very easy

lucaswxp commented 9 years ago

Same problem here, and I'm using ruby 1.9.7 with rvm.