Open madankb opened 10 years ago
it looks like sourcify is not working properly under ruby 2.1.1
we need to check that this works properly
https://github.com/CalculatedContent/sourcify
or see if we need to migrate to a newer version
https://github.com/ngty/sourcify
the basic design pattern for the crawler is described here
http://charlesmartin14.wordpress.com/2013/08/10/a-ruby-dsl-design-pattern-for-distributed-computing/
the first step is to write some small tests and verify that sourcify is working
I updated my sourcify gem version from 0.5 to 0.6. Then I ran the test programs below.
require 'sourcify'

def block_to_s(&blk)
  blk.to_source(:strip_enclosure => true)
end

puts block_to_s {
  str = "Hello"
  str.reverse!
  print str
}
Output:-
str = "Hello"
str.reverse!
print(str)
require 'rubygems'
require 'bundler/setup'
require 'cloud-crawler'
require 'trollop'

opts = Trollop::options do
  opt :urls, "urls to crawl", :short => "-u", :multi => true, :default => "http://www.ehow.com"
end

urls = ["http://www.crossfit.com"]
CloudCrawler::crawl(urls, opts) do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if do |lnk|
      text_for(lnk) =~ /Level 1/i
    end
  end
  cc.on_every_page do |page|
    puts page.url.to_s
  end
end
Output :-
/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
/.rvm/gems/ruby-2.1.1@global/gems/bundler-1.5.3/lib/bundler/runtime.rb:220: warning: Insecure world writable dir /usr/local in PATH, mode 040777
I, [2014-05-12T22:26:56.313418 #3636] INFO -- : crawl ["http://www.crossfit.com"] with proc do |cc|
  cc.focus_crawl do |page|
    page.links.keep_if { |lnk| text_for(lnk) =~ /Level 1/i }
  end
  cc.on_every_page { |page| puts(page.url.to_s) }
end
I, [2014-05-12T22:26:56.319176 #3636] INFO -- : initialzing driver for cc
I, [2014-05-12T22:26:56.319305 #3636] INFO -- : loading crawl job = {:url=>"http://www.crossfit.com"}
I, [2014-05-12T22:26:56.327747 #3636] INFO -- : keys on ccmq ["dsl_blocks:2", "auto_dsl_id", "dsl_blocks:1"]
I, [2014-05-12T22:26:56.327813 #3636] INFO -- : submitting CloudCrawler::CrawlJob single (non recurring) job
Previously I was getting an error with sourcify version 0.5. I am still facing the same error with test_crawl.rb.
The sourcify gems probably don't work. We used our own forked version of sourcify because of this, although it might not be working properly in ruby 2.1.
I'll see if I can reproduce the error
this is the forked version with the bug fixes
https://github.com/CalculatedContent/sourcify
this should be what bundler installs
I tried sourcify from both https://github.com/CalculatedContent/sourcify and https://github.com/ngty/sourcify (Changing the Gemfile). But I am getting the same error. I may need to try installing ruby 1.9.3.
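For anyone repeating this, switching between the two sources in the Gemfile would look roughly like the sketch below. The project's actual Gemfile is not shown in this thread, so the `source` line and the layout are assumptions; only the two repository URLs come from the discussion above.

```ruby
# Gemfile (sketch; assumes the project installs sourcify through Bundler)
source 'https://rubygems.org'

# Option A: the CalculatedContent fork with the bug fixes
gem 'sourcify', :git => 'https://github.com/CalculatedContent/sourcify.git'

# Option B: upstream sourcify (comment out option A before enabling this)
# gem 'sourcify', :git => 'https://github.com/ngty/sourcify.git'
```

After editing, run `bundle install` again, then `bundle show sourcify` to confirm which checkout is actually being loaded.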
that is, it is necessary to move to ruby 2.1 so it is useful to look carefully at what is working and what is not
we need to isolate where the bug is. is the bug in sourcify itself?
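One way to check whether sourcify itself is at fault is to call `to_source` on a few block shapes outside of cloud-crawler. The script below is only a sketch: `sourcify_available?` and `probe_block` are hypothetical helper names, and the block bodies merely imitate the shape of the crawler DSL. It reports the exception class instead of crashing, so the same file can be run under each ruby version and the results compared.

```ruby
# Returns true when sourcify loads and has added Proc#to_source.
def sourcify_available?
  require 'sourcify'
  Proc.method_defined?(:to_source)
rescue ScriptError, StandardError
  false
end

# Hypothetical helper: tries to turn the given block back into source.
# Returns [label, status, detail] where status is :ok, :failed or
# :unavailable. On failure, detail holds the exception class name.
def probe_block(label, &blk)
  return [label, :unavailable, nil] unless sourcify_available?
  src = blk.to_source(:strip_enclosure => true)
  [label, :ok, src]
rescue ScriptError, StandardError => e
  [label, :failed, e.class.name]
end

# Each block starts on its own line, one block per line.
brace = probe_block('brace block') { |x| x + 1 }

multi = probe_block('do/end block') do |page|
  page.to_s
end

nested = probe_block('nested do/end blocks') do |cc|
  cc.each do |page|
    page.to_s
  end
end

results = [brace, multi, nested]
results.each do |label, status, detail|
  puts "#{label}: #{status} #{detail.inspect}"
end
```

If every case reports `:ok` here but test_crawl.rb still fails, the problem is more likely in how the driver passes the block along than in sourcify.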
but generally yes...the requirements are ruby 1.9.7
to install 1.9.7, I suggest using rvm; this makes it very easy
Same problem here, and I'm using ruby 1.9.7 with rvm.
1) Started redis-server
2) bundle exec ./test/test_crawl.rb -u http://calculatedcontent.com gives below mentioned error.
/cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser/scanner.rb:19:in `process': Sourcify::NoMatchingProcError (Sourcify::NoMatchingProcError)
    from cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:40:in `extracted_source'
    from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:22:in `sexp'
    from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/parser.rb:17:in `source'
    from /cloud-crawler/cloud-crawler/vendor/bundle/ruby/2.1.0/bundler/gems/sourcify-5767bd2a0c09/lib/sourcify/proc/methods/to_source.rb:39:in `to_source'
    from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:234:in `crawl'
    from /cloud-crawler/cloud-crawler/lib/cloud-crawler/driver.rb:49:in `standalone_crawl'
    from ./test/test_crawl.rb:27:in
I am using ruby version 2.1.1.