Open ParthBarot-BoTreeConsulting opened 8 years ago
As of now, we can think of the following solution.
\.(com|info|net|org|us)
Each URL will be added to a parsing job. After downloading, a team would need to verify manually whether the URL really appears as expected.
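A minimal sketch of how that pattern might be applied, assuming its purpose is to keep only URLs whose host ends in one of the listed TLDs before they are queued for parsing. The helper name `candidate_url?`, the sample URLs, the end-of-host anchoring, and the in-memory `Queue` (standing in for a real job queue) are illustrative assumptions, not part of the proposal.

```ruby
require 'uri'

# Pattern from the proposal, anchored to the end of the host so that
# "example.com.evil.io" is not accepted (the anchoring is an assumption).
TLD_PATTERN = /\.(com|info|net|org|us)\z/i

# Hypothetical helper: true when the URL parses and its host ends in an allowed TLD.
def candidate_url?(url)
  host = URI.parse(url).host
  !host.nil? && host.match?(TLD_PATTERN)
rescue URI::InvalidURIError
  false
end

urls = [
  'https://example.com/blog',
  'http://fashion.example.org/makeup',
  'https://example.io/post',
  'not a url'
]

# In-memory stand-in for the parsing-job queue.
parsing_jobs = Queue.new
urls.select { |u| candidate_url?(u) }.each { |u| parsing_jobs << u }

puts parsing_jobs.size
```

The manual verification step would then happen on whatever the parsing job downloads for each queued URL.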
@HirenBhalani-BTC I have tested the following. It is able to fetch more pages without logging in, using only Mechanize. Next, we need to parse the content and fetch the links.
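require 'mechanize'
# Hash#to_param and Object#present? come from ActiveSupport
# (already loaded inside a Rails console).
require 'active_support'
require 'active_support/core_ext/object'

# Fetch the mobile Twitter search page, then follow the "Load older Tweets"
# link up to 10 times, appending each page's HTML to aaa.html.
def doIt()
  mechanize = Mechanize.new
  params = {q: '#fashion #blog #makeup', s: 'typd'}.to_param
  search_url = "https://mobile.twitter.com/search?#{params}"
  mp = mechanize.get(search_url)
  i = 0
  File.open('aaa.html', 'a+') do |f|
    f.syswrite(mp.content)
    # link_with returns nil when the page has no matching link.
    more_link = mp.link_with(text: " Load older Tweets ")
    while i < 10 && more_link.present?
      mp = more_link.click
      f.syswrite(mp.content)
      more_link = mp.link_with(text: " Load older Tweets ")
      i += 1
      puts "Loading..."
    end
  end
  puts "Completed!"
end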
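The next step mentioned above, parsing the saved content and fetching links, could be sketched roughly as follows. This uses only the standard library; the helper name `extract_links` and the sample HTML are hypothetical, and the regex scan is a rough stand-in for a real HTML parser such as Nokogiri, which would be more robust on real pages.

```ruby
require 'uri'

# Hypothetical helper: collect absolute http(s) links from an HTML string.
# Relative hrefs are resolved against base_url; duplicates are dropped.
def extract_links(html, base_url)
  html.scan(/href\s*=\s*["']([^"']+)["']/i).flatten.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil # skip hrefs that cannot be parsed as URIs
    end
  end.compact.select { |u| u.start_with?('http://', 'https://') }.uniq
end

sample_html = <<~HTML
  <a href="/search?q=next">Load older Tweets</a>
  <a href="https://example.com/blog">blog</a>
  <a href="https://example.com/blog">blog again</a>
  <a href="mailto:someone@example.com">mail</a>
HTML

links = extract_links(sample_html, 'https://mobile.twitter.com/search')
puts links
```

On the real output, `File.read('aaa.html')` would take the place of `sample_html`.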
Reference Links
http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/
https://github.com/felipecsl/wombat
https://github.com/joenorton/rubyretriever
https://github.com/chriskite/anemone
https://github.com/peterc/pismo
Google search engine gems
https://github.com/wiseleyb/google_custom_search_api
https://github.com/alexreisner/google_custom_search
https://github.com/tj/google-search - 200 starred
Queue management References
ZeroMQ
http://zeromq.org/bindings:ruby
https://github.com/zeromq/rbzmq
http://zguide.zeromq.org/page:all
http://www.sitepoint.com/zeromq-ruby/
http://blog.willj.net/2010/08/01/basic-zero-mq-ruby-example/
https://github.com/andrewvc/learn-ruby-zeromq
http://stackshare.io/stackups/sidekiq-vs-zeromq#more
RabbitMQ
https://www.rabbitmq.com/tutorials/tutorial-one-ruby.html
https://github.com/ruby-amqp/bunny
https://github.com/ruby-amqp/amqp
http://www.bestechvideos.com/2008/12/09/rabbitmq-an-open-source-messaging-broker-that-just-works
Comparison - http://blog.x-aeon.com/2013/04/10/a-quick-message-queue-benchmark-activemq-rabbitmq-hornetq-qpid-apollo/
We can use Wombat for this - https://github.com/felipecsl/wombat - which is built on Mechanize and Nokogiri.