BoTreeConsultingTeam / scrapper

Scraping Twitter and FB pages

Feature - Scrape individual authentic blog links from Twitter using keywords #1

Open ParthBarot-BoTreeConsulting opened 8 years ago

ParthBarot-BoTreeConsulting commented 8 years ago
  1. Scrape Twitter tweets based on keywords/hashtags and find blog links in them.
  2. Scrape each blog - find title, description, keywords, author, and social media links.
  3. Store the results in a CSV file.

We can use Wombat for this - https://github.com/felipecsl/wombat - which is built on Mechanize/Nokogiri.
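For illustration, a minimal Wombat crawl might look like the sketch below. The search path and the selector are guesses; the real tweet markup would need to be confirmed in the browser.

require 'wombat'

# Sketch only: collect tweet text from a Twitter mobile search page.
results = Wombat.crawl do
  base_url 'https://mobile.twitter.com'
  path '/search?q=%23fashion%20%23blog'

  # :list collects every matching node instead of only the first one.
  tweets 'css=.tweet-text', :list
end

puts results['tweets']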

ParthBarot-BoTreeConsulting commented 8 years ago

As of now, we are considering the following solution.

  1. An automated worker, using Wombat, will find new data in Twitter tweets and identify URLs. These URLs must be US-only, so we can skip every URL that does not match \.(com|info|net|org|us). Each URL will be queued for a parsing job.
  2. A parsing job, run via ZeroMQ/Sidekiq (a sketch follows below) - for each URL, we need to scrape the following and match it against our defined keywords to decide whether the URL meets our criteria:
    • title
    • description
    • keywords
    • author
    • Social links - TW/FB/LIN/G+/YT/Pinterest etc.
  3. The job will add the processed record to a database with all the details and a "pending download" status.
  4. Once downloaded, the status will change to "done".

After downloading, a team member would need to verify manually whether each URL really matches expectations.
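A minimal sketch of step 2 as a Sidekiq worker, assuming Nokogiri for parsing; the ScrapedUrl model, the meta selectors, and the social-host list are assumptions, not settled decisions:

require 'sidekiq'
require 'open-uri'
require 'nokogiri'

class ParseBlogJob
  include Sidekiq::Worker

  # Hosts we treat as social links (step 2 above); list is an assumption.
  SOCIAL_HOSTS = /twitter\.com|facebook\.com|linkedin\.com|plus\.google\.com|youtube\.com|pinterest\.com/

  def perform(url)
    doc  = Nokogiri::HTML(URI.open(url))
    meta = ->(name) { doc.at("meta[name='#{name}']")&.[]('content') }

    details = {
      url:          url,
      title:        doc.at('title')&.text,
      description:  meta.call('description'),
      keywords:     meta.call('keywords'),
      author:       meta.call('author'),
      social_links: doc.css('a[href]').map { |a| a['href'] }.grep(SOCIAL_HOSTS).uniq
    }

    # Step 3: persist with "pending download" status (ScrapedUrl is hypothetical).
    ScrapedUrl.create!(details.merge(status: 'pending download'))
  end
end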

ParthBarot-BoTreeConsulting commented 8 years ago

@HirenBhalani-BTC I have tested the following; it is able to fetch additional result pages without logging in, using Mechanize alone. Now we need to parse the content and extract the links.

require 'mechanize'
require 'uri'

def do_it
  mechanize = Mechanize.new
  # Hash#to_param needs ActiveSupport; URI.encode_www_form works in plain Ruby.
  params = URI.encode_www_form(q: '#fashion #blog #makeup', s: 'typd')
  search_url = "https://mobile.twitter.com/search?#{params}"
  page = mechanize.get(search_url)
  pages_loaded = 0

  File.open('aaa.html', 'a+') do |f|
    f.syswrite(page.content)
    more_link = page.link_with(text: ' Load older Tweets ')
    # Follow the pagination link up to 10 times; link_with returns nil
    # once there are no more pages, which also ends the loop.
    while pages_loaded < 10 && more_link
      puts 'Loading...'
      page = more_link.click
      f.syswrite(page.content)
      more_link = page.link_with(text: ' Load older Tweets ')
      pages_loaded += 1
    end
  end

  puts 'Completed!'
end
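
A possible next step for the parsing part, sketched against the aaa.html file the method writes (note that Twitter may wrap URLs in t.co redirects, which would need to be resolved before filtering; the TLD filter mirrors the regex from the proposed solution above):

require 'nokogiri'

doc = Nokogiri::HTML(File.read('aaa.html'))

# Keep only absolute links whose TLD matches the US-focused list.
links = doc.css('a[href]')
           .map { |a| a['href'] }
           .grep(%r{\Ahttps?://[^/]+\.(com|info|net|org|us)(/|\z)}i)
           .uniq

puts links
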
ParthBarot-BoTreeConsulting commented 8 years ago

Reference Links

http://blog.saush.com/2009/03/17/write-an-internet-search-engine-with-200-lines-of-ruby-code/
https://github.com/felipecsl/wombat
https://github.com/joenorton/rubyretriever
https://github.com/chriskite/anemone
https://github.com/peterc/pismo

Google search engine gems

https://github.com/wiseleyb/google_custom_search_api
https://github.com/alexreisner/google_custom_search
https://github.com/tj/google-search (~200 stars)

ParthBarot-BoTreeConsulting commented 8 years ago

Queue management References

ZeroMQ

http://zeromq.org/bindings:ruby
https://github.com/zeromq/rbzmq
http://zguide.zeromq.org/page:all
http://www.sitepoint.com/zeromq-ruby/
http://blog.willj.net/2010/08/01/basic-zero-mq-ruby-example/
https://github.com/andrewvc/learn-ruby-zeromq
http://stackshare.io/stackups/sidekiq-vs-zeromq#more

RabbitMQ

https://www.rabbitmq.com/tutorials/tutorial-one-ruby.html
https://github.com/ruby-amqp/bunny
https://github.com/ruby-amqp/amqp
http://www.bestechvideos.com/2008/12/09/rabbitmq-an-open-source-messaging-broker-that-just-works

Comparison - http://blog.x-aeon.com/2013/04/10/a-quick-message-queue-benchmark-activemq-rabbitmq-hornetq-qpid-apollo/
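
If RabbitMQ wins out over Sidekiq/ZeroMQ, a minimal Bunny sketch for wiring step 1 (URL discovery) to step 2 (parsing) could look like this; the queue name, sample URL, and default local connection are placeholders, not decisions:

require 'bunny'

conn = Bunny.new # defaults to amqp://guest:guest@localhost:5672
conn.start

channel = conn.create_channel
queue   = channel.queue('parse_jobs', durable: true)

# Producer side: the Twitter worker enqueues each discovered URL.
queue.publish('http://example.com/some-blog-post', persistent: true)

# Consumer side: a parsing worker pulls URLs off the queue.
# block: true keeps the process alive waiting for messages.
queue.subscribe(block: true) do |_delivery_info, _properties, url|
  puts "Parsing #{url}"
end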