Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.

Correctly redirect pages from feedproxy to the final URL #22

Closed midudev closed 6 years ago

midudev commented 8 years ago

Example of feedproxy url: http://feedproxy.google.com/~r/libertaddigital/nacional/~3/R1fhgiVwJmQ/story01.htm

Desired url to parse: http://www.libertaddigital.com/espana/politica/2015-10-19/casado-rechaza-debate-interno-sobre-el-liderazgo-de-rajoy-pero-aznar-vuelve-a-la-carga-1276559464/

I don't have time right now to make a pull request, but here is the code I use on the scraping side, in case it helps, since this is a special kind of redirect. feedproxy (Google's) and feedsportal are ways to monetize feeds and are very popular among Western publishers. The snippet is written in CoffeeScript.

      # make sure we're not on a feedsportal crappy ads page
      if /feedsportal\.com/.test(uri)
        # get the correct URL from the first link in the page body
        matches = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/.exec(body)
        if matches
          # retry the request against the extracted URL
          options.uri = matches[1]
          parsingData.html = matches[1]
          return requestArticle(options, parsingData, callback)
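
For anyone not using CoffeeScript, the same check can be sketched in plain JavaScript. This is only an illustration of the idea in the snippet above, not code from node-readability: `resolveInterstitial` and `FIRST_HREF` are hypothetical names, and it assumes the first `<a href="...">` in a feedsportal page body points at the real article, as the snippet above does.

```javascript
// Hypothetical helper, not part of node-readability's API.
// Same pattern as the CoffeeScript snippet: capture the first double-quoted href.
const FIRST_HREF = /<a\s+(?:[^>]*?\s+)?href="([^"]*)"/i;

// Returns the URL to request next, or null when `uri` is not a
// feedsportal page or the body contains no usable link.
function resolveInterstitial(uri, body) {
  let host;
  try {
    host = new URL(uri).hostname;
  } catch (err) {
    return null; // `uri` is not an absolute URL
  }
  // Match feedsportal.com and its subdomains only.
  if (!/(^|\.)feedsportal\.com$/i.test(host)) {
    return null;
  }
  const match = FIRST_HREF.exec(body);
  return match ? match[1] : null;
}
```

A caller would request the page, pass the URL and body to `resolveInterstitial`, and repeat the request when it returns a non-null URL. The sketch does not decode HTML entities such as `&amp;` inside the `href`, so a real implementation would probably need that step too.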