TeamHG-Memex / autopager

Detect and classify pagination links

Error when passing Scrapy response #3

Closed: ittailup closed this issue 8 years ago

ittailup commented 8 years ago

Passing a Scrapy response to autopager raises an AttributeError (traceback below). The same page works when it is fetched with requests instead of Scrapy.

(ipython)➜  TweetScraper git:(master) ✗ scrapy shell http://elcomercio.pe/buscar/ppk
2016-04-09 23:06:27 [scrapy] INFO: Scrapy 1.0.5 started (bot: TweetScraper)
2016-04-09 23:06:27 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-04-09 23:06:27 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'TweetScraper.spiders', 'LOG_LEVEL': 'INFO', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['TweetScraper.spiders'], 'BOT_NAME': 'TweetScraper', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'TweetScraper'}
2016-04-09 23:06:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2016-04-09 23:06:27 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-09 23:06:27 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-09 23:06:27 [scrapy] INFO: Enabled item pipelines: SaveToFilePipeline
2016-04-09 23:06:27 [scrapy] INFO: Spider opened
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x108a57090>
[s]   item       {}
[s]   request    <GET http://elcomercio.pe/buscar/ppk>
[s]   response   <200 http://elcomercio.pe/buscar/ppk>
[s]   settings   <scrapy.settings.Settings object at 0x109e7e250>
[s]   spider     <DefaultSpider 'default' at 0x10bdb1890>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: import autopager

In [2]: autopager.select(response)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-355d1f012366> in <module>()
----> 1 autopager.select(response)

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in select(page, direct, prev, next)
     36     By default, all link types are returned.
     37     """
---> 38     return get_shared_autopager().select(page, direct, prev, next)
     39
     40

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in select(self, page, direct, prev, next)
     96         """
     97         links = self.extract(page, prev=prev, next=next, direct=direct)
---> 98         return parsel.SelectorList([x for y, x in links])
     99
    100     def extract(self, page, direct=True, prev=True, next=True):

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/autopager.pyc in extract(self, page, direct, prev, next)
    110         sel = _any2selector(page)
    111         links = get_links(sel)
--> 112         xseq = page_to_features(links)
    113         yseq = self.crf.predict_single(xseq)
    114         for x, y in zip(links, yseq):

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/model.pyc in page_to_features(xseq)
    126
    127 def page_to_features(xseq):
--> 128     features = [link_to_features(a) for a in xseq]
    129
    130     around = get_text_around_selector_list(xseq, max_length=15)

/Users/gabriel/.virtualenvs/ipython/lib/python2.7/site-packages/autopager/model.pyc in link_to_features(link)
     60     )
     61
---> 62     elem = link.root
     63     elem_target = _elem_attr(elem, 'target')
     64     elem_rel = _elem_attr(elem, 'rel')

AttributeError: 'Selector' object has no attribute 'root'

ittailup commented 8 years ago

As a workaround, I wrapped the response in a Selector and passed the extracted HTML string to autopager instead of the response object. That works:

In [16]: sel = Selector(response)

In [17]: autopager.urls(sel.extract())
Out[17]:
[u'http://elcomercio.pe/buscar/ppk/?start=15',
 u'http://elcomercio.pe/buscar/ppk/?start=30']

kmike commented 8 years ago

Aha, your example works for me as-is (autopager.select(response)) with Scrapy 1.1.0rc3 + Python 3.5, because Scrapy 1.1.0rc3 uses the parsel library for its selectors. Scrapy 1.0.5 has its own built-in selectors, and there are some differences: the .root attribute is available as ._root there.
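
For anyone who has to stay on Scrapy 1.0.x, the attribute difference could in principle be bridged with a small lookup helper. This is only a sketch under assumptions: selector_root and the two stand-in classes are hypothetical names made up for illustration, not part of autopager, Scrapy, or parsel. It only assumes what is stated in this thread, namely that the wrapped element is exposed as .root in one selector flavour and as ._root in the other.

```python
def selector_root(sel):
    """Return the element wrapped by a selector-like object.

    Assumption (from this thread): parsel-style selectors expose the
    element as ``root``; Scrapy 1.0.x built-in selectors use ``_root``.
    """
    for name in ("root", "_root"):
        if hasattr(sel, name):
            return getattr(sel, name)
    raise AttributeError("selector exposes neither .root nor ._root")


# Hypothetical stand-ins so the sketch runs without Scrapy or parsel.
class ParselLikeSelector:
    def __init__(self, element):
        self.root = element


class LegacyScrapySelector:
    def __init__(self, element):
        self._root = element


element = object()
# Both selector flavours resolve to the same wrapped element.
same_element = (
    selector_root(ParselLikeSelector(element)) is element
    and selector_root(LegacyScrapySelector(element)) is element
)
```

Checking .root first means the public attribute takes precedence whenever it exists. Passing the extracted HTML string, as in the workaround earlier in this thread, avoids the attribute lookup altogether.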