SaptakS / opinator

A plugin to do sentiment analysis of reviews on e-commerce websites.

Remove scrapy as your crawler. #8

Open fluffybeing opened 9 years ago

fluffybeing commented 9 years ago

I think what the project is doing is creating a Scrapy job for every product-review request. This is a compute-heavy job and will not handle more than 10 requests at a time on a machine with 4 GB of RAM. Scrapy is meant for crawling many pages, not a single page, so use requests with lxml for scraping content from a single URL.

>>> import requests
>>> from lxml import html

>>> response = requests.get(URL)
# Parse the HTML body of the response (not the response object itself).
>>> data = html.fromstring(response.content)
# Now you can use XPath here the same way you did in Scrapy.
>>> data.xpath('')

SaptakS commented 9 years ago

But in our case we have to crawl through more than one page: after we reach the next page of reviews, we need to scrape that page as well.

fluffybeing commented 9 years ago

But how many? Any approximate number? I think requests and lxml can easily do that. You can extract all the links from the href attributes and then parse the ones you want.
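A rough sketch of that idea, assuming each review page exposes a link to the next page: the loop below keeps following "next" links until there are none left or a page limit is reached. `crawl_reviews`, `fetch_page`, `fake_fetch`, and `FAKE_SITE` are hypothetical names used only for illustration. In the real plugin `fetch_page` would wrap `requests.get` plus the lxml XPath queries; here it reads from an in-memory dict so the example runs without network access.

```python
from typing import Callable, List, Optional, Tuple

# A page fetcher returns (reviews_on_page, next_page_url_or_None).
# Hypothetical signature: in practice it would wrap requests.get + lxml XPath.
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def crawl_reviews(start_url: str, fetch_page: PageFetcher, max_pages: int = 10) -> List[str]:
    """Follow 'next page' links from start_url and collect all reviews seen."""
    reviews: List[str] = []
    url: Optional[str] = start_url
    pages_fetched = 0
    while url is not None and pages_fetched < max_pages:
        page_reviews, url = fetch_page(url)  # url becomes the next page, or None
        reviews.extend(page_reviews)
        pages_fetched += 1
    return reviews


# Made-up in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "/reviews?page=1": (["good", "bad"], "/reviews?page=2"),
    "/reviews?page=2": (["okay"], "/reviews?page=3"),
    "/reviews?page=3": (["great"], None),
}


def fake_fetch(url: str) -> Tuple[List[str], Optional[str]]:
    """Stand-in fetcher that looks the page up in FAKE_SITE."""
    return FAKE_SITE[url]


all_reviews = crawl_reviews("/reviews?page=1", fake_fetch)
print(all_reviews)
```

The `max_pages` cap is a design choice to keep each sentiment request bounded, so a broken or circular "next" link cannot turn one request into an unbounded crawl.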

SaptakS commented 9 years ago

It's true that it can be done. But we were thinking that if in future we generalize this project and include the reviews of other e-marketing websites as well, then we may need Scrapy. I am not sure. We will definitely look into it.

apexkid commented 9 years ago

I think approximately 100 reviews will be sufficient to calculate the SS (Sentiment Score). One page has approximately 15 reviews, so it amounts to about 8 page hits per sentiment request. @rahulrrixe @SaptakS Decide based on this data point.
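As a back-of-the-envelope check on that estimate: the number of page hits is the target review count divided by the reviews per page, rounded up. With exactly 15 reviews on every page this comes to 7 hits, and it reaches 8 if pages average closer to 13 reviews. `pages_needed` is a hypothetical helper written for this illustration, not part of the project.

```python
import math


def pages_needed(target_reviews: int, reviews_per_page: int) -> int:
    """Return how many pages must be fetched to collect target_reviews."""
    if reviews_per_page <= 0:
        raise ValueError("reviews_per_page must be positive")
    if target_reviews <= 0:
        return 0
    # Round up: a partially needed page still costs one full request.
    return math.ceil(target_reviews / reviews_per_page)


hits_full_pages = pages_needed(100, 15)   # every page has 15 reviews
hits_short_pages = pages_needed(100, 13)  # pages average 13 reviews
print(hits_full_pages, hits_short_pages)  # prints: 7 8
```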

apexkid commented 9 years ago

Also, premature optimization is a curse in my opinion. Let's focus first on what gets our MVP ready, which requires ease of coding. If Scrapy permits that, then it's OK to use it. Keep this issue open for the future so we know where to optimize.