brejc8 / dealferret

MIT License

Scan code #2

Closed overture8 closed 6 years ago

overture8 commented 6 years ago

Hey,

Nice site! Where's the code that does the scanning / crawling of the retailer sites?

Thanks 👍

brejc8 commented 6 years ago

Thanks. I don't make it public, as the stores will simply ban the scrapers if they are being hammered by more than one. It is already touchy, with a couple of shops blocking based on user agent. If there is a specific store, I can send it to you. Or I can get you access to the daily scraped info.

overture8 commented 6 years ago

ah, no worries. I was just really interested in how you were scraping the sites. It's something I'm interested in, and I thought (as you mentioned) you'd be having issues with the retailers complaining.

brejc8 commented 6 years ago

In case you were wondering, I use Scrapy as the backbone of the scrapers. There are over 80 of them, and they take a home dual-core 2.7GHz PC about 4 hours to process. It is mostly CPU bound, as you can execute several scrapers in parallel. I generate JSON files with the data for each store, which get sent over to the webserver, which updates the database. This also downloads and processes the images if it is a new item.
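
To give a flavour, a bare-bones Scrapy spider looks something like this (the store name, URLs and CSS selectors are made up for illustration, not taken from one of the real spiders):

import scrapy

class ExampleStoreSpider(scrapy.Spider):
    # Hypothetical spider for illustration only.
    name = "examplestore"
    start_urls = ["https://www.example-store.test/groceries"]

    def parse(self, response):
        # Walk the category listing pages rather than every individual product page.
        for product in response.css("div.product-tile"):
            yield {
                "uid": int(product.attrib["data-product-id"]),
                "name": product.css("a.title::text").get(),
                "url": response.urljoin(product.css("a.title::attr(href)").get()),
                "image": product.css("img::attr(src)").get(),
                "price": float(product.css("span.price::text").re_first(r"[\d.]+")),
            }
        # Follow pagination within the same category.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Running a spider with scrapy crawl examplestore -o examplestore.json then gives the per-store JSON file that gets shipped to the webserver.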

overture8 commented 6 years ago

Thanks for that! 👍

OscarVanL commented 5 years ago

One thing I was curious about: how do you go about enumerating the product listings from each website? When adding a new website to deal ferret, do you run some kind of crawler that visits every page on that website to discover product pages, and if so, how frequently do you re-run this process? Or is there no 'discovery' process and you use an alternative method? Also, how do you merge listings for the same product on multiple websites? I'm very curious how you make sure deal ferret doesn't miss products, and how it keeps up to date when these shops add new product listings.

brejc8 commented 5 years ago

One thing I was curious about: how do you go about enumerating the product listings from each website?

Here is what a JSON entry for one product looks like:

{'image': u'https://img.tesco.com/Groceries/pi/076/5000462938076/IDShot_225x225.jpg',
 'name': 'Tesco Basic Napkins 30Cm 100 Pack',
 'path': [u'Home & Ents',
          u'Party Decorations & Party Supplies',
          u'Party Tableware',
          u'Napkins'],
 'price': 0.5,
 'promotions': ['Any 4 for 3 Cheapest Product Free'],
 'uid': 257391879L,
 'url': 'https://www.tesco.com/groceries/en-GB/products/257391879'}

This is the info the scraper will get from the site. A scrape may have a specific item more than once, as it may exist in multiple sections. There is a UID in there. For every store, each product will have one UID no matter which section it was in. You can see the UID exists in the URL. The UIDs are unique with respect to each store. Each store will have its own enumeration scheme. Frustratingly, some stores use a string as the product identifier, in which case I have to hash it into a 64-bit number and make sure there are no collisions. This is possible if the strings are length limited. When loading into the dealferret database, a lookup is done with key StoreID,UID to get the ProductID. In this case it maps to Product ID 10779 (https://dealferret.uk/product.php?id=10779&ref=brejc8).
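
For the stores that use string identifiers, the hashing step is along these lines (a sketch of the idea rather than the exact scheme; the function name and the collision bookkeeping are illustrative):

import hashlib

def uid_from_string(product_key, seen_uids):
    # Sketch: derive a 64-bit UID from a store's string product identifier.
    # seen_uids maps already-assigned UIDs back to their source strings so a
    # collision is caught immediately; names here are illustrative only.
    digest = hashlib.md5(product_key.encode("utf-8")).digest()
    uid = int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF  # fits a signed bigint(20)
    previous = seen_uids.setdefault(uid, product_key)
    if previous != product_key:
        raise ValueError("64-bit collision between %r and %r" % (product_key, previous))
    return uid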

When adding a new website to deal ferret, do you run some kind of crawler that visits every page on that website to discover product pages, and if so, how frequently do you re-run this process? Or is there no 'discovery' process and you use an alternative method?

There is a daily crawler which scrapes all the products it can find off each site. It only looks at the listings pages and produces the JSON elements above. Each scrape may have new products (in which case they are added to the database). There may be products temporarily missing, due to the product being out of stock, temporarily removed, or the site being down at the time. Each day a product is seen, the database last_seen value is updated. This tracks how long it has been since a product was last available. If it has been a week, it no longer appears in search results. A month, and we delete the image (to save space; we can always download it again). If it has been a year, it is deleted. Here are all the fields in the database:


id              int(11)     
storeid         int(11)     
uid             bigint(20)  
name            text
url             text
imageurl        text
price           decimal(10,2)   
multiprice      decimal(10,2)   
group_same      int(11)     
first_seen      date    
last_seen       date    
process         int(11)     
crc             int(11)

The CRC is a quick way to see if the name/imageurl/url need to be updated, without checking each one. First_seen is for the age of the product. Process is a flag to say the name should be processed and added to the tag search table (a separate task). The group_same is a link to another table which holds references to all products that are the same between stores. If multiple products have the same group_same, they are the same product, just from different stores.
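
Putting the crc and last_seen columns together, the daily load for one store works roughly like this (a simplified sketch rather than the actual code: the DB-API cursor, the table name products and the MySQL-style placeholders are assumed for illustration):

import zlib
from datetime import date, timedelta

def load_scrape(cur, storeid, items):
    # Apply one day's scrape for a store; cur is a DB-API cursor (sketch only).
    today = date.today()
    for item in items:
        # A CRC over the rarely-changing fields lets unchanged rows skip the text update.
        crc = zlib.crc32("|".join([item["name"], item["image"], item["url"]]).encode()) & 0x7FFFFFFF
        cur.execute("SELECT id, crc FROM products WHERE storeid=%s AND uid=%s",
                    (storeid, item["uid"]))
        row = cur.fetchone()
        if row is None:
            cur.execute("INSERT INTO products (storeid, uid, name, url, imageurl, price,"
                        " first_seen, last_seen, crc) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                        (storeid, item["uid"], item["name"], item["url"], item["image"],
                         item["price"], today, today, crc))
        elif row[1] != crc:
            cur.execute("UPDATE products SET name=%s, url=%s, imageurl=%s, price=%s,"
                        " last_seen=%s, crc=%s WHERE id=%s",
                        (item["name"], item["url"], item["image"], item["price"],
                         today, crc, row[0]))
        else:
            cur.execute("UPDATE products SET price=%s, last_seen=%s WHERE id=%s",
                        (item["price"], today, row[0]))
    # The age rules then run against last_seen: hidden from search after a week,
    # image dropped after a month, row deleted after a year.
    cur.execute("DELETE FROM products WHERE last_seen < %s",
                (today - timedelta(days=365),))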

Also, how do you merge listings for the same product on multiple websites?

Grouping is done either through magic or through a manual process. The magic does image comparisons when it downloads a new image, to see if the image has already been seen. If so, it also checks the names for similar components. If it goes past a threshold, it decides it must be the same product. This works reasonably well, but fails badly in some situations, e.g. memory modules which have the same image and 90% of the description but different sizes.
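
The magic is essentially an image fingerprint plus a name-similarity threshold. A rough sketch of that idea (not the actual matching code; imagehash, Pillow and difflib are illustrative choices here):

import imagehash
from PIL import Image
from difflib import SequenceMatcher

def probably_same_product(image_path_a, image_path_b, name_a, name_b,
                          hash_threshold=5, name_threshold=0.8):
    # Sketch only: thresholds and libraries are illustrative, not the real code.
    # A perceptual hash distance near 0 means the images look the same.
    distance = imagehash.phash(Image.open(image_path_a)) - imagehash.phash(Image.open(image_path_b))
    if distance > hash_threshold:
        return False
    # Then require the names to be mostly the same before grouping.
    similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return similarity >= name_threshold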

OscarVanL commented 5 years ago

Wow, thank you for the detailed explanation. That's a lot more involved than I expected, with some nice engineering decisions to get around the complications you described. Thank you for taking the time to write that. If you didn't realise, I'm Oscar - the guy who got a whole load of bargain Asahi beer through deal ferret and emailed you about it. I'm a Computer Science student, so I was curious as to how you do all this. The image comparison method is clever, and explains an issue I noticed from the Dealferret Reddit bot where it suggested a best price for an SSD, but in fact it was for a smaller-capacity SSD than the one referenced.