Hey,
Nice site! Where's the code that does the scanning / crawling of the retailer sites?
Thanks 👍
Thanks. I don't make it public, as the stores will simply ban the scrapers if they are being hammered by more than one. It is already touchy, with a couple of shops blocking based on user agent. If there is a specific store, I can send it to you. Or I can get you access to the daily scraped info.
Ah, no worries. I was just really interested in how you were scraping the sites. It's something I'm interested in, and I thought (as you mentioned) you'd be having issues with the retailers complaining.
In case you were wondering, I use Scrapy as the backbone of the scrapers. There are over 80 of them, and they take a home dual-core 2.7GHz PC about 4 hours to process. It is mostly CPU bound, as you can execute several scrapers in parallel. I generate JSON files with the data for each store, which get sent over to the webserver, which updates the database. The webserver also downloads and processes the images if it is a new item.
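For a feel of the shape of one of these, here is a minimal listings-only Scrapy spider. This is an illustrative sketch, not the actual dealferret code: the store, URLs, and CSS selectors are all invented.

```python
import scrapy

class ExampleStoreSpider(scrapy.Spider):
    # Illustrative only: the store name, start URL and selectors are made up.
    name = "examplestore"
    start_urls = ["https://www.examplestore.co.uk/groceries"]

    def parse(self, response):
        # Listing pages carry everything needed, so individual product
        # pages are never fetched.
        for tile in response.css("div.product-tile"):
            yield {
                "name": tile.css("a.title::text").get(),
                "url": response.urljoin(tile.css("a.title::attr(href)").get()),
                "image": tile.css("img::attr(src)").get(),
                "price": float(tile.css("span.price::text").re_first(r"[\d.]+")),
                "path": response.css("nav.breadcrumb a::text").getall(),
            }
        # Walk the pagination within each section.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run with `scrapy runspider examplestore.py -o examplestore.json` and Scrapy writes a per-store JSON file; several such processes can run in parallel.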
Thanks for that! 👍
One thing I was curious about: how do you go about enumerating the product listings from each website? When adding a new website to deal ferret, do you run some kind of crawler that visits every page on that website to discover product pages, and if so, how frequently do you re-run this process? Or is there no 'discovery' process and you use an alternative method? Also, how do you merge listings for the same product on multiple websites? I'm very curious how you make sure deal ferret doesn't miss products, or keeps up to date when these shops add new product listings.
One thing I was curious about: how do you go about enumerating the product listings from each website?
Here is what a JSON entry for one product looks like:
{'image': u'https://img.tesco.com/Groceries/pi/076/5000462938076/IDShot_225x225.jpg',
'name': 'Tesco Basic Napkins 30Cm 100 Pack',
'path': [u'Home & Ents',
u'Party Decorations & Party Supplies',
u'Party Tableware',
u'Napkins'],
'price': 0.5,
'promotions': ['Any 4 for 3 Cheapest Product Free'],
'uid': 257391879L,
'url': 'https://www.tesco.com/groceries/en-GB/products/257391879'}
This is the info the scraper will get from the site. A scrape may have a specific item more than once, as it may exist in multiple sections. There is a UID in there: for every store, each product will have one UID no matter which section it was in. You can see the UID exists in the URL. The UIDs are unique with respect to each store, and each store will have its own enumeration scheme. Frustratingly, some stores will use a string as the product identifier, in which case I have to hash it into a 64-bit number and make sure there are no collisions. This is feasible because the strings are length-limited. When loading into the dealferret database, a lookup is done with key (StoreID, UID) to get the ProductID. In this case it maps to Product ID 10779 (https://dealferret.uk/product.php?id=10779&ref=brejc8).
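As a sketch of the string-to-UID hashing and the (StoreID, UID) lookup, with the table name and SQL assumed rather than taken from the real dealferret code:

```python
import hashlib

def string_uid(product_code: str) -> int:
    # Stores with string identifiers get hashed down to a 64-bit UID;
    # collisions then have to be checked for, which is practical because
    # the identifier strings are length-limited.
    digest = hashlib.sha1(product_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF

def lookup_product_id(cursor, store_id: int, uid: int):
    # Hypothetical table name; the key is (storeid, uid) as described.
    cursor.execute(
        "SELECT id FROM products WHERE storeid = %s AND uid = %s",
        (store_id, uid),
    )
    row = cursor.fetchone()
    return row[0] if row else None
```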
When adding a new website to deal ferret, do you run some kind of crawler that visits every page on that website to discover product pages, and if so, how frequently do you re-run this process? Or is there no 'discovery' process and you use an alternative method?
There is a daily crawler which scrapes all the products it can find off each site. It only looks at the listings pages and produces the JSON elements above. Each scrape may have new products (in which case they are added to the database). There may be products temporarily missing, due to the product being out of stock, temporarily removed, or the site being down at the time. Each day a product is seen, the database last_seen value is updated. This tracks how long it has been since a product was last available. If it has been a week, the product no longer appears in search results. After a month, its image is deleted (to save space; we can always download it again). After a year, the product itself is deleted. A sketch of this aging pass follows the field list below. Here are all the fields in the database:
id int(11)
storeid int(11)
uid bigint(20)
name text
url text
imageurl text
price decimal(10,2)
multiprice decimal(10,2)
group_same int(11)
first_seen date
last_seen date
process int(11)
crc int(11)
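Here is a sketch of that daily aging pass. The table name and SQL are illustrative, and the real system deletes the cached image file rather than just clearing a column:

```python
import datetime

WEEK, MONTH, YEAR = 7, 30, 365

def hidden_from_search(last_seen: datetime.date, today: datetime.date) -> bool:
    # A week without being seen and the product drops out of search results.
    return (today - last_seen).days > WEEK

def age_products(cursor, today: datetime.date):
    month_ago = today - datetime.timedelta(days=MONTH)
    year_ago = today - datetime.timedelta(days=YEAR)
    # A month unseen: drop the image (it can always be downloaded again).
    cursor.execute(
        "UPDATE products SET imageurl = NULL WHERE last_seen < %s", (month_ago,)
    )
    # A year unseen: delete the product outright.
    cursor.execute("DELETE FROM products WHERE last_seen < %s", (year_ago,))
```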
The CRC is a quick way to see if the name/imageurl/url need to be updated, without checking each one. first_seen records the age of the product. process is a flag to say the name should be processed and added to the tag search table (a separate task). The group_same is a link to another table which holds references to all products that are the same across stores: if multiple products have the same group_same, they are the same product, just from different stores.
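The CRC quick-check could look something like this (a sketch; zlib.crc32 here stands in for whatever checksum is actually used):

```python
import zlib

def product_crc(name: str, imageurl: str, url: str) -> int:
    # One checksum over the mutable text fields: if it matches the stored
    # crc, none of name/imageurl/url needs a database update.
    blob = "\x00".join((name, imageurl, url)).encode("utf-8")
    return zlib.crc32(blob)
```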
Also, how do you merge listings for the same product on multiple websites?
Grouping is done either through magic or through a manual process. The magic does image comparisons when it downloads a new image, to see if the image has already been seen. If so, it also checks the names for similar components. If it goes past a threshold, it decides it must be the same product. This works reasonably well, but fails badly in some situations, e.g. memory modules which have the same image and 90% of the description, but different sizes.
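The rough shape of that matching logic, with made-up thresholds and an exact image-hash comparison standing in for the real image check:

```python
def same_product(image_hash_a: int, image_hash_b: int,
                 name_a: str, name_b: str) -> bool:
    # Grouping sketch: the images must match, then the names must share
    # enough components. Both the tokenisation and the 0.6 cut-off are
    # invented for illustration.
    if image_hash_a != image_hash_b:
        return False
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    overlap = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
    return overlap >= 0.6
```

With identical images and roughly 90% shared name tokens, two different-capacity memory modules sail past any threshold like this, which is exactly the failure mode described above.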
Wow, thank you for the detailed explanation. That's a lot more involved than I expected, with some nice engineering decisions to get around the complications you described. Thank you for taking the time to write that. In case you didn't realise, I'm Oscar, the guy who emailed you after getting a whole load of bargain Asahi beer through deal ferret. I'm a Computer Science student, so I was curious how you do all this. The image comparison method is clever, and it explains an issue I noticed with the Dealferret Reddit bot, where it suggested a best price for an SSD, but in fact the price was for a smaller-capacity SSD than the one referenced.