alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.24k stars, 654 forks

Scraper can't find requested data even though site is well-structured and consistent #31

Closed: cstrouse closed this issue 4 years ago

cstrouse commented 4 years ago

This site is consistent and well-structured, with easily located selectors, but autoscraper struggles to scrape the data. I trained it on a few example pages, and it found the data on those successfully. Subsequent attempts to scrape other pages yield missing data, even though their markup is the same as on the training pages.

from autoscraper import AutoScraper

scraper = AutoScraper()
# Train on three strain pages. update=True adds the newly learned rules
# to the existing ones instead of replacing them.
scraper.build("https://www.weedsta.com/strains/blue-dream", ["Blue Dream", "Hybrid", "24.49%", "0.19%", "10 Reviews"], update=True)
scraper.build("https://www.weedsta.com/strains/trainwreck", ["Trainwreck", "Sativa", "18.63%", "0.53%", "3 Reviews"], update=True)
scraper.build("https://www.weedsta.com/strains/sour-diesel", ["Sour Diesel", "Sativa", "22.2%", "0.31%", "8 Reviews"], update=True)

Here's an example where you can see that the percentages are not returned.

>>> scraper.get_result_similar('https://www.weedsta.com/strains/banana-kush', grouped=True)
{'rule_dt56': ['Banana Kush'], 'rule_l8fu': ['Banana Kush'], 'rule_7m0b': [], 'rule_fq5s': ['1 Reviews'], 'rule_4lqv': [], 'rule_bq2d': [], 'rule_mgmx': [], 'rule_pshq': [], 'rule_cnvq': ['Banana Kush'], 'rule_bmx8': ['Banana Kush'], 'rule_3npf': [], 'rule_7ko7': [], 'rule_tfnf': [], 'rule_ia0h': []}
>>> 
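A quick way to see how many rules came back empty is to summarize the grouped result. The snippet below is only a sketch: it copies the dictionary from the output above and uses plain Python, so none of it is part of autoscraper itself.

```python
# Grouped result copied verbatim from the output above.
result = {
    'rule_dt56': ['Banana Kush'], 'rule_l8fu': ['Banana Kush'],
    'rule_7m0b': [], 'rule_fq5s': ['1 Reviews'], 'rule_4lqv': [],
    'rule_bq2d': [], 'rule_mgmx': [], 'rule_pshq': [],
    'rule_cnvq': ['Banana Kush'], 'rule_bmx8': ['Banana Kush'],
    'rule_3npf': [], 'rule_7ko7': [], 'rule_tfnf': [], 'rule_ia0h': [],
}

# Rules that matched nothing on the new page.
empty_rules = sorted(rule for rule, values in result.items() if not values)

# Distinct values that at least one rule did return.
found_values = sorted({value for values in result.values() for value in values})

print(f"{len(empty_rules)} of {len(result)} rules returned nothing")
print("values found:", found_values)
```

Only the name and the review count survive; no rule returns the type or either percentage.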
alirezamika commented 4 years ago

The reason is that the website uses different style attribute values on different pages, so the rules learned from the training pages no longer match exactly. Adding an option to ignore styling may be good.
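To illustrate the effect, here is a minimal sketch of why an exact attribute comparison fails when only the styling differs. The attribute values and the attrs_match helper are hypothetical, since the real markup is not shown in this thread, and this is not autoscraper's actual matching code.

```python
# Hypothetical attribute sets for the same element on two pages.
# The real weedsta.com markup is not shown in this thread.
learned = {"class": "progress-bar", "style": "width: 24.49%"}
candidate = {"class": "progress-bar", "style": "width: 18.2%"}


def attrs_match(learned, candidate, ignored=()):
    """Return True if every learned attribute (except ignored ones) is equal."""
    return all(
        candidate.get(name) == value
        for name, value in learned.items()
        if name not in ignored
    )


# Exact comparison: the differing style value breaks the match.
print(attrs_match(learned, candidate))
# Same comparison with styling ignored: the element matches again.
print(attrs_match(learned, candidate, ignored=("style",)))
```

Under these assumptions the first call prints False and the second prints True, which is the behaviour an "ignore styling" option would aim for.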

alirezamika commented 4 years ago

Version 1.1.7 adds a new option, attr_fuzz_ratio, which you can use as a workaround for this problem:

scraper.get_result_similar('https://www.weedsta.com/strains/banana-kush', attr_fuzz_ratio=0.8)
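The thread does not describe how attr_fuzz_ratio is applied internally. As a rough intuition only, a similarity threshold of 0.8 can be sketched with the standard library's difflib. The fuzzy_equal helper and the style strings below are assumptions made for illustration, not autoscraper's API or the site's real markup.

```python
from difflib import SequenceMatcher


def fuzzy_equal(a, b, ratio=0.8):
    """Treat two attribute values as equal if their similarity reaches ratio."""
    return SequenceMatcher(None, a, b).ratio() >= ratio


# Hypothetical style values; only the width differs between the two pages.
trained_style = "color: #4caf50; width: 24%"
new_page_style = "color: #4caf50; width: 18%"
unrelated_style = "display: none"

print(trained_style == new_page_style)              # exact comparison fails
print(fuzzy_equal(trained_style, new_page_style))   # near-identical values pass
print(fuzzy_equal(trained_style, unrelated_style))  # unrelated values still fail
```

A lower ratio tolerates more variation between pages but also raises the chance of matching unrelated elements, so it is worth checking the results after changing it.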
cstrouse commented 4 years ago

@alirezamika Works great. Thanks a bunch!