alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
MIT License
6.24k stars, 654 forks

About removing duplicate result #27

Closed by Mervyen 4 years ago

Mervyen commented 4 years ago

I'm sorry to open this issue; I don't know whether it really is one.

In my code, I don't want duplicate results to be removed. I tried commenting out some of the code, but that doesn't seem to work, so I'm opening this issue.

Sorry again for this issue. Please tell me if this is not an issue, and I will delete it.

alirezamika commented 4 years ago

I understand your issue. I will make it an option. Until then, have you tried using the grouped=True parameter? Results within each group are not deduplicated.

Mervyen commented 4 years ago

I haven't tried the grouped option yet. I'll try it later. Thanks!

NickGoto commented 4 years ago

I tried it and it didn't work for me.

alirezamika commented 4 years ago

Please share your URL or HTML content, along with your code, so we can find the problem.

Mervyen commented 4 years ago

The website is "https://selected-cigars.com/en/partagas-serie-d-no-4". My wanted_list looks like ['Partagás - Serie D No. 4 1 piece', 'Sold Out']. The 'Sold Out' item only shows up once in the result.

code:

from autoscraper import AutoScraper

url = 'https://selected-cigars.com/en/partagas-serie-d-no-4'
wanted_list = ['Partagás - Serie D No. 4 1 piece', 'Sold Out']

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

the output: ['Partagás - Serie D No. 4 1 piece', 'Partagás - Serie D No. 4 A/T 1Pc', 'Partagás - Serie D No. 4 A/T 3pcs', 'Partagás - Serie D No. 4 10pcs, wooden Box / 1 box per Customer', 'Partagás - Serie D No. 4 25pcs, wooden Box / 1 Box per Customer', 'Partagás - Serie D No. 4 A/T special 25er Metalltube', 'Sold Out', 'small quantities available']

In the output, the 'Sold Out' item shows up only once, but on the web page it appears more than once, about 4 times.

alirezamika commented 4 years ago

I recommend first using the grouped=True parameter. After analyzing the output, keep the rules you want with the keep_rules method (or drop the unwanted ones with remove_rules). Then, if you want to get the result list, use the unique=False parameter.
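For anyone following along, here is a rough sketch of that workflow, not an official recipe. It assumes an autoscraper version whose get_result_similar accepts unique (the option discussed in this thread) and that grouped=True returns a dict mapping rule IDs to result lists. The helper name scrape_keeping_duplicates and the keep_if callback are made up for illustration; the optional scraper argument only exists so the helper can be exercised without network access.

```python
def scrape_keeping_duplicates(url, wanted_list, keep_if, scraper=None):
    """Learn rules, keep only the rule groups accepted by keep_if,
    then return the results without deduplication.

    keep_if is a hypothetical callback: it receives the list of values
    found by one rule and returns True if that rule should be kept.
    """
    if scraper is None:
        # Third-party package: pip install autoscraper
        from autoscraper import AutoScraper
        scraper = AutoScraper()

    # Step 1: learn scraping rules from the sample values.
    scraper.build(url, wanted_list)

    # Step 2: look at the results per rule (assumed shape: {rule_id: [values]}).
    grouped = scraper.get_result_similar(url, grouped=True)

    # Step 3: keep only the rules whose output we actually want.
    rule_ids = [rule_id for rule_id, values in grouped.items() if keep_if(values)]
    scraper.keep_rules(rule_ids)

    # Step 4: fetch the flat result list, this time keeping duplicates.
    return scraper.get_result_similar(url, unique=False)


# Example call (performs real HTTP requests, so it is left commented out):
# statuses = scrape_keeping_duplicates(
#     'https://selected-cigars.com/en/partagas-serie-d-no-4',
#     ['Sold Out'],
#     keep_if=lambda values: 'Sold Out' in values,
# )
# print(statuses)
```

In practice you would probably print the grouped output once and pick the rule IDs by hand; the callback is just one way to automate that choice.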

Mervyen commented 4 years ago

THX

go-delicious commented 4 years ago

I was wondering how this works:

k: v if v != [] else '' for k, v in item.attrs.items() if k in key_attrs

I'm guessing it's shorthand for something. I didn't open a new issue as wanting to know how the code works doesn't seem like one.

alirezamika commented 4 years ago

It's creating a new dict from item.attrs, containing only the keys that are present in key_attrs. It also converts values of [] to ''.
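To make that concrete, here is a small standalone sketch showing the expression as a dict comprehension (wrapped in braces) next to an equivalent plain loop. The function names and the sample values are made up for illustration and are not taken from the autoscraper source; attrs stands in for item.attrs.

```python
def filter_attrs(attrs, key_attrs):
    # Dict comprehension: the trailing "if k in key_attrs" filters the keys,
    # while "v if v != [] else ''" is a conditional expression for the value.
    return {k: v if v != [] else '' for k, v in attrs.items() if k in key_attrs}


def filter_attrs_loop(attrs, key_attrs):
    # The same logic written out step by step.
    result = {}
    for k, v in attrs.items():
        if k in key_attrs:           # keep only the wanted keys
            if v != []:
                result[k] = v        # keep the value as it is
            else:
                result[k] = ''       # replace an empty list with ''
    return result


# Hypothetical sample data
sample_attrs = {'class': [], 'style': 'color: red', 'id': 'main'}
sample_keys = {'class', 'style'}

print(filter_attrs(sample_attrs, sample_keys))
# → {'class': '', 'style': 'color: red'}
print(filter_attrs(sample_attrs, sample_keys) == filter_attrs_loop(sample_attrs, sample_keys))
# → True
```

Note that 'id' is dropped because it is not in sample_keys, and the empty list under 'class' becomes an empty string.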