calgo-lab / green-db

The monorepo that powers the GreenDB.
https://calgo-lab.github.io/green-db/

Fix zalando extractor and improve spider #64

Closed BigDatalex closed 2 years ago

BigDatalex commented 2 years ago

Fix the bug in the zalando extractor: https://github.com/calgo-lab/green-db/issues/62

Additionally, I noticed that not all links are found on the SERP pages (regardless of whether the minimal or the scrolling script is used): only about 28 out of 80 links were found per SERP page. I don't know whether Zalando also changed the SERP pages or whether the original "most class" selector approach never extracted all links. It probably was not working before either, because the last crawl yielded only 4000 shoes, while the "zalando sustainable women shoes" listing shows more than 8000 shoes for this category alone.
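To illustrate the kind of problem described above, here is a minimal sketch of link extraction that does not depend on CSS class names. It is not the code from this PR. It assumes, purely for illustration, that product links can be recognized by a path ending in `.html`; the names `ProductLinkParser`, `extract_product_links`, `SAMPLE_SERP` and the `shop.example` domain are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductLinkParser(HTMLParser):
    """Collect hrefs of <a> tags by URL pattern instead of by CSS class.

    Assumption: a product link is any href whose path ends in ".html".
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore the query string when checking the assumed product pattern.
        if href and href.split("?")[0].endswith(".html"):
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # drop duplicates, keep order
                self.links.append(absolute)


def extract_product_links(html, base_url):
    parser = ProductLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical SERP fragment: the product links carry different classes,
# so a single class selector would miss some of them.
SAMPLE_SERP = """
<div>
  <a class="x1" href="/shoe-a.html">A</a>
  <a class="y2" href="/shoe-b.html?size=38">B</a>
  <a class="x1" href="/shoe-a.html">A again</a>
  <a href="/help/">Help</a>
</div>
"""

links = extract_product_links(SAMPLE_SERP, "https://shop.example/")
print(len(links), links)
```

Comparing `len(links)` against the number of products expected per page (80 in the case above) would give a cheap sanity check for silently missed links.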

BigDatalex commented 2 years ago

Is this PR a superset of #63? There are several changes that look at least similar. The name also suggests that it could be a mistake?

Yes, it is a superset of #63