Closed caroheymes closed 8 months ago
Thanks @caroheymes
Indeed, I hadn't seen your previous post. Not sure why. Sorry!
As a quick check, I just tried to crawl with advertools, and it seems OK (I tested 100 pages)
Here is the sample file:
https://docs.google.com/spreadsheets/d/12E-Sq3mELZPBS3RwZuyev3t5ClHBCQiAqIzNgvuPm4I
I didn't dig deep into the page, but it seems there is some JS causing the popup, and then it's blocking the page. But on the live website it works properly as in the shared file.
Just curious, why do you want to view it offline?
Hello Elias, thanks for your feedback. I am building a Streamlit app to benchmark competitors' prices.
Every day, agents visit and collect the prices of different players and push the data to BigQuery.
I also need to identify which products are unavailable. I've tried several XPath selectors (see the example below), and I definitely think that I am blocked by the cookie wall (see view(response) in Scrapy).
    import advertools as adv
    import pandas as pd

    urls = 'https://www.interflora.fr/p/roses-passion'
    output_file_details = 'export.jl'
    adv.crawl(
        url_list=urls,
        output_file=output_file_details,
        xpath_selectors={'button': '//span[@data-test-id = "button-add-to-cart"].text()'},
        custom_settings={
            'ROBOTSTXT_OBEY': False,
            'DOWNLOAD_DELAY': 1,
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
        },
    )
    df = pd.read_json(output_file_details, lines=True)
    df.columns
    # => Float64Index([], dtype='float64')
The ld+json is unfortunately not enough.
Thanks for your feedback
I think your selector is correct. But it seems this button is generated by JavaScript, because its value is dynamic. I can't find it in the page source, although it's there when you inspect the element.
The word "épuisé" is itself épuisé :)
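The "not in the page source, but visible in the inspector" check can be approximated in code. Below is a minimal sketch using only the standard library; the helper names `AttrFinder` and `in_static_source` are made up for illustration, and the attribute value is simply the one from the selector quoted earlier in this thread.

```python
from html.parser import HTMLParser


class AttrFinder(HTMLParser):
    """Flag whether any start tag carries a given attribute/value pair."""

    def __init__(self, attr, value):
        super().__init__()
        self.target = (attr, value)
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for the current tag
        if self.target in attrs:
            self.found = True


def in_static_source(html, attr="data-test-id", value="button-add-to-cart"):
    """Return True if the element is present in the raw (unrendered) HTML."""
    finder = AttrFinder(attr, value)
    finder.feed(html)
    return finder.found


# Element present in static markup vs. a page that builds it with JS.
print(in_static_source('<span data-test-id="button-add-to-cart">Ajouter</span>'))
print(in_static_source('<div id="root"></div><script>render()</script>'))
```

If this returns False for the HTML a crawler downloaded while the browser inspector shows the element, the element is most likely added by JavaScript after the page loads, and a plain HTTP crawl will not see it.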
Maybe check out Playwright or other solutions. I'm thinking of options to tackle such situations. Hope this makes sense.
Hello Elias, I did not know Playwright before. It works fine for bypassing the JS wall! Thanks a million
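For readers landing here, this is a rough sketch of what the Playwright approach might look like, not the code actually used in this thread. It assumes Playwright for Python and its Chromium browser are installed. The function names `fetch_button_text` and `is_sold_out` are hypothetical, the CSS selector is a translation of the XPath quoted above, and treating "épuisé" as the sold-out marker is an assumption based on the comment about that word.

```python
def fetch_button_text(url,
                      selector='span[data-test-id="button-add-to-cart"]',
                      timeout_ms=15000):
    """Render the page in headless Chromium and return the button's text."""
    # Imported lazily so is_sold_out() below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # inner_text() waits up to timeout_ms for the element to appear,
            # which gives the page's JavaScript time to create the button.
            return page.locator(selector).first.inner_text(timeout=timeout_ms)
        finally:
            browser.close()


def is_sold_out(button_text, marker="épuisé"):
    """Assumption: an unavailable product shows the marker word on the button."""
    return marker.lower() in button_text.lower()


# Example call (needs network access and an installed browser):
# text = fetch_button_text("https://www.interflora.fr/p/roses-passion")
# print(text, is_sold_out(text))
```

A cookie banner may still need to be dismissed before the button is reachable; the selector for that is site-specific and is not shown here.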
Hello Elias, I had already posted the topic some time ago on https://github.com/eliasdabbas/advertools/discussions/328, but I don't think you had seen it.
Thank you for the fantastic work you're doing with advertools.
However, I have an issue with websites that have a cookie wall, like on https://www.interflora.fr/p/roses-passion.
When I run view(response) in the Scrapy shell, I can clearly see that I am blocked. There is absolutely no element like the title, the button, or body_text.
So, I was wondering if you might have a fantastic idea to work around this issue.
Thanks a million !