crawlbase / proxycrawl-python

ProxyCrawl Python library for scraping and crawling
https://proxycrawl.com
Apache License 2.0

Missing data while using Smart Proxy to scrape mediamarkt. #11

Closed SalmanZafar-DataScience closed 2 years ago

SalmanZafar-DataScience commented 2 years ago

I have been trying to scrape the product listing page at www.mediamarkt.de (https://www.mediamarkt.de/de/category/smartphones-579.html?page=1). For this purpose I am using Smart Proxy, as mediamarkt blocks an IP after a certain number of requests. But with Smart Proxy, even when the status code is 200, the HTML I get back is not the expected page and is missing data. I am attaching the Python script for your reference. Kindly look into it.

proxycrawl -- mediamarkt.pdf

crawlbase commented 2 years ago

Thank you for opening an issue on the project. This seems to be related to the service itself rather than to the Python library.

In any case I have created an internal issue so our team can check the problem and give you an update as soon as possible.

For future service issues, it is better to contact our support team directly, as they will be able to provide you with faster support: https://proxycrawl.com/dashboard/support

crawlbase commented 2 years ago

@SalmanZafar-DataScience we got an update back from the support team. They tested it and found that mediamarkt loads its content dynamically via JavaScript, so a normal request only returns the initial HTML without the product data. You should make JavaScript requests instead of normal requests.

You can do so by using the header 'ProxyCrawlAPI-Parameters': 'javascript=true'

You can read and see an example here: https://proxycrawl.com/docs/smart-proxy/#request-example
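
As a rough illustration of the header above, here is a minimal Python sketch using the requests library. Only the header name and value come from this thread. The proxy host and port, the TOKEN@host proxy URL form, and verify=False are assumptions based on the Smart Proxy docs linked above and should be checked against them. The helper names build_smart_proxy_request and fetch_html, and the YOUR_TOKEN placeholder, are hypothetical.

```python
# Assumption: Smart Proxy endpoint as described in the linked docs; verify before use.
SMART_PROXY_HOST = "smartproxy.proxycrawl.com:8012"


def build_smart_proxy_request(token, javascript=True):
    """Return keyword arguments for requests.get() that route through Smart Proxy.

    `token` is your account token (hypothetical placeholder in the usage below).
    """
    proxy_url = f"http://{token}@{SMART_PROXY_HOST}"
    kwargs = {
        # Send both http and https traffic through the proxy.
        "proxies": {"http": proxy_url, "https": proxy_url},
        # Assumption: the proxy re-signs TLS traffic, so certificate
        # verification is disabled, as curl's -k flag would do.
        "verify": False,
    }
    if javascript:
        # Header from this thread: asks the service to render the page
        # with JavaScript before returning the HTML.
        kwargs["headers"] = {"ProxyCrawlAPI-Parameters": "javascript=true"}
    return kwargs


def fetch_html(url, token):
    """Fetch `url` through Smart Proxy with JavaScript rendering enabled."""
    # Imported here so the helper above works without requests installed.
    import requests

    response = requests.get(url, **build_smart_proxy_request(token))
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    page = "https://www.mediamarkt.de/de/category/smartphones-579.html?page=1"
    html = fetch_html(page, "YOUR_TOKEN")  # replace with a real token
    print(len(html))
```

A 200 status only means the proxy returned a page, so it is still worth checking that the returned HTML actually contains the product listing you expect.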