Closed leandrosardi closed 2 months ago
When I try `browserHtml` instead of `httpResponseBody`, I get error 500 instead of 502.
curl \
--user *****************: \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
"browserHtml": true
}' \
--compressed "https://api.zyte.com/v1/extract"
{"type":"/server/internal","title":"Internal Server Error","status":500,"detail":"The server encountered an internal error. Please open a support ticket from https://support.zyte.com/support/tickets/new or wait for us to resolve the issue."}
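The error body above is JSON with `type`, `title`, `status` and `detail` fields. A minimal sketch, assuming only those fields as they appear in this response, of reducing it to a one-line summary for logging. `summarize_error` is a hypothetical helper name, not part of any Zyte library, and the `detail` text in the sample is shortened:

```python
import json


def summarize_error(body: str) -> str:
    """Return a one-line summary of an error body shaped like the one above."""
    data = json.loads(body)
    # .get() keeps a missing field from raising; only these keys are assumed,
    # based on the response shown in this issue.
    status = data.get("status", "?")
    error_type = data.get("type", "?")
    title = data.get("title", "?")
    return f"{status} {error_type}: {title}"


# Sample copied from the response above, with the detail text shortened.
sample = (
    '{"type":"/server/internal","title":"Internal Server Error",'
    '"status":500,"detail":"The server encountered an internal error."}'
)
print(summarize_error(sample))
```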
Could you please try a normal curl request like below:
curl \
--user "$API_KEY:" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
"httpResponseBody": true
}' \
--compressed "https://api.zyte.com/v1/extract"
For Indeed, using `httpResponseBody` alone will not work because of Cloudflare's behaviour. You can either use `browserHtml`, which actually renders the page and returns the HTML in the Zyte API response instead of a base64-encoded value,
or use `sessionContext` with `sessionContextParameters`, which helps increase the success rate. A sample payload is given below:
curl \
--user "$API_KEY:" \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
"sessionContext": [
{
"name": "id",
"value": "2"
}
],
"sessionContextParameters": {
"actions": [
{
"action": "waitForTimeout",
"timeout": 5,
"onError": "return"
}
]
},
"httpResponseBody": true
}' \
--compressed "https://api.zyte.com/v1/extract"
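The difference between the two output fields described in this reply can be sketched in Python. Per the reply, `browserHtml` comes back as HTML text while `httpResponseBody` is base64-encoded. `extract_html` is a hypothetical helper, and decoding as UTF-8 is an assumption (the real page may use another encoding):

```python
from base64 import b64decode, b64encode


def extract_html(api_response: dict) -> str:
    """Return page HTML from a parsed Zyte API response body."""
    if "browserHtml" in api_response:
        # Browser rendering: the HTML is already a plain string.
        return api_response["browserHtml"]
    if "httpResponseBody" in api_response:
        # Raw HTTP body: base64-encoded bytes, decoded as UTF-8 (assumption).
        raw = b64decode(api_response["httpResponseBody"])
        return raw.decode("utf-8", errors="replace")
    raise KeyError("response has neither browserHtml nor httpResponseBody")


# Offline demonstration with made-up payloads (no request is sent).
page = "<html><body>jobs</body></html>"
print(extract_html({"browserHtml": page}))
print(extract_html({"httpResponseBody": b64encode(page.encode()).decode()}))
```

Both calls print the same HTML, which is the practical point: downstream parsing code does not need to care which field was requested.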
Using `sessionContext` with `sessionContextParameters` is not working:
curl \
--user **********: \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
"sessionContext": [
{
"name": "id",
"value": "2"
}
],
"sessionContextParameters": {
"actions": [
{
"action": "waitForTimeout",
"timeout": 1,
"onError": "return"
}
]
},
"httpResponseBody": true
}' \
--compressed "https://api.zyte.com/v1/extract"
{"type":"/server/internal","title":"Internal Server Error","status":500,"detail":"The server encountered an internal error. Please open a support ticket from https://support.zyte.com/support/tickets/new or wait for us to resolve the issue."}
from concurrent.futures import ThreadPoolExecutor
from w3lib.http import basic_auth_header
import requests
from base64 import b64decode
import json
import io

API_URL = "https://api.zyte.com/v1/extract"
API_KEY = ""

status_code_counts = {}


def send_request(i):
    response = requests.post(API_URL, auth=(API_KEY, ''), json={
        "url": "https://www.indeed.com/jobs?q=%2435%2C000&l=Miami%2C+FL&sort=date",
        'httpResponseBody': True,
        'httpResponseHeaders': True,
        "sessionContext": [{"name": "id", "value": "indeed"}],
        "sessionContextParameters": {"actions": [{
            "action": "waitForTimeout",
            "timeout": 5,
            "onError": "return"
        }]},
        "requestHeaders": {"referer": "https://www.google.com/"}
    })
    status_code = response.status_code

    # Update status code count
    if status_code in status_code_counts:
        status_code_counts[status_code] += 1
    else:
        status_code_counts[status_code] = 1

    if 'Retry-After' in response.headers:
        print("Failed: Retry-After header found with status code " + str(status_code))
    elif status_code == 500 or response.status_code == 521:
        print("Server Error: " + str(status_code))
    else:
        print("Success: " + str(status_code))
        decoded_html = b64decode(response.json()['httpResponseBody']).decode()
        name1 = str(i) + "_zyte.html"
        with io.open(name1, "w", encoding="utf-8") as f:
            f.write(str(decoded_html))
        return decoded_html


with ThreadPoolExecutor(max_workers=30) as executor:
    futures = []
    for i in range(100):
        futures.append(executor.submit(send_request, i))
    for future in futures:
        result = future.result()

# Summarize status codes and success rate
total_requests = sum(status_code_counts.values())
success_requests = status_code_counts.get(200, 0)  # Assuming 200 is the success status code

print("\nSummary of Status Codes:")
for code, count in status_code_counts.items():
    print(f"Status Code {code}: {count} times")

success_rate = (success_requests / total_requests) * 100
print(f"\nSuccess Rate: {success_rate:.2f}%")
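The script above updates a shared dictionary from 30 worker threads, and such read-modify-write updates are not guaranteed to be atomic. A possible alternative, sketched here under the assumption that each worker simply returns its status code, is to tally the results in the main thread with `collections.Counter`. `fake_request` is a hypothetical stand-in for `send_request` so the example runs without an API key:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fake_request(i: int) -> int:
    """Hypothetical stand-in for send_request: returns a status code only."""
    return 500 if i % 4 == 0 else 200


def summarize(status_codes):
    """Return (counts, success_rate_percent), treating 200 as success."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    rate = (counts[200] / total) * 100 if total else 0.0
    return counts, rate


with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in submission order.
    codes = list(executor.map(fake_request, range(8)))

counts, rate = summarize(codes)
for code, count in sorted(counts.items()):
    print(f"Status Code {code}: {count} times")
print(f"Success Rate: {rate:.2f}%")
```

Because the counting happens after the workers finish, no lock is needed, and an empty run reports a 0% success rate instead of dividing by zero.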
My response to Zyte:
It seems it is working better now. Thanks.
So, basically, you created a specific context for Indeed on your server side. Am I right?
Answer:
Hi,
Yes, I have made some configuration changes at my end for your account and thank you for the confirmation.
In that case, I will mark this ticket as resolved. If you have further questions, please feel free to reply to this ticket within 48 hours, and it will automatically reopen.
My message to Zyte:
Hello,
> I have made some configuration changes at my end for your account
I need Indeed scraping to work with any other Zyte account (not only with MY account).
This is because I am working on a SaaS that integrates with Zyte.
So, my clients will connect their own Zyte accounts.
How can we get this working for ANY other Zyte account?
Response from Zyte:
I understand. I have checked with the product team and was informed that the domain [indeed.com](https://indeed.com/) should be working without any issues for the new signup customers.
Reopening.
Hello Nagharajan,
Sadly, I am re-opening this ticket.
> I understand. I have checked with the product team and was informed that the domain [indeed.com](https://indeed.com/) should be working without any issues for the new signup customers.
My first client, Jeremy Morgan, who is copied on this email, signed up to Zyte today, and his new account faces the same issue when trying to scrape Indeed.
Here are all the details of this issue.
https://github.com/MassProspecting/docs/issues/154
Jeremy's email in Zyte is [delonix.aus+zyte@gmail.com](mailto:delonix.aus%2Bzyte@gmail.com).
Zyte support solved this problem the same day I reported it.
Problem
Getting response: