MassProspecting / docs

Public documentation, roadmap and issue tracker of MassProspecting
http://doc.massprospecting.com/

Zyte is not scraping Indeed. #154

Closed leandrosardi closed 2 months ago

leandrosardi commented 3 months ago

Problem

curl \
  --user <ZYTE API KEY HERE>: \
  --header 'Content-Type: application/json' \
  --data '{ 
 "url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
 "httpResponseBody": true
  }' \
  --compressed "https://api.zyte.com/v1/extract" 

Getting response:

{"type":"/download/temporary-error","title":"Temporary Downloading Error","status":520,"detail":"There is a downloading problem which might be temporary. Retry in N seconds from 'Retry-After' header or open a support ticket from https://support.zyte.com/support/tickets/new if it fails consistently."}
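
The error detail above says to retry after the number of seconds given in the 'Retry-After' header. A minimal sketch of that loop, assuming only status 520 (the one shown above) is retryable; `retry_delay`, `call_with_retry`, and the `send` callable are hypothetical names for illustration, not part of any Zyte client:

```python
import time

# Only 520 appears in the error above; adding other statuses is an assumption.
RETRYABLE_STATUSES = {520}


def retry_delay(status, headers, default=5.0):
    """Return the seconds to wait before retrying, or None if no retry applies."""
    if status not in RETRYABLE_STATUSES:
        return None
    raw = headers.get("Retry-After")
    try:
        return max(0.0, float(raw))
    except (TypeError, ValueError):
        # Header missing or not a plain number of seconds: fall back to a default.
        return default


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out.

    send is a hypothetical zero-argument callable returning
    (status, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        delay = retry_delay(status, headers)
        if delay is None or attempt == max_attempts:
            return status, body
        sleep(delay)
```

With requests, `send` could wrap `requests.post(...)` and return `(response.status_code, response.headers, response.text)`.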
leandrosardi commented 3 months ago

When I try browserHtml instead of httpResponseBody, I get error 500 instead of 520.

curl \
  --user *****************: \
  --header 'Content-Type: application/json' \
  --data '{ 
  "url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
  "browserHtml": true
  }' \
  --compressed "https://api.zyte.com/v1/extract" 
{"type":"/server/internal","title":"Internal Server Error","status":500,"detail":"The server encountered an internal error. Please open a support ticket from https://support.zyte.com/support/tickets/new or wait for us to resolve the issue."}
leandrosardi commented 3 months ago

Response from Zyte support:

Could you please try a normal curl request like the one below:

curl \
  --user $API-KEY: \
  --header 'Content-Type: application/json' \
  --data '{ 
 "url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
 "httpResponseBody": true
  }' \
  --compressed "https://api.zyte.com/v1/extract" 

For Indeed, httpResponseBody alone will not work because of Cloudflare's behaviour. One option is browserHtml, which actually renders the page and returns the HTML directly in the Zyte API response instead of a base64-encoded value.

The other option is sessionContext together with sessionContextParameters, which helps increase the success rate. A sample payload is given below:

curl \
  --user $API-KEY: \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
  "sessionContext": [
    {
      "name": "id",
      "value": "2"
    }
  ],
  "sessionContextParameters": {
    "actions": [
      {
        "action": "waitForTimeout",
        "timeout": 5,
        "onError": "return"
      }
    ]
  },
  "httpResponseBody": true
}' \
  --compressed "https://api.zyte.com/v1/extract" 
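
As noted above, browserHtml comes back as plain HTML while httpResponseBody comes back base64-encoded. A small sketch of handling both shapes of a successful response; `html_from_zyte_response` is a hypothetical helper name, and the UTF-8 default is an assumption about the page encoding:

```python
from base64 import b64decode


def html_from_zyte_response(payload, encoding="utf-8"):
    """Return the page HTML from a parsed Zyte API JSON response (a dict)."""
    if "browserHtml" in payload:
        # Already a plain HTML string.
        return payload["browserHtml"]
    if "httpResponseBody" in payload:
        # Base64-encoded bytes; decode them, then decode the text.
        raw = b64decode(payload["httpResponseBody"])
        return raw.decode(encoding, errors="replace")
    raise KeyError("response has neither browserHtml nor httpResponseBody")
```

Error responses like the 500 and 520 bodies in this thread carry neither field, so the helper raises instead of returning an empty page.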
leandrosardi commented 3 months ago

Using sessionContext with sessionContextParameters is not working:

curl \
   --user **********: \
   --header 'Content-Type: application/json' \
   --data '{
   "url": "https://www.indeed.com/q-$40,000-l-Jacksonville,-FL-jobs.html?radius=0&sort=date&start=10",
   "sessionContext": [
     {
       "name": "id",
       "value": "2"
     }
   ],
   "sessionContextParameters": {
     "actions": [
       {
         "action": "waitForTimeout",
         "timeout": 1,
         "onError": "return"
       }
     ]
   },
   "httpResponseBody": true
 }' \
   --compressed "https://api.zyte.com/v1/extract" 
{"type":"/server/internal","title":"Internal Server Error","status":500,"detail":"The server encountered an internal error. Please open a support ticket from https://support.zyte.com/support/tickets/new or wait for us to resolve the issue."}
leandrosardi commented 3 months ago

Response from Zyte support:

from concurrent.futures import ThreadPoolExecutor
from base64 import b64decode
import io

import requests

API_URL = "https://api.zyte.com/v1/extract"
API_KEY = ""

status_code_counts = {}

def send_request(i):
    response = requests.post(API_URL, auth=(API_KEY, ''), json={
        "url": "https://www.indeed.com/jobs?q=%2435%2C000&l=Miami%2C+FL&sort=date",
        'httpResponseBody': True,
        'httpResponseHeaders' : True,
        "sessionContext": [{"name": "id", "value": "indeed"}],
        "sessionContextParameters": {"actions": [{
            "action": "waitForTimeout",
            "timeout": 5,
            "onError": "return"
        }]},
        "requestHeaders": {"referer": "https://www.google.com/"}
    })
    status_code = response.status_code

    # Update status code count
    if status_code in status_code_counts:
        status_code_counts[status_code] += 1
    else:
        status_code_counts[status_code] = 1

    if 'Retry-After' in response.headers:
        print("Failed: Retry-After header found with status code " + str(status_code))
    elif status_code != 200:
        # Any other non-200 status (e.g. 500, 521) has no httpResponseBody to decode
        print("Server Error: " + str(status_code))
    else:
        print("Success: " + str(status_code))
        decoded_html = b64decode(response.json()['httpResponseBody']).decode()
        name1 = str(i) + "_zyte.html"
        with io.open(name1, "w", encoding="utf-8") as f:
            f.write(str(decoded_html))
        return decoded_html

with ThreadPoolExecutor(max_workers=30) as executor:
    futures = []
    for i in range(100):
        futures.append(executor.submit(send_request, i))
    for future in futures:
        result = future.result()

# Summarize status codes and success rate
total_requests = sum(status_code_counts.values())
success_requests = status_code_counts.get(200, 0)  # Assuming 200 is the success status code

print("\nSummary of Status Codes:")
for code, count in status_code_counts.items():
    print(f"Status Code {code}: {count} times")

success_rate = (success_requests / total_requests) * 100
print(f"\nSuccess Rate: {success_rate:.2f}%")        
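
The script above updates a shared dict from 30 worker threads. A possible variation, sketched here under the assumption that each worker only needs to report its status code, is to return the code from the worker and tally the results in the main thread. `run_batch`, `success_rate`, and `fetch_status` are hypothetical names, not part of the script from Zyte support:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_batch(fetch_status, n_requests, max_workers=30):
    """Run fetch_status(i) for i in range(n_requests) and count returned codes.

    fetch_status is a hypothetical callable that sends one request and
    returns its HTTP status code.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        statuses = list(executor.map(fetch_status, range(n_requests)))
    # Counting happens here, in one thread, so no shared state is mutated.
    return Counter(statuses)


def success_rate(counts, ok_status=200):
    """Return the percentage of requests with ok_status (0.0 if none ran)."""
    total = sum(counts.values())
    return 100.0 * counts[ok_status] / total if total else 0.0
```

For example, `success_rate(run_batch(fetch_status, 100))` gives the same percentage the script prints, without the workers touching a shared counter.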
leandrosardi commented 3 months ago

My response to Zyte:

It seems it is working better now. Thanks.
So, basically, you created a specific context for Indeed on your server side. Am I right?

Answer:

Hi,

Yes, I have made some configuration changes at my end for your account and thank you for the confirmation.

In that case, I will mark this ticket as resolved. If you have further questions, please feel free to reply to this ticket within 48 hours, and it will automatically reopen. 
leandrosardi commented 3 months ago

My message to Zyte:

Hello,

> I have made some configuration changes at my end for your account

I need Indeed scraping to work with any other Zyte account (not only with MY account).

This is because I am working on a SaaS that integrates with Zyte.
So, my clients will connect their own Zyte accounts.

How can we get this working for ANY other Zyte account?

Response from Zyte:

I understand. I have checked with the product team and was informed that the domain [indeed.com](https://indeed.com/) should be working without any issues for the new signup customers.
leandrosardi commented 2 months ago

Reopening.

Hello Nagharajan,

Sadly, I am re-opening this ticket.

> I understand. I have checked with the product team and was informed that the domain [indeed.com](https://indeed.com/) should be working without any issues for the new signup customers.

My first client, Jeremy Morgan, who is copied on this email, signed up for Zyte today, and his new account faces the same issue when trying to scrape Indeed.

Here are all the details of this issue.
https://github.com/MassProspecting/docs/issues/154

Jeremy's email in Zyte is [delonix.aus+zyte@gmail.com](mailto:delonix.aus%2Bzyte@gmail.com).
leandrosardi commented 2 months ago

Zyte support solved this problem the same day I reported it.