Hardeepex / webscraper


Sweep: i want to modify webscraper code #4

Closed · Hardeepex closed this issue 10 months ago

Hardeepex commented 10 months ago


This is the web scraping code:

```python
import httpx
from selectolax.parser import HTMLParser
import time


def get_html(baseurl, page):
    # Fetch one listing page and return its parsed HTML, or False on an HTTP error.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
    }
    resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
    try:
        resp.raise_for_status()
        return HTMLParser(resp.text)
    except httpx.HTTPStatusError as exc:
        print(f"Error response {exc.response.status_code} while requesting {exc.request.url!r}. Page Limit Exceeded")
        return False


def extract_text(html, sel):
    # Return the text of the first node matching sel, or None if nothing matches.
    try:
        return html.css_first(sel).text()
    except AttributeError:
        return None


def parse_page(html):
    # Yield a dict of name/price/savings for each product card on the page.
    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
    for product in products:
        item = {
            "name": extract_text(product, "span[data-ui='product-title']"),
            "price": extract_text(product, "span[data-ui=sale-price]"),
            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
        }
        yield item


def main():
    baseurl = "https://www.rei.com/c/camping-and-hiking/f/scd-deals?page="
    for x in range(1, 100):
        print(f"Gathering page: {x}")
        html = get_html(baseurl, x)
        if html is False:
            break
        data = parse_page(html)
        for item in data:
            print(item)
        time.sleep(1)  # be polite between page requests


if __name__ == "__main__":
    main()
```

I have uploaded the data.txt file with the HTML output I got from https://github.com/Hardeepex/webscraper/blob/main/src/rawyhtmlscraper.py

This is the single-product scraper: https://github.com/Hardeepex/webscraper/blob/main/src/singleproduct.py

I want to modify this scraper file: https://github.com/Hardeepex/webscraper/blob/main/src/scraping.py

Based on the data.txt file, create a scraper that scrapes the product details from all pages. Right now my scraper is able to explore the pages, but I don't know why the output is blank.
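Since the pages load but nothing is printed, a quick way to isolate the problem is to run the selectors against the saved HTML offline. A minimal sketch, assuming `data.txt` holds one page of the raw HTML (the class and `data-ui` selectors are the ones from the script above):

```python
from selectolax.parser import HTMLParser

# Parse the saved page and count how many product cards the selector finds.
with open("data.txt", encoding="utf-8") as f:
    html = HTMLParser(f.read())

products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
print(f"Matched {len(products)} product cards")

# Spot-check the title selector on the first few cards.
for product in products[:3]:
    node = product.css_first("span[data-ui='product-title']")
    print(node.text() if node else None)
```

If the count is zero, the class name or `data-ui` attributes no longer match the markup in `data.txt`, which would explain the blank output.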

Checklist

- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/aedc0716be9900fb3a470aed1a78172a9ac42088 [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L5-L15)
- [X] Running GitHub Actions for `src/scraping.py` ✓ [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L5-L15)
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/a260716eec4a598b258a3c8cc0a449ebd3ab1b0f [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L17-L21)
- [X] Running GitHub Actions for `src/scraping.py` ✓ [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L17-L21)
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/a11932f214bbae6a87d3848eb942327bedb1a016 [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L23-L31)
- [X] Running GitHub Actions for `src/scraping.py` ✓ [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L23-L31)
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/006569abc2e62d93af8eca63998ca44f05ae7401 [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L33-L43)
- [X] Running GitHub Actions for `src/scraping.py` ✓ [Edit](https://github.com/Hardeepex/webscraper/edit/sweep/i_want_to_modify_webscraper_code/src/scraping.py#L33-L43)
sweep-ai[bot] commented 10 months ago

🚀 Here's the PR! #5

See Sweep's progress at the progress dashboard!
💎 Sweep Pro: I'm using GPT-4. You have unlimited GPT-4 tickets. (tracking ID: 8422ac6aec)

> [!TIP]
> I'll email you at hardeep.ex@gmail.com when I complete this pull request!


Actions

Sandbox Execution ✓

Here are the sandbox execution logs prior to making any changes:

Sandbox logs for b24f724
Checking src/scraping.py for syntax errors...
✅ src/scraping.py has no syntax errors!
1/1 ✓

Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.


Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If some file is missing from here, you can mention the path in the ticket description.

https://github.com/Hardeepex/webscraper/blob/b24f724b78db7226bbcaa42c7399eb24e12ba683/src/scraping.py#L1-L46
https://github.com/Hardeepex/webscraper/blob/b24f724b78db7226bbcaa42c7399eb24e12ba683/src/rawyhtmlscraper.py#L1-L65
I also found the following external resources that might be helpful:

**Summaries of links found in the content:**

https://github.com/Hardeepex/webscraper/blob/main/src/singleproduct.py: The page contains code for a web scraping script written in Python. The script uses the `httpx` library to send HTTP requests and the `selectolax` library to parse HTML. The main function of the script is to scrape product details from multiple pages on the website "https://www.rei.com/c/camping-and-hiking/f/scd-deals". It does this by iterating over a range of page numbers and calling the `get_html` function to retrieve the HTML content of each page. The `parse_page` function is then called to extract the desired product details from the HTML. The script also includes a helper function called `extract_text` which uses CSS selectors to extract text from HTML elements. The user wants to modify the `scraping.py` file to scrape product details from all pages based on the data in the `data.txt` file. They mention that their current scraper is able to explore the pages but the output is blank.

https://github.com/Hardeepex/webscraper/blob/main/src/scraping.py: The page contains a Python script for web scraping. The script uses the `httpx` library to send HTTP requests and the `selectolax` library to parse HTML. The `get_html` function takes a base URL and a page number as input and sends a GET request to the specified URL with the page number appended. It sets the User-Agent header to mimic a web browser and follows redirects. If the response is successful, it returns the parsed HTML using the `HTMLParser` class. If there is an HTTP status error, it prints an error message and returns False. The `extract_text` function takes the parsed HTML and a CSS selector as input and attempts to extract the text content of the first element matching the selector. If the element is not found, it returns None. The `parse_page` function takes the parsed HTML and extracts information about products. It selects all `li` elements with the class `VcGDfKky_dvNbxUqmZ9K` and extracts the name, price, and savings information using the `extract_text` function. It yields a dictionary containing the extracted information for each product. The `main` function sets the base URL for the website to scrape and iterates over page numbers from 1 to 99. It calls the `get_html` function to retrieve the HTML for each page and checks if it is False (indicating an error). If the HTML is valid, it calls the `parse_page` function to extract product information and prints each item. It also includes a delay of 1 second between requests. The script is meant to be executed directly when run as the main module. The user wants to modify the `scraping.py` file to scrape product details from all pages using the data from the `data.txt` file. They mentioned that the current scraper is able to explore the pages but the output is blank.

https://www.rei.com/c/camping-and-hiking/f/scd-deals?page: The page you provided is titled "Access Denied" and it states that the user does not have permission to access a specific URL on the server. The page also includes a reference number. The problem the user is trying to solve involves web scraping. They have provided code that uses the HTTPX library and the Selectolax parser to scrape data from a website. The code makes requests to a base URL with different page numbers and extracts information such as product names, prices, and savings. The user wants to modify their existing scraper to scrape product details from all pages, but they are currently getting blank output. They have also mentioned that they have uploaded an HTML output file and a single product scraper file.

https://github.com/Hardeepex/webscraper/blob/main/src/rawyhtmlscraper.py: The page contains a Python script for web scraping. The script uses the `httpx` library to make HTTP requests and the `selectolax` library to parse HTML. The `get_html` function takes a base URL and a page number as input and retrieves the HTML content of the specified page. It sets the User-Agent header and handles HTTP errors. The `extract_text` function extracts text from HTML nodes based on a given selector. It uses the `css_first` method of the HTMLParser object to select the first matching node and returns its text content. The `parse_page` function takes the parsed HTML as input and extracts product details from the HTML nodes. It selects specific elements using CSS selectors and creates a dictionary with the extracted information. It yields each dictionary as an item. The `main` function is the entry point of the script. It defines the base URL for the website to scrape and iterates over a range of page numbers. It calls the `get_html` function to retrieve the HTML for each page and checks if the HTML is valid. If it is, it calls the `parse_page` function to extract the product details and prints them. It also includes a sleep function to avoid overwhelming the server with requests. The user wants to modify the `scraping.py` file to scrape product details from all pages based on the `data.txt` file. They mention that the current scraper is able to explore the pages but the output is blank.
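One detail in these summaries stands out: the REI URL itself came back as an "Access Denied" page. A block page still parses as valid HTML, so `get_html` would succeed while `parse_page` finds nothing, which matches the blank output. A hedged guard, assuming the block page keeps that title:

```python
from selectolax.parser import HTMLParser

def looks_blocked(html: HTMLParser) -> bool:
    # Hypothetical check: the summaries report REI serving a page
    # titled "Access Denied" when it refuses the request.
    title = html.css_first("title")
    return title is not None and "Access Denied" in title.text()
```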

Step 2: ⌨️ Coding

```diff
--- 
+++ 
@@ -4,9 +4,12 @@
 
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
 
         resp.raise_for_status()
```

Ran GitHub Actions for aedc0716be9900fb3a470aed1a78172a9ac42088:
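The diff above returns `False` on an empty body, which stops the whole run at the first blank page. If blank responses can be transient, a retry with backoff is a gentler fallback; this is a sketch using only standard `httpx` calls, not part of Sweep's change:

```python
import time
import httpx
from selectolax.parser import HTMLParser

def get_html_with_retries(baseurl, page, attempts=3):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
    }
    for attempt in range(1, attempts + 1):
        resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
        if resp.status_code == 200 and resp.text:
            return HTMLParser(resp.text)
        print(f"Attempt {attempt}: status {resp.status_code}, {len(resp.text)} characters")
        time.sleep(2 * attempt)  # simple linear backoff before retrying
    return False
```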

```diff
--- 
+++ 
@@ -4,9 +4,12 @@
 
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
 
         resp.raise_for_status()
@@ -26,8 +29,8 @@
     for product in products:
         item = {
            "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
```

Ran GitHub Actions for a260716eec4a598b258a3c8cc0a449ebd3ab1b0f:
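Worth noting that the quoting change above is largely cosmetic: in the CSS attribute-selector grammar, a simple identifier like `sale-price` matches with or without quotes, and selectolax should accept both forms. The substantive change is the savings selector, which now targets `span[data-ui='savings']` instead of `div[data-ui=savings-percent-variant2]`; whether that matches depends on the actual markup in data.txt. A quick check against an inline snippet:

```python
from selectolax.parser import HTMLParser

doc = HTMLParser('<span data-ui="sale-price">$9.99</span>')
print(doc.css_first("span[data-ui=sale-price]").text())    # unquoted form
print(doc.css_first("span[data-ui='sale-price']").text())  # quoted form, same match
```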

```diff
--- 
+++ 
@@ -4,9 +4,12 @@
 
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
 
         resp.raise_for_status()
@@ -22,12 +25,12 @@
         return None
 
 def parse_page(html):
-    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
+    products = html.css("li.VcGDfKKy_dvNbxUqm29K")
     for product in products:
         item = {
            "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
```

Ran GitHub Actions for a11932f214bbae6a87d3848eb942327bedb1a016:
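Hashed class names like `VcGDfKky_dvNbxUqmZ9K` come from CSS build tooling and churn between deploys, so hand-editing one hash into another (as in the diff above) stays fragile. A hedged alternative, assuming each product card is an `<li>` whose title span carries the stable `data-ui='product-title'` attribute, is to anchor on that attribute and walk up to the card (`extract_text` is the helper from the original script):

```python
from selectolax.parser import HTMLParser

def parse_page(html: HTMLParser):
    # Hypothetical approach: find each stable title span, climb to its
    # enclosing <li>, then read the other fields from within that card.
    for title in html.css("span[data-ui='product-title']"):
        card = title.parent
        while card is not None and card.tag != "li":
            card = card.parent
        if card is None:
            continue
        yield {
            "name": title.text(),
            "price": extract_text(card, "span[data-ui='sale-price']"),
            "savings": extract_text(card, "span[data-ui='savings']"),
        }
```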

```diff
--- 
+++ 
@@ -4,9 +4,12 @@
 
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
 
         resp.raise_for_status()
@@ -22,12 +25,12 @@
         return None
 
 def parse_page(html):
-    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
+    products = html.css("li.VcGDfKKy_dvNbxUqm29K")
     for product in products:
         item = {
            "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
@@ -37,10 +40,17 @@
         print(f"Gathering page: {x}")
         html = get_html(baseurl, x)
         if html is False:
+            # If getting HTML fails, log an error message and break from the loop to stop further processing
+            print(f'Error occurred when fetching page {x}. Stopping the scraping process.')
             break
         data = parse_page(html)
-        for item in data:
-            print(item)
+        # Open a file in append mode to save the product details
+        with open('product_details.txt', 'a') as file:
+            for item in data:
+                # Writing product details to the file
+                file.write(f'{item}\n')
+
+        # Delay between requests to avoid overloading the server
         time.sleep(1)
 
 if __name__ == "__main__":
```

Ran GitHub Actions for 006569abc2e62d93af8eca63998ca44f05ae7401:
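One follow-up on the file-writing change: `file.write(f'{item}\n')` stores Python dict reprs, which are awkward to parse back later. If machine-readable output matters, JSON Lines via the standard library is a small swap (the `.jsonl` filename is just a suggestion):

```python
import json

# Inside main(), replacing the plain-text write:
with open("product_details.jsonl", "a", encoding="utf-8") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")  # one JSON object per line
```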


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/i_want_to_modify_webscraper_code.




💡 To recreate the pull request, edit the issue title or description. To tweak the pull request, leave a comment on the pull request. Join Our Discord