Closed: Hardeepex closed this pull request 10 months ago.
> [!TIP]
> I'll email you at hardeep.ex@gmail.com when I complete this pull request!
Here are the sandbox execution logs prior to making any changes:
b24f724
Checking src/scraping.py for syntax errors... ✅ src/scraping.py has no syntax errors!
Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.
I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.
src/scraping.py
✓ https://github.com/Hardeepex/webscraper/commit/aedc0716be9900fb3a470aed1a78172a9ac42088
Modify src/scraping.py with contents:
• Update the `get_html` function to handle any additional exceptions that may be causing the blank output. This could involve catching and handling more specific exceptions, or adding additional error checking after the `httpx.get` call.
• If necessary, update the User-Agent header to match the one used in the `src/singleproduct.py` file or another valid User-Agent string. Some websites may block or limit requests from unrecognized or suspicious User-Agents, which could be causing the blank output.
```diff
---
+++
@@ -4,9 +4,12 @@
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
         resp.raise_for_status()
```
src/scraping.py
✓ Ran GitHub Actions for aedc0716be9900fb3a470aed1a78172a9ac42088.
src/scraping.py
✓ https://github.com/Hardeepex/webscraper/commit/a260716eec4a598b258a3c8cc0a449ebd3ab1b0f
Modify src/scraping.py with contents:
• Update the `extract_text` function to use the correct CSS selectors for extracting product details from the HTML. The correct selectors can likely be found in the `src/singleproduct.py` file or the `data.txt` file.
• Ensure that the function correctly handles cases where the desired element is not found in the HTML, returning `None` or an appropriate default value.
```diff
---
+++
@@ -4,9 +4,12 @@
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
         resp.raise_for_status()
@@ -26,8 +29,8 @@
     for product in products:
         item = {
             "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
```
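On the `extract_text` point in the plan above: with selectolax, `css_first` returns `None` when nothing matches, so an explicit guard reads more clearly than catching `AttributeError`. A sketch (the `strip=True` choice is illustrative, not from the repository):

```python
def extract_text(html, sel):
    # css_first returns None when the selector matches nothing,
    # so guard explicitly instead of catching AttributeError.
    node = html.css_first(sel)
    if node is None:
        return None
    return node.text(strip=True)
```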
src/scraping.py
✓ Ran GitHub Actions for a260716eec4a598b258a3c8cc0a449ebd3ab1b0f.
src/scraping.py
✓ https://github.com/Hardeepex/webscraper/commit/a11932f214bbae6a87d3848eb942327bedb1a016
Modify src/scraping.py with contents:
• Update the `parse_page` function to use the correct CSS selector for selecting product elements from the HTML. The correct selector can likely be found in the `src/singleproduct.py` file or the `data.txt` file.
• Ensure that the function correctly extracts all desired details (name, price, savings) from each product element and yields a dictionary with these details.
```diff
---
+++
@@ -4,9 +4,12 @@
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
         resp.raise_for_status()
@@ -22,12 +25,12 @@
         return None

 def parse_page(html):
-    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
+    products = html.css("li.VcGDfKKy_dvNbxUqm29K")
     for product in products:
         item = {
             "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
```
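Put together, the `parse_page` from the diff above reduces to a small generator. The obfuscated class name and the `data-ui` values are whatever the live page happened to use at the time, so treat them as placeholders to verify against data.txt; this sketch pairs the generator with a `None`-safe `extract_text` so missing fields simply come back as `None`:

```python
def extract_text(html, sel):
    # None-safe variant: css_first returns None on no match.
    node = html.css_first(sel)
    return node.text(strip=True) if node is not None else None

def parse_page(html):
    # Each matching <li> is one product card.
    for product in html.css("li.VcGDfKKy_dvNbxUqm29K"):
        yield {
            "name": extract_text(product, "span[data-ui='product-title']"),
            "price": extract_text(product, "span[data-ui='sale-price']"),
            "savings": extract_text(product, "span[data-ui='savings']"),
        }
```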
src/scraping.py
✓ Ran GitHub Actions for a11932f214bbae6a87d3848eb942327bedb1a016.
src/scraping.py
✓ https://github.com/Hardeepex/webscraper/commit/006569abc2e62d93af8eca63998ca44f05ae7401
Modify src/scraping.py with contents:
• Update the `main` function to correctly handle the output from the `parse_page` function. This could involve printing the product details to the console, saving them to a file, or some other form of output.
• Ensure that the function correctly handles cases where the `get_html` function returns `False`, indicating an error. This could involve breaking the loop, logging an error message, or some other form of error handling.
```diff
---
+++
@@ -4,9 +4,12 @@
 def get_html(baseurl, page):
     headers = {
-        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
+        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/111.0"
     }
     resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
+    if resp.text == '':
+        print(f"Blank response for {resp.url}.")
+        return False
     try:
         resp.raise_for_status()
@@ -22,12 +25,12 @@
         return None

 def parse_page(html):
-    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
+    products = html.css("li.VcGDfKKy_dvNbxUqm29K")
     for product in products:
         item = {
             "name": extract_text(product, "span[data-ui='product-title']"),
-            "price": extract_text(product, "span[data-ui=sale-price]"),
-            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
+            "price": extract_text(product, "span[data-ui='sale-price']"),
+            "savings": extract_text(product, "span[data-ui='savings']"),
         }
         yield item
@@ -37,10 +40,17 @@
         print(f"Gathering page: {x}")
         html = get_html(baseurl, x)
         if html is False:
+            # If getting HTML fails, log an error message and break from the loop to stop further processing
+            print(f'Error occurred when fetching page {x}. Stopping the scraping process.')
             break
         data = parse_page(html)
-        for item in data:
-            print(item)
+        # Open a file in append mode to save the product details
+        with open('product_details.txt', 'a') as file:
+            for item in data:
+                # Writing product details to the file
+                file.write(f'{item}\n')
+
+        # Delay between requests to avoid overloading the server
         time.sleep(1)

 if __name__ == "__main__":
```
src/scraping.py
✓ Ran GitHub Actions for 006569abc2e62d93af8eca63998ca44f05ae7401.
I have finished reviewing the code for completeness. I did not find errors for sweep/i_want_to_modify_webscraper_code.
💡 To recreate the pull request, edit the issue title or description. To tweak the pull request, leave a comment on it.
Details
This is the web scraping code:
```python
import httpx
from selectolax.parser import HTMLParser
import time


def get_html(baseurl, page):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0"
    }
    resp = httpx.get(baseurl + str(page), headers=headers, follow_redirects=True)
    try:
        resp.raise_for_status()
        # (the rest of get_html was cut off in the issue)


def extract_text(html, sel):
    try:
        return html.css_first(sel).text()
    except AttributeError:
        return None


def parse_page(html):
    products = html.css("li.VcGDfKky_dvNbxUqmZ9K")
    for product in products:
        item = {
            "name": extract_text(product, "span[data-ui='product-title']"),
            "price": extract_text(product, "span[data-ui=sale-price]"),
            "savings": extract_text(product, "div[data-ui=savings-percent-variant2]"),
        }
        yield item


def main():
    baseurl = "https://www.rei.com/c/camping-and-hiking/f/scd-deals?page="
    for x in range(1, 100):
        print(f"Gathering page: {x}")
        html = get_html(baseurl, x)
        if html is False:
            break
        data = parse_page(html)
        for item in data:
            print(item)
        time.sleep(1)


if __name__ == "__main__":
    main()
```
I have uploaded the data.txt file with the HTML output, which I got from https://github.com/Hardeepex/webscraper/blob/main/src/rawyhtmlscraper.py
This is the single-product scraper: https://github.com/Hardeepex/webscraper/blob/main/src/singleproduct.py
I want to modify this scraper file: https://github.com/Hardeepex/webscraper/blob/main/src/scraping.py
Based on the data.txt file, I want a scraper that collects the product details from all pages. My scraper is able to step through the pages, but for some reason the output is blank.
Checklist
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/aedc0716be9900fb3a470aed1a78172a9ac42088
- [X] Running GitHub Actions for `src/scraping.py` ✓
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/a260716eec4a598b258a3c8cc0a449ebd3ab1b0f
- [X] Running GitHub Actions for `src/scraping.py` ✓
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/a11932f214bbae6a87d3848eb942327bedb1a016
- [X] Running GitHub Actions for `src/scraping.py` ✓
- [X] Modify `src/scraping.py` ✓ https://github.com/Hardeepex/webscraper/commit/006569abc2e62d93af8eca63998ca44f05ae7401
- [X] Running GitHub Actions for `src/scraping.py` ✓