Hardeepex / webscraper


sweep: tried to run the scraper but got the error python3 src/webscraper.py #19

Closed by Hardeepex 9 months ago

Hardeepex commented 9 months ago

python3 src/webscraper.py

Access Denied

Access Denied

You don't have permission to access "http://www.rei.com/" on this server.

Reference #18.140a7c68.1704091323.387bf998

~/WebstormProjects/forbes

Checklist
- [X] Modify `src/webscraper.py` ✓ https://github.com/Hardeepex/webscraper/commit/7d65b1bbd14d1d0395fefd997d8bdcea064c6f15
- [X] Running GitHub Actions for `src/webscraper.py` ✓

sweep-ai[bot] commented 9 months ago

🚀 Here's the PR! #20


Sandbox Execution ✓

Here are the sandbox execution logs prior to making any changes:

Sandbox logs for 63f2a98
Checking src/webscraper.py for syntax errors...
✅ src/webscraper.py has no syntax errors!

Sandbox passed on the latest main, so sandbox checks will be enabled for this issue.


Step 1: 🔎 Searching

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I think are relevant, in decreasing order of relevance. If a file is missing from this list, you can mention its path in the ticket description.
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/src/webscraper.py#L1-L12
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/README.md#L4-L18
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/src/rawyhtmlscraper.py#L1-L69
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/src/singleproduct.py#L1-L32
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/src/scraping.py#L1-L83
- https://github.com/Hardeepex/webscraper/blob/63f2a98b42c0dcb2cdd7af44f819ce7cecc98b56/src/selenium_grid.py#L1-L14

I also found the following external resources that might be helpful:

**Summaries of links found in the content:**

http://www.rei.com: The page titled "Access Denied" states that the user does not have permission to access "http://www.rei.com/". The page also includes a reference number: #18.52a7cb17.1704091515.73bf2f49.
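
The block page summarized above can be recognized programmatically, which would let the scraper fail loudly instead of parsing the error page as if it were real content. The following is a minimal sketch, assuming such pages are served with HTTP status 403 or carry the title "Access Denied" (as in the output quoted in this issue). `is_access_denied` and `_TitleParser` are hypothetical names and are not part of the repository.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def is_access_denied(status_code, html):
    """Return True if the response looks like a block page.

    Assumption: a block page has status 403 or the title
    "Access Denied", matching the output quoted in this issue.
    """
    if status_code == 403:
        return True
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip().lower() == "access denied"
```

With `requests`, the check could be applied as `is_access_denied(response.status_code, response.text)` before the HTML is handed to BeautifulSoup.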

Step 2: ⌨️ Coding

--- 
+++ 
@@ -1,10 +1,20 @@
 import requests
+from src.selenium_grid import get_webdriver
 from bs4 import BeautifulSoup

 def scrape(url):
-    response = requests.get(url)
-    soup = BeautifulSoup(response.text, 'html.parser')
+    try:
+        driver = get_webdriver()
+        driver.get(url)
+        page_source = driver.page_source
+    except Exception as e:
+        print("Failed to get page using WebDriver. Instructions for building and running a"
+              " Selenium Grid Docker container can be found in the README.md file.")
+        print(str(e))
+        return None
+    else:
+        soup = BeautifulSoup(page_source, 'html.parser')
     return soup

 if __name__ == "__main__":
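
One thing the change above leaves open is releasing the browser session: the diff as shown does not call `driver.quit()`, so each call to `scrape` would likely leave a session open on the grid. Below is a minimal sketch of the same flow with cleanup, assuming the driver comes from a factory callable. In the PR this would be `get_webdriver` from `src/selenium_grid.py`, whose contents are not shown here. `fetch_page_source` is a hypothetical name.

```python
def fetch_page_source(url, driver_factory):
    """Load `url` in a WebDriver session and return the page HTML, or None.

    `driver_factory` is any zero-argument callable returning an object
    with Selenium's `get`, `page_source` and `quit` members (assumption:
    `get_webdriver` from the PR fits this shape).
    """
    driver = None
    try:
        driver = driver_factory()
        driver.get(url)
        return driver.page_source
    except Exception as exc:  # broad on purpose, mirroring the diff
        print(f"Failed to get page using WebDriver: {exc}")
        return None
    finally:
        # Runs on both success and failure, so the session is always released.
        if driver is not None:
            driver.quit()
```

The caller would then pass a non-`None` result to `BeautifulSoup(page_source, 'html.parser')`, as in the diff.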

Ran GitHub Actions for 7d65b1bbd14d1d0395fefd997d8bdcea064c6f15:


Step 3: 🔁 Code Review

I have finished reviewing the code for completeness. I did not find errors for sweep/tried_to_run_the_scraper_but_got_the_err.

