DFY-NCSU / slash-phase6

Slash is a web application that scrapes the most popular e-commerce websites for the best deals so that you can get the best value for your money!
MIT License

Target Scraper Not Retrieving Information from Target Web Pages #13

Closed IMYXR closed 2 days ago

IMYXR commented 2 days ago

Describe the bug The Target scraper program fails to retrieve information from Target's web pages. Attempts to access page data result in empty or null values, indicating that the scraper may not be correctly parsing or handling Target's HTML structure or access restrictions.

To Reproduce Steps to reproduce the behavior:

  1. Run the Target scraper program with the intended URL(s).
  2. Observe that no relevant page data is returned.

Expected behavior The scraper should retrieve and display the desired product or page information from Target's website.

Actual Behavior The scraper outputs null or empty data fields, failing to retrieve page information as expected.

Possible Causes

Environment

Additional context Information about recent changes in Target's website structure, or about known blocking mechanisms, could help in diagnosing and resolving the issue. Suggestions for updating the parsing logic, implementing user-agent rotation, or handling bot detection would also be helpful.
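One of the mitigations suggested above, user-agent rotation, can be sketched as follows. This is a minimal illustration, not the repository's actual scraper code; the user-agent strings and the `next_headers` helper are placeholders:

```python
from itertools import cycle

# Placeholder pool of desktop user-agent strings; a real scraper would
# keep a larger, regularly refreshed list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0",
]

_ua_pool = cycle(USER_AGENTS)

def next_headers():
    """Return request headers with the next user-agent from the pool."""
    return {
        "User-Agent": next(_ua_pool),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Each call to `next_headers()` yields a different user-agent, which the scraper can pass along as `requests.get(url, headers=next_headers())`.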

IMYXR commented 2 days ago

Environment:

```
altair==4.2.2
anyio==3.3.4
asgiref==3.6.0
astor==0.8.1
attrs==23.1.0
base58==2.1.1
bcrypt==3.2.0
beautifulsoup4==4.10.0
blinker==1.6.2
cachetools==5.3.1
certifi==2021.10.8
cffi==1.16.0
charset-normalizer==2.0.7
click==7.1.2
CurrencyConverter==0.17.11
ebaysdk==2.2.0
ecdsa==0.18.0
entrypoints==0.4
fastapi==0.70.0
gitdb==4.0.10
GitPython==3.1.36
greenlet==3.0.0
h11==0.14.0
idna==3.3
importlib-metadata==6.8.0
Jinja2==3.1.2
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
lxml==4.9.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
nest-asyncio==1.5.1
numpy==1.26.0
packaging==23.1
pandas==2.1.0
passlib==1.7.4
Pillow==10.0.1
protobuf==3.20.1
psycopg2-binary==2.9.3
pyarrow==13.0.0
pyasn1==0.5.0
pycparser==2.21
pydantic==1.8.2
pydeck==0.8.1b0
Pygments==2.16.1
PyMySQL==1.0.2
pyshorteners==1.0.1
python-dateutil==2.8.2
python-jose==3.3.0
python-multipart==0.0.5
pytz==2023.3.post1
referencing==0.30.2
requests==2.31.0
rich==13.6.0
rpds-py==0.10.3
rsa==4.9
six==1.16.0
smmap==5.0.0
sniffio==1.2.0
soupsieve==2.2.1
SQLAlchemy==1.4.32
starlette==0.16.0
streamlit==1.27.2
tabulate==0.8.9
tenacity==8.2.3
toml==0.10.2
toolz==0.12.0
tornado==6.3.3
typing_extensions==4.8.0
tzdata==2023.3
tzlocal==5.0.1
urllib3==1.26.7
uvicorn==0.15.0
validators==0.22.0
watchdog==3.0.0
zipp==3.17.0
```

IMYXR commented 2 days ago

Description

The static Target scraper, designed to retrieve information from Target's web pages using a combination of static scraping and Selenium for automated browser simulation, is unable to successfully fetch valid data. The Target website appears to detect the automated scraping, leading to anti-bot measures that return fake data instead of the expected page content.

Steps to Reproduce

  1. Run the static scraper program on a Target page URL using Selenium to automate browser behavior.
  2. Observe the returned data, which does not match the actual content on Target's webpage, indicating the application of anti-bot techniques.

Expected Behavior

The scraper should retrieve and display accurate information from Target's webpage, including product details or relevant page content.

Environment

Same as the previous comment, with selenium added.

Actual Behavior

The data returned by the scraper is manipulated and does not match the expected information, indicating that Target’s anti-bot system has detected the scraping activity and is delivering false or placeholder data.
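A simple sanity check can flag responses like this before they propagate further through the pipeline. This is a hedged sketch, not the project's actual code; the field names `title` and `price` are assumptions about the scraper's result format:

```python
def looks_blocked(item):
    """Heuristically flag a scraped record as likely anti-bot filler.

    Returns True when required fields are missing, empty, or zeroed out,
    matching the null/placeholder data described in this issue.
    """
    if not item:
        return True
    title = item.get("title")
    price = item.get("price")
    # Empty or whitespace-only titles suggest the page content was withheld.
    if not title or not str(title).strip():
        return True
    # Null or zero prices are a common placeholder pattern.
    if price in (None, "", 0, "0", "$0.00"):
        return True
    return False
```

Records that fail this check can be logged and retried instead of being stored as if they were genuine results.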

Possible Causes

Additional Context

Efforts to modify the headers, use different proxies, or implement a delay between requests have not resolved the issue, suggesting Target's anti-bot system is robust and may require a different approach.
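The delay strategy mentioned above can be made less regular with randomized exponential backoff, since fixed intervals between requests are themselves a bot signal. This is a sketch under the assumption that request pacing is part of the fix; the numbers are illustrative:

```python
import random

def backoff_delays(base=2.0, factor=2.0, max_delay=60.0, jitter=0.5):
    """Yield an infinite sequence of randomized, exponentially growing delays."""
    delay = base
    while True:
        # Add +/- jitter so the pacing does not form a detectable pattern.
        yield min(max_delay, delay) + random.uniform(-jitter, jitter)
        delay = min(max_delay, delay * factor)
```

A scraper would `time.sleep(next(delays))` between requests and start a fresh generator after a successful fetch, so that delays only grow while the site keeps refusing real content.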

Proposed Solution (Optional)

Suggestions to overcome anti-bot measures might include:

IMYXR commented 2 days ago

Resolved this problem with the code in #14.