DFY-NCSU / slash-phase6

Slash is a web application that scrapes the most popular e-commerce websites for the best deals so that you can get the best value for your money!
MIT License

Target Scraper Not Retrieving Information from Target Web Pages #13

Closed IMYXR closed 2 days ago

IMYXR commented 2 days ago

Describe the bug The Target scraper program fails to retrieve information from Target's web pages. Attempts to access page data result in empty or null values, indicating that the scraper may not be correctly parsing or handling Target's HTML structure or access restrictions.

To Reproduce Steps to reproduce the behavior:

  1. Run the Target scraper program with the intended URL(s).
  2. Observe that no relevant page data is returned.

Expected behavior The scraper should retrieve and display the desired product or page information from Target's website.

Actual Behavior The scraper outputs null or empty data fields, failing to retrieve page information as expected.

Possible Causes

Environment

Additional context Information about recent changes in Target's website structure, or about known blocking mechanisms, could help in diagnosing and resolving the issue. Suggestions for updating the parsing logic, implementing user-agent rotation, or handling bot detection would also be helpful.
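One of the mitigations suggested above, user-agent rotation, can be sketched as follows. This is a minimal illustration, not the repository's actual scraper code; the user-agent strings and the `next_headers` helper are placeholders:

```python
from itertools import cycle

# Placeholder pool of desktop user-agent strings; a real scraper would
# keep a larger, regularly refreshed list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0",
]

_ua_pool = cycle(USER_AGENTS)

def next_headers():
    """Return request headers with the next user-agent from the pool."""
    return {
        "User-Agent": next(_ua_pool),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Each call to `next_headers()` yields a different user-agent, which the scraper can pass along as `requests.get(url, headers=next_headers())`.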

IMYXR commented 2 days ago

Environment:

```
altair==4.2.2
anyio==3.3.4
asgiref==3.6.0
astor==0.8.1
attrs==23.1.0
base58==2.1.1
bcrypt==3.2.0
beautifulsoup4==4.10.0
blinker==1.6.2
cachetools==5.3.1
certifi==2021.10.8
cffi==1.16.0
charset-normalizer==2.0.7
click==7.1.2
CurrencyConverter==0.17.11
ebaysdk==2.2.0
ecdsa==0.18.0
entrypoints==0.4
fastapi==0.70.0
gitdb==4.0.10
GitPython==3.1.36
greenlet==3.0.0
h11==0.14.0
idna==3.3
importlib-metadata==6.8.0
Jinja2==3.1.2
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
lxml==4.9.3
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
nest-asyncio==1.5.1
numpy==1.26.0
packaging==23.1
pandas==2.1.0
passlib==1.7.4
Pillow==10.0.1
protobuf==3.20.1
psycopg2-binary==2.9.3
pyarrow==13.0.0
pyasn1==0.5.0
pycparser==2.21
pydantic==1.8.2
pydeck==0.8.1b0
Pygments==2.16.1
PyMySQL==1.0.2
pyshorteners==1.0.1
python-dateutil==2.8.2
python-jose==3.3.0
python-multipart==0.0.5
pytz==2023.3.post1
referencing==0.30.2
requests==2.31.0
rich==13.6.0
rpds-py==0.10.3
rsa==4.9
six==1.16.0
smmap==5.0.0
sniffio==1.2.0
soupsieve==2.2.1
SQLAlchemy==1.4.32
starlette==0.16.0
streamlit==1.27.2
tabulate==0.8.9
tenacity==8.2.3
toml==0.10.2
toolz==0.12.0
tornado==6.3.3
typing_extensions==4.8.0
tzdata==2023.3
tzlocal==5.0.1
urllib3==1.26.7
uvicorn==0.15.0
validators==0.22.0
watchdog==3.0.0
zipp==3.17.0
```

IMYXR commented 2 days ago

Description

The static Target scraper, designed to retrieve information from Target's web pages using a combination of static scraping and Selenium for automated browser simulation, is unable to successfully fetch valid data. The Target website appears to detect the automated scraping, leading to anti-bot measures that return fake data instead of the expected page content.

Steps to Reproduce

  1. Run the static scraper program on a Target page URL using Selenium to automate browser behavior.
  2. Observe the returned data, which does not match the actual content on Target's webpage, indicating the application of anti-bot techniques.

Expected Behavior

The scraper should retrieve and display accurate information from Target's webpage, including product details or relevant page content.

Environment

Same as the previous comment, with selenium added.

Actual Behavior

The data returned by the scraper is manipulated and does not match the expected information, indicating that Target’s anti-bot system has detected the scraping activity and is delivering false or placeholder data.
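A simple sanity check can flag responses like this before they propagate further through the pipeline. This is a hedged sketch, not the project's actual code; the field names `title` and `price` are assumptions about the scraper's result format:

```python
def looks_blocked(item):
    """Heuristically flag a scraped record as likely anti-bot filler.

    Returns True when required fields are missing, empty, or zeroed out,
    matching the null/placeholder data described in this issue.
    """
    if not item:
        return True
    title = item.get("title")
    price = item.get("price")
    # Empty or whitespace-only titles suggest the page content was withheld.
    if not title or not str(title).strip():
        return True
    # Null or zero prices are a common placeholder pattern.
    if price in (None, "", 0, "0", "$0.00"):
        return True
    return False
```

Records that fail this check can be logged and retried instead of being stored as if they were genuine results.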

Possible Causes

Additional Context

Efforts to modify the headers, use different proxies, or implement a delay between requests have not resolved the issue, suggesting Target's anti-bot system is robust and may require a different approach.
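The delay strategy mentioned above can be made less regular with randomized exponential backoff, since fixed intervals between requests are themselves a bot signal. This is a sketch under the assumption that request pacing is part of the fix; the numbers are illustrative:

```python
import random

def backoff_delays(base=2.0, factor=2.0, max_delay=60.0, jitter=0.5):
    """Yield an infinite sequence of randomized, exponentially growing delays."""
    delay = base
    while True:
        # Add +/- jitter so the pacing does not form a detectable pattern.
        yield min(max_delay, delay) + random.uniform(-jitter, jitter)
        delay = min(max_delay, delay * factor)
```

A scraper would `time.sleep(next(delays))` between requests and start a fresh generator after a successful fetch, so that delays only grow while the site keeps refusing real content.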

Proposed Solution (Optional)

Suggestions to overcome anti-bot measures might include:

IMYXR commented 2 days ago

Resolved this problem with the code in #14.