WayneRose-95 / Metacritic_Webscraper-

A side project on creating my own first web scraper.
GNU General Public License v3.0

Dealing with Null Values in the dataset #11

Closed WayneRose-95 closed 2 years ago

WayneRose-95 commented 2 years ago

The latest version of the scraper under the branch Scraper_Update_feb_22 currently scrapes all of the pages within the fighting games genre section of Metacritic.

However, some of the data is not being scraped and appears as Null values in the output, as shown below.

(Screenshot: Git Issue #9 Missing Records)

This is happening because the affected pages do not load fast enough for the scraper to collect the correct information.

There are a handful of possible solutions for this:

  1. Run the scraper in headless mode to improve efficiency
  2. Remove the Null values via data cleaning in Pandas/Spark etc.
  3. Run a piece of code which matches each record containing Null values to its url, then re-scrape those pages. Append the re-scraped records to the end of the list of outputs.
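
Option 3 could look roughly like the sketch below. This is not code from the repository: the record layout (a dict with a "url" key that is always filled in), the `rescrape_missing` helper and the `scrape_page` callable are all assumptions made for illustration, and a stand-in scraper is used so the example runs without a browser.

```python
from typing import Callable, Dict, List, Optional

# Assumed record shape: one dict per game page; "url" is always present,
# any other field may be None when the page did not load in time.
Record = Dict[str, Optional[str]]


def has_nulls(record: Record) -> bool:
    """Return True if any field of the record is None."""
    return any(value is None for value in record.values())


def rescrape_missing(
    records: List[Record],
    scrape_page: Callable[[str], Record],
) -> List[Record]:
    """Re-scrape every record that contains a None value.

    Complete records keep their original order; re-scraped records are
    appended to the end of the output, as described in option 3.
    `scrape_page` is a hypothetical function that takes a url and
    returns a freshly scraped record.
    """
    complete = [r for r in records if not has_nulls(r)]
    retried = [scrape_page(r["url"]) for r in records if has_nulls(r)]
    return complete + retried


if __name__ == "__main__":
    # Stand-in for the real scraper, for demonstration only.
    def fake_scrape_page(url: str) -> Record:
        return {"url": url, "title": "Re-scraped title", "score": "80"}

    scraped = [
        {"url": "https://example.com/a", "title": "Game A", "score": "90"},
        {"url": "https://example.com/b", "title": None, "score": None},
    ]
    for row in rescrape_missing(scraped, fake_scrape_page):
        print(row)
```

A re-scraped page can still come back with Null values, so in practice this would probably be combined with a longer wait or a retry limit.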

WayneRose-95 commented 2 years ago

For now, the null values have been fixed by calling time.sleep() an extra time whenever a TimeoutException occurs on a page, like so:

(Screenshot: Git Issue #9 solution)
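
Since the fix itself is only available as a screenshot, here is a minimal sketch of the general idea (wait, then try again when a timeout is raised). It is not the code from the branch. The `scrape_with_retry` helper, its parameters and the default of Python's built-in `TimeoutError` are assumptions; in a Selenium scraper the exception to pass in would be `selenium.common.exceptions.TimeoutException`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def scrape_with_retry(
    scrape: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    max_attempts: int = 3,
    delay_seconds: float = 2.0,
) -> T:
    """Call `scrape`, sleeping and retrying when a timeout is raised.

    `scrape` is a hypothetical zero-argument function that collects the
    data for one page. If every attempt times out, the last exception is
    re-raised so the caller can record the page as failed instead of
    silently storing Null values.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return scrape()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Give the page extra time to load before the next attempt.
            time.sleep(delay_seconds)

    # Unreachable: the loop either returns or re-raises.
    raise RuntimeError("retry loop exited unexpectedly")
```

Capping the number of attempts keeps a page that never loads from stalling the whole run, which matters more as the volume of pages grows.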

This issue is solved for now, but it could reappear with a larger volume of data.