ageron / handson-ml3

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[BUG] Chapter 2: Section "Download the data", buggy implementation for load_housing_data() function #156

Open ali-moameri opened 2 months ago

ali-moameri commented 2 months ago

The implementation of load_housing_data() is as follows:

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

With this implementation, if datasets/housing.tgz exists, the function skips both the download and the extraction and goes straight to reading datasets/housing/housing.csv. But datasets/housing.tgz may exist while datasets/housing/housing.csv doesn't. In that case the code fails with a FileNotFoundError. The correct implementation should look like this:

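import tarfile
from pathlib import Path

import pandas as pd
import requests

def load_housing_data():
    tarfile_path = Path("datasets/housing.tgz")

    # Download the tarball only if it is not already on disk
    if not tarfile_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        response = requests.get("https://github.com/ageron/data/raw/main/housing.tgz")
        response.raise_for_status()  # don't save an HTTP error page as the tarball
        with open(tarfile_path, "wb") as f:
            f.write(response.content)

    # Always extract, so housing.csv is restored if it was deleted
    with tarfile.open(tarfile_path) as housing_tarball:
        housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))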

If datasets/housing.tgz exists, extract it and then read the CSV. If it doesn't, download it, extract it, and then read the CSV.
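
One possible refinement, offered only as a sketch: the version above re-extracts the archive on every call. An alternative is to key the check on the CSV itself, so the download and the extraction each run only when needed. The helper name ensure_housing_csv and its data_dir and url parameters are hypothetical (they are not in the notebook), and the helper uses only the standard library:

```python
import tarfile
import urllib.request
from pathlib import Path

# Same URL as in the notebook's load_housing_data()
HOUSING_URL = "https://github.com/ageron/data/raw/main/housing.tgz"


def ensure_housing_csv(data_dir="datasets", url=HOUSING_URL):
    """Return the path to housing.csv, downloading/extracting only if needed.

    Hypothetical helper: the name and parameters are illustrative.
    """
    data_dir = Path(data_dir)
    tarball_path = data_dir / "housing.tgz"
    csv_path = data_dir / "housing" / "housing.csv"

    # Check for the file we actually want to read, not for the tarball
    if not csv_path.is_file():
        if not tarball_path.is_file():
            data_dir.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(url, tarball_path)
        # Reached whenever the CSV is missing, even if the tarball was already there
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path=data_dir)
    return csv_path
```

With this helper, load_housing_data() could reduce to return pd.read_csv(ensure_housing_csv()). Deleting the housing folder then just triggers a fresh extraction on the next call.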

Naseef03 commented 1 month ago

I ran into the same issue. If you delete the housing folder, the code throws an error at the read_csv call.
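
For anyone who wants to reproduce this without touching their real datasets folder, here is a rough sketch. It builds a tiny stand-in tarball in a temporary directory and mirrors the control flow of the original function. The function name original_logic is hypothetical, the download branch is stubbed out, and Path.read_text() stands in for pd.read_csv() so the snippet needs only the standard library:

```python
import io
import tarfile
import tempfile
from pathlib import Path


def original_logic(data_dir):
    """Mirror the original control flow (hypothetical stand-in, no download)."""
    tarball_path = data_dir / "housing.tgz"
    if not tarball_path.is_file():
        raise RuntimeError("download-and-extract branch, not exercised here")
    # As in the original, nothing is extracted when the tarball already exists.
    # read_text() stands in for pd.read_csv().
    return (data_dir / "housing" / "housing.csv").read_text()


data_dir = Path(tempfile.mkdtemp())

# A small stand-in archive: present on disk, but its contents were never extracted
# (equivalent to deleting the housing folder after a first run)
payload = b"longitude,latitude\n-122.23,37.88\n"
with tarfile.open(data_dir / "housing.tgz", "w:gz") as tar:
    info = tarfile.TarInfo("housing/housing.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

try:
    original_logic(data_dir)
    error = None
except FileNotFoundError as exc:
    error = exc

print(type(error).__name__)  # prints "FileNotFoundError"
```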