Building-ML-Pipelines / building-machine-learning-pipelines

Code repository for the O'Reilly publication "Building Machine Learning Pipelines" by Hannes Hapke & Catherine Nelson
MIT License

ValueError: Usecols do not match columns, columns expected but not found: ['company_response_to_consumer', 'zipcode', 'consumer_disputed?'] #4

Closed: snehankekre closed this issue 4 years ago

snehankekre commented 4 years ago

Bug

Setting up the demo project by following the instructions fails with a ValueError.

System details

Steps to reproduce

!pip install tfx
!git clone https://github.com/Building-ML-Pipelines/building-machine-learning-pipelines.git
!cd building-machine-learning-pipelines/;python3 utils/download_dataset.py

INFO:root:Started
INFO:root:Data folder created.
INFO:urllib3.poolmanager:Redirecting http://bit.ly/building-ml-pipelines-dataset -> https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:urllib3.poolmanager:Redirecting https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF -> https://doc-0o-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/s9hu87rhvef8qlae21p9rreoda7auml3/1594723575000/06616860426990197454/*/1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF?e=download
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:root:Download completed.
Traceback (most recent call last):
  File "utils/download_dataset.py", line 131, in <module>
    update_csv()
  File "utils/download_dataset.py", line 101, in update_csv
    df = pd.read_csv(LOCAL_FILE_NAME, usecols=feature_cols)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 448, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 880, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1114, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1937, in __init__
    _validate_usecols_names(usecols, self.orig_names)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1233, in _validate_usecols_names
    "Usecols do not match columns, "
ValueError: Usecols do not match columns, columns expected but not found: ['company_response_to_consumer', 'zipcode', 'consumer_disputed?']

Cause

On line 101 of utils/download_dataset.py, pd.read_csv is called with usecols=feature_cols, so pandas looks for every name in feature_cols in the header of the consumer_complaints_with_narrative.csv dataset. The columns 'company_response_to_consumer', 'zipcode', and 'consumer_disputed?' are not present, so pandas raises a ValueError. This is a plain column-name mismatch. The dataset actually contains the following column names:

df.columns
Index(['product', 'sub_product', 'issue', 'sub_issue',
       'consumer_complaint_narrative', 'company', 'state', 'zip_code',
       'company_response', 'timely_response', 'consumer_disputed'],
      dtype='object')
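
The mismatch can be confirmed without loading the full dataset by reading only the header row and comparing it with the expected names. The following is a minimal sketch, not code from the repository: the helper name missing_usecols and the in-memory sample CSV are hypothetical, and the sample header simply copies the column names listed above.

```python
import io

import pandas as pd


def missing_usecols(csv_source, expected_cols):
    """Return the expected column names that are absent from the CSV header."""
    # nrows=0 parses only the header row, so this stays cheap for large files.
    header = pd.read_csv(csv_source, nrows=0).columns
    return [col for col in expected_cols if col not in header]


# Hypothetical in-memory stand-in for the dataset, using the reported header.
sample_csv = io.StringIO(
    "product,sub_product,issue,sub_issue,consumer_complaint_narrative,"
    "company,state,zip_code,company_response,timely_response,consumer_disputed\n"
    "Debt collection,,Disclosure,,some text,Acme,CA,90210,Closed,Yes,No\n"
)

# A subset of names, including the three that triggered the ValueError.
expected = [
    "product",
    "zipcode",
    "company_response_to_consumer",
    "consumer_disputed?",
]

missing = missing_usecols(sample_csv, expected)
print(missing)
```

Running a check like this before the real pd.read_csv call would list the offending names up front instead of surfacing them through a traceback.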

Fix

Update the column names in feature_cols and remove lines 103-110 of utils/download_dataset.py. In other words, lines 88 through 110 can be replaced with the following:

feature_cols = [
    "product",
    "sub_product",
    "issue",
    "sub_issue",
    "state",
    "zip_code",
    "company",
    "company_response",
    "timely_response",
    "consumer_disputed",
    "consumer_complaint_narrative",
]
df = pd.read_csv(LOCAL_FILE_NAME, usecols=feature_cols)
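
If the hosted file's header changes again, a more tolerant loader could accept either naming scheme. The sketch below is an illustration only and is not what the repository does. It relies on pandas accepting a callable for usecols, which selects columns by predicate rather than raising when a name is absent. The function load_features and the LEGACY_NAMES mapping are hypothetical, and the pairing of old to new names is an assumption inferred from the error message and the header shown in this issue.

```python
import io

import pandas as pd

FEATURE_COLS = [
    "product",
    "sub_product",
    "issue",
    "sub_issue",
    "state",
    "zip_code",
    "company",
    "company_response",
    "timely_response",
    "consumer_disputed",
    "consumer_complaint_narrative",
]

# Assumed mapping from the older header names to the current ones.
LEGACY_NAMES = {
    "company_response_to_consumer": "company_response",
    "zipcode": "zip_code",
    "consumer_disputed?": "consumer_disputed",
}


def load_features(csv_source):
    """Load the feature columns, accepting either header naming scheme."""
    accepted = set(FEATURE_COLS) | set(LEGACY_NAMES)
    # A callable usecols keeps matching columns and silently skips the rest.
    df = pd.read_csv(csv_source, usecols=lambda name: name in accepted)
    df = df.rename(columns=LEGACY_NAMES)
    missing = [col for col in FEATURE_COLS if col not in df.columns]
    if missing:
        raise ValueError(f"CSV is missing expected columns: {missing}")
    # Return the columns in a fixed, predictable order.
    return df[FEATURE_COLS]


# Hypothetical CSV using the older header names plus one unrelated column.
legacy_csv = io.StringIO(
    "product,sub_product,issue,sub_issue,consumer_complaint_narrative,"
    "company,state,zipcode,company_response_to_consumer,timely_response,"
    "consumer_disputed?,extra\n"
    "Debt collection,,Disclosure,,some text,Acme,CA,90210,Closed,Yes,No,x\n"
)

df = load_features(legacy_csv)
print(list(df.columns))
```

The trade-off is that a callable usecols no longer fails fast on a missing column, which is why the sketch performs its own check after renaming.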

Expected output

!cd building-machine-learning-pipelines/;python3 utils/download_dataset.py

INFO:root:Started
INFO:root:Data folder already existed.
INFO:urllib3.poolmanager:Redirecting http://bit.ly/building-ml-pipelines-dataset -> https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:urllib3.poolmanager:Redirecting https://drive.google.com/uc?export=download&id=1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF -> https://doc-0o-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/10faglfko9lihkhoq7mugfqlen9c30lu/1594725450000/06616860426990197454/*/1VHjb8L8n2d6eLz_lA-F-bk6Z0UecHpEF?e=download
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
INFO:root:Download completed.
INFO:root:CSV header updated and rewritten to data/tmp_consumer_complaints_with_narrative.csv
INFO:root:Finished

hanneshapke commented 4 years ago

Thank you for reporting this issue and for providing the PR. We have merged your PR and also updated the download method to address the SSL warnings. Please reopen the issue if you run into further problems with the download script.