SauceCat / pydqc

python automatic data quality check toolkit
MIT License
285 stars 57 forks source link

pd.read_excel() on the output of infer_schema.xlsx returns AssertionError: #4

Open sugiantolauw opened 6 years ago

sugiantolauw commented 6 years ago

Hi There,

This looks like a great package and I was testing your package for my own automation.

So, I have created the schema of my dataframe with the infer_schema() function and it returns the .xlsx file.

When I tried to read the .xlsx schema file using pd.read_excel() function, it returns "AssertionError: "

I am not sure what is happening here.

Here is the complete error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-29-e35830cdc292> in <module>()
----> 1 test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in read_excel(io, sheet_name, header, skiprows, skip_footer, index_col, names, usecols, parse_dates, date_parser, na_values, thousands, convert_float, converters, dtype, true_values, false_values, engine, squeeze, **kwds)
    228 
    229     if not isinstance(io, ExcelFile):
--> 230         io = ExcelFile(io, engine=engine)
    231 
    232     return io._parse_excel(

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/pandas/io/excel.py in __init__(self, io, **kwds)
    292             self.book = xlrd.open_workbook(file_contents=data)
    293         elif isinstance(self._io, compat.string_types):
--> 294             self.book = xlrd.open_workbook(self._io)
    295         else:
    296             raise ValueError('Must explicitly set engine if not passing in'

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    141                 formatting_info=formatting_info,
    142                 on_demand=on_demand,
--> 143                 ragged_rows=ragged_rows,
    144                 )
    145             return bk

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    835         x12sheet = X12Sheet(sheet, logfile, verbosity)
    836         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 837         x12sheet.process_stream(zflo, heading)
    838         del zflo
    839 

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

~/anaconda3/envs/py36tensor/lib/python3.6/site-packages/xlrd/xlsx.py in do_row(self, row_elem)
    633             self.rowx = int(row_number) - 1
    634             explicit_row_number = 1
--> 635         assert 0 <= self.rowx < X12_MAX_ROWS
    636         rowx = self.rowx
    637         colx = -1

AssertionError: 

Here is what I have installed on my python: airflow==1.8.0 alembic==0.8.10 annoy==1.11.1 asn1crypto==0.24.0 attrs==17.4.0 Automat==0.6.0 awscli==1.14.32 basemap==1.0.7 bcolz==1.1.2 beautifulsoup4==4.6.0 bitarray==0.8.1 bleach==1.5.0 bokeh==0.12.14 boto==2.48.0 boto3==1.5.22 botocore==1.8.36 branca==0.2.0 bs4==0.0.1 bz2file==0.98 cachetools==2.0.1 Cartopy==0.15.1 catboost==0.6 certifi==2018.1.18 cffi==1.11.4 chardet==3.0.4 click==6.7 click-plugins==1.0.3 cliff==2.8.0 cligj==0.4.0 cloudpickle==0.5.2 cmd2==0.8.0 colorama==0.3.7 colorcet==1.0.0 configparser==3.5.0 constantly==15.1.0 croniter==0.3.20 cryptography==2.1.4 cssselect==1.0.3 cycler==0.10.0 cymem==1.31.2 Cython==0.27.3 cytoolz==0.8.2 dask==0.17.0 datashader==0.6.5 datashape==0.5.4 deap==1.2.2 decorator==4.2.1 descartes==1.1.0 dill==0.2.7.1 distributed==1.21.0 Django==1.11.8 docutils==0.14 en-core-web-sm==2.0.0 entrypoints==0.2.3 et-xmlfile==1.0.1 fastai==0.6 feather-format==0.4.0 Fiona==1.7.11.post1 Flask==0.11.1 Flask-Admin==1.4.1 Flask-Cache==0.13.1 Flask-Login==0.2.11 flask-swagger==0.2.13 Flask-WTF==0.12 folium==0.5.0 ftfy==4.4.3 funcsigs==1.0.0 future==0.15.2 gensim==3.2.0 geopandas==0.3.0 geoplot==0.1.2 gitdb2==2.0.3 GitPython==2.1.8 graphviz==0.8.2 gunicorn==19.3.0 h5py==2.7.1 heapdict==1.0.0 html5lib==0.9999999 hyperlink==17.3.1 idna==2.6 ijson==2.3 impyla==0.14.0 incremental==17.5.0 ipykernel==4.8.0 ipython==6.2.1 ipython-genutils==0.2.0 ipywidgets==7.1.1 isoweek==1.3.3 itsdangerous==0.24 JayDeBeApi==1.1.1 jdcal==1.3 jedi==0.11.1 Jinja2==2.8.1 jmespath==0.9.3 JPype1==0.6.2 jsonschema==2.6.0 jupyter==1.0.0 jupyter-client==5.2.2 jupyter-console==5.2.0 jupyter-contrib-core==0.3.3 jupyter-contrib-nbextensions==0.4.0 jupyter-core==4.4.0 jupyter-highlight-selected-word==0.1.0 jupyter-latex-envs==1.4.1 jupyter-nbextensions-configurator==0.4.0 kaggle-cli==0.12.13 Keras==2.1.3 ktext==0.31 lightgbm==2.1.0 llvmlite==0.21.0 locket==0.2.0 lockfile==0.12.2 lxml==3.8.0 Mako==1.0.7 Markdown==2.6.9 MarkupSafe==1.0 matplotlib==2.1.2 MechanicalSoup==0.8.0 missingno==0.4.0 mistune==0.8.3 mizani==0.4.4 more-itertools==4.1.0 msgpack==0.5.1 msgpack-numpy==0.4.1 msgpack-python==0.5.1 multipledispatch==0.4.9 multiprocess==0.70.5 munch==2.2.0 murmurhash==0.28.0 nbconvert==5.3.1 nbformat==4.4.0 networkx==2.1 nltk==3.2.5 nose==1.3.7 notebook==5.3.1 numba==0.36.2 numpy==1.14.0 olefile==0.45.1 opencv-python==3.4.0.12 openpyxl==2.5.0 ordereddict==1.1 packaging==16.8 palettable==3.1.0 pandas==0.22.0 pandas-summary==0.0.41 pandocfilters==1.4.2 param==1.5.1 parsel==1.3.1 parso==0.1.1 partd==0.3.8 pathlib==1.0.1 pathos==0.2.1 patsy==0.5.0 pbr==3.1.1 pexpect==4.3.1 pickleshare==0.7.4 Pillow==5.0.0 plac==0.9.6 plotly==2.3.0 plotnine==0.3.0 ply==3.10 pox==0.2.3 ppft==1.6.4.7.1 preshed==1.0.0 prettytable==0.7.2 progressbar2==3.34.3 prompt-toolkit==1.0.15 protobuf==3.4.1 psutil==5.4.3 ptyprocess==0.5.2 pyarrow==0.8.0 pyasn1==0.4.2 pyasn1-modules==0.2.1 pycparser==2.18 PyDispatcher==2.0.5 pydot==1.2.4 pydqc==0.1 pyemd==0.5.1 Pygments==2.2.0 pyodbc==4.0.22 pyOpenSSL==17.5.0 pyparsing==2.2.0 pyperclip==1.6.0 Pyphen==0.9.4 pyproj==1.9.5.1 pyshp==1.2.12 python-daemon==2.1.2 python-dateutil==2.6.1 python-editor==1.0.3 python-Levenshtein==0.12.0 python-nvd3==0.14.2 python-slugify==1.1.4 python-utils==2.2.0 pytz==2017.3 PyYAML==3.12 pyzmq==16.0.3 qtconsole==4.3.1 queuelib==1.4.2 regex==2017.4.5 requests==2.18.4 rsa==3.4.2 s3transfer==0.1.12 scikit-learn==0.19.1 scipy==1.0.0 Scrapy==1.5.0 seaborn==0.8.1 selenium==3.8.1 Send2Trash==1.4.2 service-identity==17.0.0 setproctitle==1.1.10 Shapely==1.6.4.post1 simplegeneric==0.8.1 six==1.11.0 sklearn-pandas==1.6.0 smart-open==1.5.6 smmap2==2.0.3 sortedcontainers==1.5.9 spacy==2.0.6 SQLAlchemy==1.2.2 statsmodels==0.8.0 stevedore==1.28.0 stopit==1.1.1 tabulate==0.7.7 tblib==1.3.2 tensorflow==1.4.1 tensorflow-tensorboard==0.1.5 teradata==15.10.0.21 termcolor==1.1.0 terminado==0.8.1 testpath==0.3.1 textacy==0.5.0 Theano==1.0.1 thinc==6.10.2 thrift==0.9.3 thriftpy==0.3.9 toolz==0.9.0 torchtext==0.2.1 tornado==4.5.3 TPOT==0.9.2 tqdm==4.19.5 traitlets==4.3.2 Twisted==17.9.0 ujson==1.35 Unidecode==1.0.22 update-checker==0.16 urllib3==1.22 w3lib==1.19.0 wcwidth==0.1.7 webencodings==0.5.1 Werkzeug==0.14.1 widgetsnbextension==3.1.0 wrapt==1.10.11 WTForms==2.1 xarray==0.10.0 xgboost==0.7 xlrd==1.1.0 XlsxWriter==1.0.2 zict==0.1.3 zope.deprecation==4.3.0 zope.interface==4.4.3

Let me know your thoughts and if you need more info.

Sugi

SauceCat commented 6 years ago

Hi @sugiantolauw

Seems the error is from xlrd .. Please Google "xlrd assert 0 <= self.rowx < X12_MAX_ROWS".

rohanneps commented 6 years ago

Hi @sugiantolauw,

The errors roots from the fact that the infer_schema function generates data schema as an .xlsx file containing lots of unnecessary columns. Manually selecting and deleting all the remaining unused columns solved the issue for me.

@SauceCat, Could you look into the issue, as I am not much familiar with openpyxl. Thanks,

vishaalkk commented 6 years ago

@sugiantolauw @rohanneps I was running into same issue, and for now openpyxl library helped. Since I need just first three columns, I am slicing those only.

from openpyxl import load_workbook

def iter_rows(ws):
    for row in ws.iter_rows():
        yield [cell.value for cell in row[0:3]]

def get_schema(fname):
    wb = load_workbook(fname) #from openpyxl import load_workbook
    ws = wb.active
    schema = pd.DataFrame(iter_rows(ws))
    schema.columns = schema.iloc[0]
    schema = schema.reindex(schema.index.drop(0))
    return schema
chrisgschon commented 6 years ago

You can also solve this by saving the edited output workbook as a csv and then use pd.read_csv instead.

bballamudi commented 4 years ago

@SauceCat Can you please let me know if you have a fix for this?

JingyuZHANG commented 4 years ago

you can change your code from

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx')

to

test_schema = pd.read_excel('output/data_schema_test_schema.xlsx', engine='openpyxl')