CartoDB / cartoframes

CARTO Python package for data scientists
BSD 3-Clause "New" or "Revised" License
250 stars 63 forks source link

Dataset download fails when reading boolean column with nulls #732

Closed andy-esch closed 5 years ago

andy-esch commented 5 years ago

Reproducible code:

from cartoframes.auth import set_default_context
from cartoframes import Dataset

set_default_context(
    base_url='https://eschbacher.carto.com',
    api_key='default_public'
)

df = Dataset.from_table('seattle_collisions').download()
StackTrace --------------------------------------------------------------------------- ValueError Traceback (most recent call last) in ----> 1 df = Dataset.from_table('seattle_collisions').download() ~/git/CartoDB/cartoframes/cartoframes/dataset.py in download(self, limit, decode_geom, retry_times) 131 table_columns = self.get_table_columns() 132 query = self._get_read_query(table_columns, limit) --> 133 self.df = self.cc.fetch(query, decode_geom=decode_geom) 134 return self.df 135 ~/git/CartoDB/cartoframes/cartoframes/context.py in fetch(self, query, decode_geom) 525 false_values=['f'], 526 index_col='cartodb_id' if 'cartodb_id' in df_types else False, --> 527 converters={'the_geom': lambda x: _decode_geom(x) if decode_geom else x}) 528 529 if decode_geom: ~/.local/share/virtualenvs/cartoframes-pRWzcuWT/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision) 700 skip_blank_lines=skip_blank_lines) 701 --> 702 return _read(filepath_or_buffer, kwds) 703 704 parser_f.__name__ = name ~/.local/share/virtualenvs/cartoframes-pRWzcuWT/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds) 433 434 try: --> 435 data = parser.read(nrows) 436 finally: 437 parser.close() ~/.local/share/virtualenvs/cartoframes-pRWzcuWT/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows) 1137 def read(self, nrows=None): 1138 nrows = _validate_integer('nrows', nrows) -> 1139 ret = self._engine.read(nrows) 1140 1141 # May alter columns / col_dict ~/.local/share/virtualenvs/cartoframes-pRWzcuWT/lib/python3.7/site-packages/pandas/io/parsers.py in read(self, nrows) 1993 def read(self, nrows=None): 1994 try: -> 1995 data = self._reader.read(nrows) 1996 except StopIteration: 1997 if self._first_chunk: pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens() pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype() ValueError: Bool column has NA values in column 28

Not sure of the cause yet, but I have a feeling that pandas bool columns cannot have null values.

andy-esch commented 5 years ago

Yeah, removing boolean columns that have postgres nulls resolves the problem. Below, I just replaced null values with TRUE. It seems that pandas wants to convert bool columns that have null values to object dtypes instead.

REMOVE_COLS = {'inattentionind', 'speeding', 'pedrownotgrnt', }
cols = ','.join(set(cc.sql_client.send('select * from seattle_collisions limit 1')['fields']) - REMOVE_COLS)
cols = cols + ',' + ','.join(f'CASE WHEN {c} is null THEN TRUE ELSE {c} END AS {c}' for c in REMOVE_COLS)

df = Dataset.from_query(f'SELECT {cols} FROM seattle_collisions').download()
Jesus89 commented 5 years ago

Hi @andy-esch.

If we use dtype='object' it works without issues. If we use the column types we end up in this situation that pandas cannot solve.

We could detect the bool columns and add them in the converters param. This may be a more elegant solution that can be extended also to numeric values.

andy-esch commented 5 years ago

What would a column look like once processed by the converter?

I believe bool is a subset of int for Python/numpy arrays, so if we get a pandas int column that has nulls we have the same problem. I believe pandas will convert the int column to a float column (which allows nans). For bools it doesn't make sense to convert to float, so agree we should look at other options.

Jesus89 commented 5 years ago

I have tested the converter approah:

converters={
    'the_geom': lambda x: decode_geometry(x) if decode_geom else x,
    'inattentionind': lambda x: bool(x),
    'speeding': lambda x: bool(x),
    'pedrownotgrnt': lambda x: bool(x),
}

It works! but it throws also this message:

/home/jesus/code/carto/Trident/cartoframes/cartoframes/context.py:534: ParserWarning: Both a converter and dtype were specified for column inattentionind - only the converter will be used
  'pedrownotgrnt': lambda x: bool(x),
/home/jesus/code/carto/Trident/cartoframes/cartoframes/context.py:534: ParserWarning: Both a converter and dtype were specified for column pedrownotgrnt - only the converter will be used
  'pedrownotgrnt': lambda x: bool(x),
/home/jesus/code/carto/Trident/cartoframes/cartoframes/context.py:534: ParserWarning: Both a converter and dtype were specified for column speeding - only the converter will be used
  'pedrownotgrnt': lambda x: bool(x),

So maybe we can remove these columns from the dtype dict.

{'cartodb_id': 'int64', 'x': 'float64', 'y': 'float64', 'objectid': 'float64', 'inckey': 'float64', 'coldetkey': 'float64', 'reportno': 'object', 'status': 'object', 'addrtype': 'object', 'intkey': 'float64', 'location': 'object', 'exceptrsncode': 'object', 'exceptrsndesc': 'object', 'severitycode': 'object', 'severitydesc': 'object', 'collisiontype': 'object', 'personcount': 'float64', 'pedcount': 'float64', 'pedcylcount': 'float64', 'vehcount': 'float64', 'injuries': 'float64', 'fatalities': 'float64', 'incdttm': 'object', 'junctiontype': 'object', 'sdot_colcode': 'float64', 'sdot_coldesc': 'object', 'inattentionind': 'bool', 'underinfl': 'bool', 'weather': 'object', 'roadcond': 'object', 'lightcond': 'object', 'pedrownotgrnt': 'bool', 'sdotcolnum': 'object', 'speeding': 'bool', 'st_colcode': 'float64', 'st_coldesc': 'object', 'seglanekey': 'float64', 'crosswalkkey': 'float64', 'hitparkedcar': 'bool', 'seriousinjuries': 'float64'}
andy-esch commented 5 years ago

bool(None) evaluates to False, so I don't know if this is the best approach. We'd ideally want bool(None) == None

Jesus89 commented 5 years ago

Aha. So basically we need to use 'object' as a type for those columns that contain data + null, but we can not detect this before reading the table. What about loading everything as an object, and after reading the table, try to apply the types? :thinking:

andy-esch commented 5 years ago

That sounds good since it seems like the most lossless approach. The more I use it, the more I realize how quirky pandas is...

Jesus89 commented 5 years ago

Mmm, I see that everything is a string or a number.

Case 1

For the bool case, if we use object type we get:

2395      NaN
3042        t
3167      NaN
6023      NaN
196536      t
197277    NaN
197593    NaN
197661    NaN
197717      t

And after astype('bool') everything is true.

Case 2

If I use object as dtype and lambda x: x if not x else bool(x) as converter the result is:

2395          
3042      True
3167          
6023          
196536    True
197277        
197593        
197661        
197717    True

And after astype('bool')

2395      False
3042       True
3167      False
6023      False
196536     True
197277    False
197593    False
197661    False
197717     True
Jesus89 commented 5 years ago

Case 3

(The solution)

  1. Exclude the bools from the dtypes
  2. Use this converter lambda x: bool(x) if x else None for each boolean column

This generates a nice column with the proper type (bool if there are no Nones and object otherwise), as if you had created it like pd.DataFrame.from_dict({'test': [None, True, False]}).

IMO, this should be the default behavior of the read_csv method in pandas.

Jesus89 commented 5 years ago

Hi @andy-esch,

I have created here a PoC with the implementation of the solution: https://github.com/CartoDB/cartoframes/pull/760. It would be nice if you can try it :)

andy-esch commented 5 years ago

Great, I'll take a look today :)

Jesus89 commented 5 years ago
from cartoframes.auth import Context
from cartoframes.data import Dataset

c = Context('https://eschbacher.carto.com')

df = Dataset.from_table('seattle_collisions', context=c).download()

print(df['pedrownotgrnt'].dtype, df['underinfl'].dtype)
print(df['pedrownotgrnt'], df['underinfl'])

object bool cartodb_id 2395 None 3042 True 3167 None 6023 None 196536 True 197277 None 197593 None 197661 None 197717 True 197884 None 198651 None 199156 True 198115 None 198363 None 198884 None 200425 None 202891 None 203890 None 204099 None 205536 None 205782 None 194477 True 197328 None 206738 None 207012 None 198164 None 15 None 32 None 344 None 515 None ... 207128 None 207129 None 207130 None 207131 None 207132 None 207133 None 207134 None 207135 None 207137 None 207138 None 207139 None 207140 None 207142 None 207143 None 207144 None 207145 None 207146 None 207147 None 207148 None 207149 None 207151 None 207153 None 207154 None 207155 True 207156 None 207159 None 207160 None 202943 None 203025 None 206093 None Name: pedrownotgrnt, Length: 10012, dtype: object cartodb_id 2395 True 3042 True 3167 True 6023 True 196536 True 197277 True 197593 True 197661 True 197717 True 197884 True 198651 True 199156 True 198115 True 198363 True 198884 True 200425 True 202891 True 203890 True 204099 True 205536 True 205782 True 194477 True 197328 True 206738 True 207012 True 198164 True 15 True 32 True 344 True 515 True ... 207128 True 207129 True 207130 True 207131 True 207132 True 207133 True 207134 True 207135 True 207137 True 207138 True 207139 True 207140 True 207142 True 207143 True 207144 True 207145 True 207146 True 207147 True 207148 True 207149 True 207151 True 207153 True 207154 True 207155 True 207156 True 207159 True 207160 True 202943 True 203025 True 206093 True Name: underinfl, Length: 10012, dtype: bool

Jesus89 commented 5 years ago

Fixed in #760