Closed shntnu closed 5 years ago
Fails on AppVeyor https://ci.appveyor.com/project/shntnu/cytominer-database/build/1.0.203/job/l0me0e7gl4axo88g
'(exceptions.ValueError) could not convert string to float: na'
but passes on Travis
interesting - this fails now with an error:
looks like we will need to configure pd.read_sql() a bit more.
@shntnu @bethac07 is "na" typically how a missing value will be coded in the .csv files?
tests/test_ingest.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:392: in read_sql
parse_dates=parse_dates, columns=columns, chunksize=chunksize)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:1039: in read_table
chunksize=chunksize)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:731: in read
self._harmonize_columns(parse_dates=parse_dates)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:853: in _harmonize_columns
self.frame[col_name] = df_col.astype(col_type, copy=False)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/generic.py:5691: in astype
**kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/managers.py:531: in astype
return self.apply('astype', dtype=dtype, **kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/managers.py:395: in apply
applied = getattr(b, f)(**kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/blocks.py:534: in astype
**kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/blocks.py:633: in _astype
values = astype_nansafe(values.ravel(), dtype, copy=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arr = array([1.0536865809980278, 1.1503667249090357, 1.0698111464164657,
1.03... 1.108436713089547, 1.395106670685756, 1.2661806217001421],
dtype=object)
dtype = dtype('float64'), copy = True, skipna = False
def astype_nansafe(arr, dtype, copy=True, skipna=False):
"""
Cast the elements of an array to a given dtype a nan-safe manner.
Parameters
----------
arr : ndarray
dtype : np.dtype
copy : bool, default True
If False, a view will be attempted but may fail, if
e.g. the item sizes don't align.
skipna: bool, default False
Whether or not we should skip NaN when casting as a string-type.
Raises
------
ValueError
The dtype was a datetime64/timedelta64 dtype, but it had no unit.
"""
# dispatch on extension dtype if needed
if is_extension_array_dtype(dtype):
return dtype.construct_array_type()._from_sequence(
arr, dtype=dtype, copy=copy)
if not isinstance(dtype, np.dtype):
dtype = pandas_dtype(dtype)
if issubclass(dtype.type, text_type):
# in Py3 that's str, in Py2 that's unicode
return lib.astype_unicode(arr.ravel(),
skipna=skipna).reshape(arr.shape)
elif issubclass(dtype.type, string_types):
return lib.astype_str(arr.ravel(),
skipna=skipna).reshape(arr.shape)
elif is_datetime64_dtype(arr):
if is_object_dtype(dtype):
return tslib.ints_to_pydatetime(arr.view(np.int64))
elif dtype == np.int64:
return arr.view(dtype)
# allow frequency conversions
if dtype.kind == 'M':
return arr.astype(dtype)
raise TypeError("cannot astype a datetimelike from [{from_dtype}] "
"to [{to_dtype}]".format(from_dtype=arr.dtype,
to_dtype=dtype))
elif is_timedelta64_dtype(arr):
if is_object_dtype(dtype):
return tslibs.ints_to_pytimedelta(arr.view(np.int64))
elif dtype == np.int64:
return arr.view(dtype)
# in py3, timedelta64[ns] are int64
if ((PY3 and dtype not in [_INT64_DTYPE, _TD_DTYPE]) or
(not PY3 and dtype != _TD_DTYPE)):
# allow frequency conversions
# we return a float here!
if dtype.kind == 'm':
mask = isna(arr)
result = arr.astype(dtype).astype(np.float64)
result[mask] = np.nan
return result
elif dtype == _TD_DTYPE:
return arr.astype(_TD_DTYPE, copy=copy)
raise TypeError("cannot astype a timedelta from [{from_dtype}] "
"to [{to_dtype}]".format(from_dtype=arr.dtype,
to_dtype=dtype))
elif (np.issubdtype(arr.dtype, np.floating) and
np.issubdtype(dtype, np.integer)):
if not np.isfinite(arr).all():
raise ValueError('Cannot convert non-finite values (NA or inf) to '
'integer')
elif is_object_dtype(arr):
# work around NumPy brokenness, #1987
if np.issubdtype(dtype.type, np.integer):
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
# if we have a datetime/timedelta array of objects
# then coerce to a proper dtype and recall astype_nansafe
elif is_datetime64_dtype(dtype):
from pandas import to_datetime
return astype_nansafe(to_datetime(arr).values, dtype, copy=copy)
elif is_timedelta64_dtype(dtype):
from pandas import to_timedelta
return astype_nansafe(to_timedelta(arr).values, dtype, copy=copy)
if dtype.name in ("datetime64", "timedelta64"):
msg = ("The '{dtype}' dtype has no unit. "
"Please pass in '{dtype}[ns]' instead.")
raise ValueError(msg.format(dtype=dtype.name))
if copy or is_object_dtype(arr) or is_object_dtype(dtype):
# Explicit copy, or required since NumPy can't view from / to object.
> return arr.astype(dtype, copy=True)
E ValueError: could not convert string to float: na
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/dtypes/cast.py:702: ValueError
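One way to make the read side tolerant of a stray "na" is to coerce the column after loading it, rather than letting the float cast fail. The snippet below is only a sketch under assumptions: the in-memory database and the table name `cells` are made up for illustration, and it goes through a plain `sqlite3` connection with a query instead of the SQLAlchemy table read shown in the traceback. It is not what this thread ended up doing (the test data was changed instead).

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the ingested Cells data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (AreaShape_Compactness REAL)")
# SQLite stores the unparseable string "na" as TEXT even in a REAL column.
conn.executemany(
    "INSERT INTO cells VALUES (?)",
    [(1.0536865809980278,), ("na",), (1.1503667249090357,)],
)

df = pd.read_sql("SELECT * FROM cells", conn)

# Convert to numeric, turning anything that is not a number into NaN
# instead of raising ValueError.
df["AreaShape_Compactness"] = pd.to_numeric(
    df["AreaShape_Compactness"], errors="coerce"
)
conn.close()
```

Note that `errors="coerce"` silently turns every unparseable value into a missing value, so it may hide genuinely malformed data.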
It should be NaN
Actually, appears to be nan in CP 3+
great! Thanks Beth!
I changed na to nan in 70184d383dca649d6259b9c385267ac180a239e3 using:
import pandas as pd

# Replace the "na" entry in the test data with "nan"
file = "tests/data_b/B01-2/Cells.csv"
df = pd.read_csv(file)
df.loc[4, "AreaShape_Compactness"] = "nan"
df.to_csv(file, sep=',', index=False)
This seems to have worked, although I am not quite sure why (haven't searched much yet). Note that the change in 70184d383dca649d6259b9c385267ac180a239e3 seems to have changed the number of significant digits in some features, which is why it looks like the entire csv file was updated.
> Note that the change in 70184d3 seems to have updated the amount of significant digits in some features, which is why it seems like the entire csv file was updated.
I reverted the change, then remade it by hand, to reduce diffs. Hope that's ok @gwaygenomics
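A scripted edit can also keep the diff small if every cell is read back as text, so that no float gets re-serialized. This is a sketch, not the approach used in the commit: the helper name `replace_cell` and the sample CSV content are hypothetical, while `dtype=str` and `keep_default_na=False` are standard `pd.read_csv` options.

```python
import io

import pandas as pd

# Hypothetical miniature of a Cells.csv file.
original = (
    "ImageNumber,AreaShape_Compactness\n"
    "1,1.0536865809980278\n"
    "2,na\n"
    "3,1.10\n"
)


def replace_cell(csv_text, row, column, value):
    """Replace one cell while leaving every other value as its original text."""
    # dtype=str keeps numbers as the exact strings found in the file;
    # keep_default_na=False stops pandas from turning markers such as
    # "nan" or empty cells into missing values on read.
    df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
    df.loc[row, column] = value
    return df.to_csv(index=False)


updated = replace_cell(original, 1, "AreaShape_Compactness", "nan")
```

Because nothing is parsed as a float, values such as `1.10` are written back unchanged and only the edited line differs.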
> This seems to have worked, although I am not quite sure why (haven't searched much yet)
IIRC it worked ok on most machines.
Also this: https://github.com/cytomining/cytominer-database/pull/104#issuecomment-391873764
So this might be tough to reproduce. I verified this worked fine on sqlite v3.23.1 on OSX and 3.8.2 on linux 3.13.0-151-generic. It passes on Travis. I'd say that's good enough, given that this error is hard to reproduce.
> I reverted the change, then remade it by hand, to reduce diffs. Hope that's ok @gwaygenomics
Sounds great! Thanks @shntnu
> This seems to have worked, although I am not quite sure why (haven't searched much yet)
I think it is just that pandas knows to convert nan to a missing value, but not na. Since CellProfiler outputs nan, I think this is good to merge!
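The difference between the two strings can be reproduced without a database. The sketch below assumes the column arrives as an object array mixing floats and a string, as in the traceback; the helper name `can_cast_to_float` is made up for illustration.

```python
import numpy as np


def can_cast_to_float(values):
    """Return True if an object array of these values casts to float64."""
    arr = np.array(values, dtype=object)
    try:
        # Casting an object array to float64 converts each element in turn,
        # much like calling float() on it.
        arr.astype(np.float64)
        return True
    except ValueError:
        return False


# "nan" is a spelling that float() understands, so the cast succeeds.
print(can_cast_to_float([1.0536865809980278, "nan"]))  # True

# "na" is not a valid float literal, so the cast raises ValueError.
print(can_cast_to_float([1.0536865809980278, "na"]))  # False
```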
I have a hunch that this https://stackoverflow.com/questions/15569745/store-nan-values-in-sqlite-database will become relevant at some point, so I am adding the reference here.
Fixes #103
https://github.com/cytomining/cytominer-database/pull/104#issuecomment-511440383 explains why this worked without needing to change any code