Address SQLite handling of na

shntnu commented 6 years ago

Fixes #103

https://github.com/cytomining/cytominer-database/pull/104#issuecomment-511440383 explains why this worked without needing to change any code

shntnu commented 6 years ago

Fails on AppVeyor https://ci.appveyor.com/project/shntnu/cytominer-database/build/1.0.203/job/l0me0e7gl4axo88g

'(exceptions.ValueError) could not convert string to float: na'

but passes on Travis

gwaybio commented 5 years ago

Interesting, this now fails with the error shown below.

It looks like we will need to configure pd.read_sql() a bit more.

@shntnu @bethac07 is na typically how a missing value is coded in the .csv files?

tests/test_ingest.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:392: in read_sql
    parse_dates=parse_dates, columns=columns, chunksize=chunksize)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:1039: in read_table
    chunksize=chunksize)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:731: in read
    self._harmonize_columns(parse_dates=parse_dates)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/io/sql.py:853: in _harmonize_columns
    self.frame[col_name] = df_col.astype(col_type, copy=False)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/generic.py:5691: in astype
    **kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/managers.py:531: in astype
    return self.apply('astype', dtype=dtype, **kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/managers.py:395: in apply
    applied = getattr(b, f)(**kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/blocks.py:534: in astype
    **kwargs)
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/internals/blocks.py:633: in _astype
    values = astype_nansafe(values.ravel(), dtype, copy=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
arr = array([1.0536865809980278, 1.1503667249090357, 1.0698111464164657,
       1.03... 1.108436713089547, 1.395106670685756, 1.2661806217001421],
      dtype=object)
dtype = dtype('float64'), copy = True, skipna = False
    def astype_nansafe(arr, dtype, copy=True, skipna=False):
        """
        Cast the elements of an array to a given dtype a nan-safe manner.

        Parameters
        ----------
        arr : ndarray
        dtype : np.dtype
        copy : bool, default True
            If False, a view will be attempted but may fail, if
            e.g. the item sizes don't align.
        skipna: bool, default False
            Whether or not we should skip NaN when casting as a string-type.

        Raises
        ------
        ValueError
            The dtype was a datetime64/timedelta64 dtype, but it had no unit.
        """

        # dispatch on extension dtype if needed
        if is_extension_array_dtype(dtype):
            return dtype.construct_array_type()._from_sequence(
                arr, dtype=dtype, copy=copy)

        if not isinstance(dtype, np.dtype):
            dtype = pandas_dtype(dtype)

        if issubclass(dtype.type, text_type):
            # in Py3 that's str, in Py2 that's unicode
            return lib.astype_unicode(arr.ravel(),
                                      skipna=skipna).reshape(arr.shape)

        elif issubclass(dtype.type, string_types):
            return lib.astype_str(arr.ravel(),
                                  skipna=skipna).reshape(arr.shape)

        elif is_datetime64_dtype(arr):
            if is_object_dtype(dtype):
                return tslib.ints_to_pydatetime(arr.view(np.int64))
            elif dtype == np.int64:
                return arr.view(dtype)

            # allow frequency conversions
            if dtype.kind == 'M':
                return arr.astype(dtype)

            raise TypeError("cannot astype a datetimelike from [{from_dtype}] "
                            "to [{to_dtype}]".format(from_dtype=arr.dtype,
                                                     to_dtype=dtype))

        elif is_timedelta64_dtype(arr):
            if is_object_dtype(dtype):
                return tslibs.ints_to_pytimedelta(arr.view(np.int64))
            elif dtype == np.int64:
                return arr.view(dtype)

            # in py3, timedelta64[ns] are int64
            if ((PY3 and dtype not in [_INT64_DTYPE, _TD_DTYPE]) or
                    (not PY3 and dtype != _TD_DTYPE)):

                # allow frequency conversions
                # we return a float here!
                if dtype.kind == 'm':
                    mask = isna(arr)
                    result = arr.astype(dtype).astype(np.float64)
                    result[mask] = np.nan
                    return result
            elif dtype == _TD_DTYPE:
                return arr.astype(_TD_DTYPE, copy=copy)

            raise TypeError("cannot astype a timedelta from [{from_dtype}] "
                            "to [{to_dtype}]".format(from_dtype=arr.dtype,
                                                     to_dtype=dtype))

        elif (np.issubdtype(arr.dtype, np.floating) and
              np.issubdtype(dtype, np.integer)):

            if not np.isfinite(arr).all():
                raise ValueError('Cannot convert non-finite values (NA or inf) to '
                                 'integer')

        elif is_object_dtype(arr):

            # work around NumPy brokenness, #1987
            if np.issubdtype(dtype.type, np.integer):
                return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)

            # if we have a datetime/timedelta array of objects
            # then coerce to a proper dtype and recall astype_nansafe

            elif is_datetime64_dtype(dtype):
                from pandas import to_datetime
                return astype_nansafe(to_datetime(arr).values, dtype, copy=copy)
            elif is_timedelta64_dtype(dtype):
                from pandas import to_timedelta
                return astype_nansafe(to_timedelta(arr).values, dtype, copy=copy)

        if dtype.name in ("datetime64", "timedelta64"):
            msg = ("The '{dtype}' dtype has no unit. "
                   "Please pass in '{dtype}[ns]' instead.")
            raise ValueError(msg.format(dtype=dtype.name))

        if copy or is_object_dtype(arr) or is_object_dtype(dtype):
            # Explicit copy, or required since NumPy can't view from / to object.
>           return arr.astype(dtype, copy=True)
E           ValueError: could not convert string to float: na
../../../virtualenv/python2.7.14/lib/python2.7/site-packages/pandas/core/dtypes/cast.py:702: ValueError
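
For reference, the failure at the bottom of this traceback can be reproduced in isolation. The sketch below is a minimal illustration, not the project's test: the table name `cells` and the inserted values are made up, and it assumes the column was declared REAL. Under SQLite's type affinity rules, text that does not look like a number is stored as TEXT even in a REAL column, so the string comes back unchanged and the cast to float64 fails.

```python
import sqlite3

import numpy as np

# Hypothetical in-memory table standing in for the ingested Cells table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (AreaShape_Compactness REAL)")

# "1.05" looks numeric, so REAL affinity converts it; "na" does not, so it stays TEXT.
conn.executemany("INSERT INTO cells VALUES (?)", [("1.05",), ("na",)])
rows = conn.execute(
    "SELECT AreaShape_Compactness, typeof(AreaShape_Compactness) FROM cells"
).fetchall()

# Mimic the cast in the traceback: an object array forced to float64.
values = np.array([row[0] for row in rows], dtype=object)
error = None
try:
    values.astype(np.float64)
except ValueError as exc:
    error = str(exc)

print(rows)
print(error)
```

This only shows the mechanism for a single bad value; it does not explain why the same test passed on some machines and failed on others.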

bethac07 commented 5 years ago

It should be NaN

bethac07 commented 5 years ago

Actually, it appears to be nan in CellProfiler (CP) 3+

gwaybio commented 5 years ago

great! Thanks Beth!

gwaybio commented 5 years ago

I changed na to nan in 70184d383dca649d6259b9c385267ac180a239e3 using:

import pandas as pd

# Overwrite the entry in row 4 with "nan", then rewrite the CSV in place.
file = "tests/data_b/B01-2/Cells.csv"
df = pd.read_csv(file)
df.loc[4, "AreaShape_Compactness"] = "nan"
df.to_csv(file, sep=',', index=False)

This seems to have worked, although I am not quite sure why (haven't searched much yet). Note that the change in 70184d383dca649d6259b9c385267ac180a239e3 seems to have updated the number of significant digits in some features, which is why it seems like the entire csv file was updated.
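
One way to make this kind of edit without touching the other values is to keep every cell as a string, so nothing is parsed to float and written back out. This is a sketch under the assumption that the digit changes came from the float round trip through pandas; the sample rows are invented, and only the column name comes from the snippet above.

```python
import io

import pandas as pd

# Invented two-row sample; the real file is tests/data_b/B01-2/Cells.csv.
csv_text = "ImageNumber,AreaShape_Compactness\n1,1.0536865809980278\n1,na\n"

# dtype=str keeps the digits exactly as written; keep_default_na=False stops
# pandas from turning any cell into a missing value while reading.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# Replace only the cells that are exactly "na".
column = "AreaShape_Compactness"
df.loc[df[column] == "na", column] = "nan"

out = io.StringIO()
df.to_csv(out, index=False)
result = out.getvalue()
print(result)
```

With this approach the diff should be limited to the cells that were actually changed, which is the same outcome as editing the file by hand.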

shntnu commented 5 years ago

> Note that the change in 70184d3 seems to have updated the number of significant digits in some features, which is why it seems like the entire csv file was updated.

I reverted the change, then remade it by hand, to reduce diffs. Hope that's ok @gwaygenomics

shntnu commented 5 years ago

> This seems to have worked, although I am not quite sure why (haven't searched much yet)

IIRC it worked ok on most machines.

Also this: https://github.com/cytomining/cytominer-database/pull/104#issuecomment-391873764

So this might be tough to reproduce. I verified that this worked fine with SQLite v3.23.1 on OSX and v3.8.2 on Linux 3.13.0-151-generic. It also passes on Travis. I'd say that's good enough, given that this error is hard to reproduce.

gwaybio commented 5 years ago

> I reverted the change, then remade it by hand, to reduce diffs. Hope that's ok @gwaygenomics

Sounds great! Thanks @shntnu

gwaybio commented 5 years ago

> This seems to have worked, although I am not quite sure why (haven't searched much yet)

I think it is just that pandas knows to convert nan to a missing value, but not na. Since CellProfiler outputs nan, I think this is good to merge!
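
A quick way to check this explanation is sketched below, with made-up one-column CSVs. It relies on two things: lowercase "nan" is among the strings pandas.read_csv treats as missing by default while lowercase "na" is not, and Python's float() accepts "nan" but rejects "na". Which of the two applies in the actual ingest path is not established in this thread.

```python
import io
import math

import pandas as pd

# "nan" is recognized as missing, so the column parses as float.
df_nan = pd.read_csv(io.StringIO("x\n1.5\nnan\n"))

# "na" is not recognized, so the whole column stays as strings.
df_na = pd.read_csv(io.StringIO("x\n1.5\nna\n"))

# The same asymmetry exists for a plain float cast.
nan_is_float = math.isnan(float("nan"))
try:
    float("na")
    na_is_float = True
except ValueError:
    na_is_float = False

print(df_nan["x"].isna().tolist(), df_na["x"].tolist())
print(nan_is_float, na_is_float)
```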

shntnu commented 4 years ago

I have a hunch that https://stackoverflow.com/questions/15569745/store-nan-values-in-sqlite-database will become relevant at some point, so I am adding the reference here.
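
For context on that reference, here is a minimal sketch of what happens when a float NaN is bound to a SQLite column through Python's sqlite3 module. The table name `t` is hypothetical. SQLite has no stored NaN value for bound doubles, so the value is saved as NULL and read back as None.

```python
import sqlite3

# Hypothetical single-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")

# Bind a real float NaN (not the string "nan").
conn.execute("INSERT INTO t VALUES (?)", (float("nan"),))

stored = conn.execute("SELECT x, typeof(x) FROM t").fetchone()
print(stored)
```

One consequence worth keeping in mind is that a column declared NOT NULL would reject such a value, so missing measurements and NOT NULL constraints do not mix.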
