JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

Partition on drops data containing NaN #262

Closed brl0 closed 4 years ago

brl0 commented 4 years ago

Problem description

When partitioning on a column, rows containing NaN in that column are dropped silently.

It would be nice if there were some sort of warning.
IMO, an even better solution would be to fill the NaNs (fillna) and warn that the data will differ after a round trip.

Example code (ideally copy-pastable)


import dask.dataframe as dd
import pandas as pd
import numpy as np
from functools import partial
from tempfile import TemporaryDirectory
from storefact import get_store_from_url
from kartothek.io.eager import store_dataframes_as_dataset
from kartothek.io.eager import read_table

dataset_dir = TemporaryDirectory()
store_factory = partial(get_store_from_url, f"hfs://{dataset_dir.name}")

keys = ['a', 'b', 'c', np.nan]
values = range(len(keys))
d = dict(zip(keys, values))
df = pd.DataFrame.from_dict(d, orient='index').reset_index().rename(columns={'index': 'part', 0: 'value'})

dm = store_dataframes_as_dataset(
    store_factory,
    "a_unique_dataset_identifier",
    {'df': df},
    partition_on=['part']
)

table = read_table("a_unique_dataset_identifier", store_factory, table="df")

# Fails: the row with NaN in 'part' is missing from the table that is read back.
assert len(df) == len(table)

Used versions

```
# Name        Version   Build               Channel
arrow-cpp     0.16.0    py37hd8d096e_1      conda-forge
dask          2.12.0    py_0                conda-forge
dask-core     2.12.0    py_0                conda-forge
kartothek     3.8.1     py_0                conda-forge
numpy         1.18.1    py37h8960a57_1      conda-forge
pandas        1.0.3     py37h0da4684_0      conda-forge
pyarrow       0.16.0    py37hd02d5f2_2      conda-forge
python        3.7.6     h8356626_5_cpython  conda-forge
simplekv      0.14.0    py_0                conda-forge
storefact     0.10.0    py_0                conda-forge
# (remaining packages of the conda environment omitted)
```

lr4d commented 4 years ago

Thanks for the report, @brl0. I've confirmed this on master.

The NaNs are dropped in kartothek.io_components.metapartition.MetaPartition._partition_data because of pandas' handling of nulls during groupby (see: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby-missing).
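
The pandas behaviour referred to above can be reproduced without kartothek. This is a minimal sketch that assumes only pandas and NumPy. The column names mirror the report, and none of this is kartothek's actual code:

```python
import numpy as np
import pandas as pd

# Same shape as the frame in the report: one row has NaN in the partition column.
df = pd.DataFrame({"part": ["a", "b", "c", np.nan], "value": [0, 1, 2, 3]})

# By default, groupby excludes rows whose key is null, so the NaN row
# does not end up in any group.
groups = {key: group for key, group in df.groupby("part")}
rows_in_groups = sum(len(group) for group in groups.values())

print(len(df), rows_in_groups)
```

Here `len(df)` is 4 while `rows_in_groups` is 3. pandas 1.1 and later accept `groupby(..., dropna=False)` to keep null keys; the pandas 1.0.3 listed in the report predates that option.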

@fjetter is this behavior expected? If so, we might want to document it.

fjetter commented 4 years ago

Well, it is expected if you know the implementation, but from a user/API perspective we do not advertise that we have the same restrictions as pandas.groupby.

I believe this is a valid request, since the current behaviour silently drops data. That may be fine in analysis scenarios (pandas) but not for data storage: we should not silently drop data. The very least we should do is emit a warning or, even better, raise an exception.

Filling values is tricky, and honestly I don't know how to approach this reasonably without opening us up to rather extreme feature creep or weird APIs.

I would feel most comfortable with raising in this scenario, with an exception message that suggests users take care of the filling themselves. After all, the users know best which sentinel values are appropriate for their application, and we wouldn't need to break round trips.
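
What "taking care of the filling themselves" could look like on the user side is sketched below. It assumes plain pandas; the `SENTINEL` value is hypothetical, and the store/read step is replaced by a stand-in copy rather than a real kartothek call:

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel; pick a value that cannot occur in the real data.
SENTINEL = "__missing__"

df = pd.DataFrame({"part": ["a", "b", "c", np.nan], "value": [0, 1, 2, 3]})

# Fill before storing so that no partition key is null.
to_store = df.assign(part=df["part"].fillna(SENTINEL))

# ... store `to_store` with partition_on=["part"], then read it back ...
loaded = to_store.copy()  # stand-in for the round trip through the store

# Turn the sentinel back into a missing value after reading.
restored = loaded.assign(part=loaded["part"].mask(loaded["part"] == SENTINEL))
```

Because the sentinel is chosen and reverted by the application, the library never has to guess a fill value, and the round trip stays under the user's control.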

I'd be curious how this is handled in Arrow since, technically speaking, ['a', 'b', 'c', np.nan] is not a string column (although it is recognized as such) but a mixed-type array.

fjetter commented 4 years ago

A side note on the implementation: the invariant we would like to preserve is the row count before and after partitioning. Checking this instead of nulls/NaNs is probably faster and more universal. If we detect a mismatch, we can of course suggest to the user (or even check once ourselves) that NaNs are a probable cause.
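
The row-count invariant can be sketched as follows. `partition_checked` is a hypothetical helper written for illustration with plain pandas. It is not kartothek's `_partition_data`, and the error wording is an assumption:

```python
import numpy as np
import pandas as pd


def partition_checked(df, column):
    """Split df by column and fail loudly if any rows went missing.

    Hypothetical helper for illustration; not kartothek's implementation.
    """
    parts = {key: group for key, group in df.groupby(column)}
    rows_out = sum(len(group) for group in parts.values())
    if rows_out != len(df):
        # Only look for nulls once the cheap row-count check has failed.
        n_null = int(df[column].isna().sum())
        raise ValueError(
            f"Partitioning on {column!r} lost {len(df) - rows_out} of "
            f"{len(df)} rows; {n_null} row(s) have a null value in that "
            "column. Fill them with a sentinel value before storing."
        )
    return parts
```

Comparing row counts catches any loss during partitioning, whatever its cause. The null scan runs only on the failure path, where it serves to produce a more helpful message.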