capitalone / DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
https://capitalone.github.io/DataProfiler
Apache License 2.0

Add Python 3.11 to GHA #1090

Closed (gliptak closed 3 months ago)

taylorfturner commented 7 months ago

@gliptak rebased onto dev so that all branches go into the same branch prior to deployment to main

gliptak commented 7 months ago

@taylorfturner consider setting dev as default UI branch

python-snappy has no Python 3.11 currently https://github.com/andrix/python-snappy/pull/129

possible replacement is https://github.com/milesgranger/cramjam/tree/master/cramjam-python

gliptak commented 7 months ago

#1091

gliptak commented 7 months ago

will rebase after #1091 merged

gliptak commented 6 months ago

https://github.com/capitalone/synthetic-data/pull/346

taylorfturner commented 6 months ago

@gliptak rebase onto dev and I'll approve

gliptak commented 6 months ago

@taylorfturner this might already be dev based

https://github.com/capitalone/DataProfiler/pull/1091 would have to be merged first

taylorfturner commented 6 months ago

@gliptak #1091 merged ... rebase this and we'll take a look. Thanks for the contribution! 🎉

gliptak commented 6 months ago

dask packaging changed? https://pypi.org/project/dask/#history

https://github.com/capitalone/DataProfiler/actions/runs/8282560184/job/22663633408?pr=1090

____ ERROR collecting dataprofiler/tests/validators/test_base_validators.py ____
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/dask/dataframe/__init__.py:22: in _dask_expr_enabled
    import dask_expr  # noqa: F401
E   ModuleNotFoundError: No module named 'dask_expr'

During handling of the above exception, another exception occurred:
dataprofiler/tests/validators/test_base_validators.py:4: in <module>
    from dask import dataframe as dd
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/dask/dataframe/__init__.py:87: in <module>
    if _dask_expr_enabled():
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/dask/dataframe/__init__.py:24: in _dask_expr_enabled
    raise ValueError("Must install dask-expr to activate query planning.")
E   ValueError: Must install dask-expr to activate query planning.
=============================== warnings summary ===============================
dataprofiler/tests/profilers/test_histogram_utils.py:35
  /home/runner/work/DataProfiler/DataProfiler/dataprofiler/tests/profilers/test_histogram_utils.py:35: PytestCollectionWarning: cannot collect test class 'TestColumn' because it has a __init__ constructor (from: dataprofiler/tests/profilers/test_histogram_utils.py)
    class TestColumn(NumericStatsMixin):

dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py:21
  /home/runner/work/DataProfiler/DataProfiler/dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py:21: PytestCollectionWarning: cannot collect test class 'TestColumn' because it has a __init__ constructor (from: dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py)
    class TestColumn(NumericStatsMixin):

dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py:34
  /home/runner/work/DataProfiler/DataProfiler/dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py:34: PytestCollectionWarning: cannot collect test class 'TestColumnWProps' because it has a __init__ constructor (from: dataprofiler/tests/profilers/test_numeric_stats_mixin_profile.py)
    class TestColumnWProps(TestColumn):

taylorfturner commented 6 months ago

@gliptak yeah, I just started seeing this yesterday due to the package change by dask on the 12th. Haven't had the bandwidth to research why. I'd imagine a simple version pin that excludes this release would be a temporary fix to unblock

taylorfturner commented 6 months ago

seeing on #1115 too from @carlsonp

gliptak commented 6 months ago

https://github.com/capitalone/DataProfiler/actions/runs/8282934358/job/22664948501

2024-03-14T15:03:11.8884631Z WARNING: dask 2024.3.0 does not provide the extra 'dask-expr'

https://github.com/dask/dask-expr/issues/968 https://github.com/dask/dask/issues/10917 https://docs.dask.org/en/stable/changelog.html#v2024-3-0
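
For anyone reproducing this locally, here is a minimal sketch of a pre-flight check that reports whether `dask_expr` is importable before `dask.dataframe` is imported. The helper name `missing_modules` is hypothetical and not part of DataProfiler or dask; it relies only on the standard library's `importlib.util.find_spec`, and the list of required names is taken from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located.

    Only top-level names are expected here; find_spec returns None
    when no module with that name is installed.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Names taken from the CI traceback: importing dask.dataframe failed
# because dask_expr was not installed alongside dask.
REQUIRED = ("dask", "dask_expr")
```

Calling `missing_modules(REQUIRED)` in the failing environment should list `dask_expr`, which would confirm the install step is the problem rather than the tests.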

gliptak commented 6 months ago

@taylorfturner corrected dask modules install

now there is a Keras(?) error https://github.com/capitalone/DataProfiler/actions/runs/8283138450/job/22665615645?pr=1090

taylorfturner commented 6 months ago

I'll have to take a look at this later. It's failing on the 3.11 check, which makes me think there is something specific to that version of Python and the dependencies / library that it doesn't like. I've seen a couple of things from back in January about TF and 3.11 incompatibility ... though it does look like 3.11 is supported by keras here

gliptak commented 5 months ago

@taylorfturner please guide on build errors (present for all Python versions)

https://github.com/capitalone/DataProfiler/actions/runs/8331331583/job/22797982039?pr=1090

dataprofiler/tests/profilers/test_profile_builder.py ..............F.... [ 82%]
...........................F.F.................F........................ [ 88%]
........F.............................FF                                 [ 91%]

taylorfturner commented 5 months ago

will do @gliptak. I have a conference this week, so I will do my best to get to it, but it might be more like early next week before I can attend to this. Thanks!

taylorfturner commented 5 months ago

Actually seeing similar errors on #1119, so this doesn't appear to be a 3.11 issue specifically, @gliptak. @abajpai15 is taking a look at this issue today

JGSweets commented 4 months ago

Some of these errors might get fixed with the upgrade to keras 3.0 in #1138

gliptak commented 4 months ago

will rebase after #1138 merged

taylorfturner commented 3 months ago

@gliptak #1138 is merged into dev.

Want to rebase and see if this works now? Thanks!

taylorfturner commented 3 months ago

definitely will need a rebase @gliptak

gliptak commented 3 months ago

https://github.com/dask/dask/issues/11038

The advised solution is to upgrade to Dask version 2024.4.1.
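
A minimal sketch of checking an installed dask against that minimum, assuming the version string is purely numeric and dot-separated (as dask's calendar versions such as `2024.3.0` are). The names `parse_calver`, `meets_minimum` and `installed_dask_ok` are hypothetical helpers, not part of DataProfiler or dask.

```python
from importlib import metadata

MIN_DASK = "2024.4.1"  # version advised in the linked dask issue


def parse_calver(version):
    """Turn '2024.4.1' into (2024, 4, 1).

    Assumes every dot-separated part is an integer; a suffix such as
    '+dev' would raise ValueError.
    """
    return tuple(int(part) for part in version.split("."))


def meets_minimum(installed, minimum=MIN_DASK):
    """True when the installed version is at least the minimum."""
    return parse_calver(installed) >= parse_calver(minimum)


def installed_dask_ok():
    """Check the dask found in this environment; None if it is not installed."""
    try:
        return meets_minimum(metadata.version("dask"))
    except metadata.PackageNotFoundError:
        return None
```

Comparing integer tuples rather than raw strings matters here, since a string comparison would order `2024.10.0` before `2024.4.1`.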

@taylorfturner am I to proceed with Dask bump as per above?

https://github.com/capitalone/DataProfiler/actions/runs/9421219740/job/25954837467?pr=1090

=========================== short test summary info ============================
ERROR dataprofiler/tests/validators/test_base_validators.py - TypeError: descriptor '__call__' for 'type' objects doesn't apply to a 'property' object
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!

gliptak commented 3 months ago

https://github.com/capitalone/DataProfiler/actions/runs/9438503984/job/25995558872?pr=1090

___________________ TestEvaluateAccuracy.test_save_conf_mat ____________________
self = <dataprofiler.tests.labelers.test_labeler_utils.TestEvaluateAccuracy testMethod=test_save_conf_mat>
mock_dataframe = <MagicMock name='DataFrame' id='140514062390544'>
mock_report = <MagicMock name='classification_report' id='140514060067088'>
    @mock.patch("dataprofiler.labelers.labeler_utils.classification_report")
    @mock.patch("pandas.DataFrame")
    def test_save_conf_mat(self, mock_dataframe, mock_report):

        # ideally mock out the actual contents written to file, but
        # would be difficult to get this completely worked out.
        expected_conf_mat = np.array(
            [
                [1, 0, 1],
                [1, 0, 0],
                [0, 1, 2],
            ]
        )
        expected_row_col_names = dict(
            columns=["pred:PAD", "pred:UNKNOWN", "pred:OTHER"],
            index=["true:PAD", "true:UNKNOWN", "true:OTHER"],
        )
>       mock_instance_df = mock.Mock(spec=pd.DataFrame)()
dataprofiler/tests/labelers/test_labeler_utils.py:255: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/unittest/mock.py:1106: in __init__
    _safe_super(CallableMixin, self).__init__(
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/unittest/mock.py:457: in __init__
    self._mock_add_spec(spec, spec_set, _spec_as_instance, _eat_self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <[AttributeError('_mock_methods') raised in repr()] Mock object at 0x7fcc6aa22410>
spec = <MagicMock name='DataFrame' id='140514062390544'>, spec_set = None
_spec_as_instance = False, _eat_self = False
    def _mock_add_spec(self, spec, spec_set, _spec_as_instance=False,
                       _eat_self=False):
        if _is_instance_mock(spec):
>           raise InvalidSpecError(f'Cannot spec a Mock object. [object={spec!r}]')
E           unittest.mock.InvalidSpecError: Cannot spec a Mock object. [object=<MagicMock name='DataFrame' id='140514062390544'>]
/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/unittest/mock.py:508: InvalidSpecError

gliptak commented 3 months ago

this is the above outstanding test fail https://github.com/python/cpython/issues/87644

https://github.com/capitalone/DataProfiler/blob/a4486940ce556a42bb804d188d2015e047dfc3c1/dataprofiler/tests/labelers/test_labeler_utils.py#L255
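
One possible way around this, sketched without pandas so it runs anywhere: keep a reference to the real class before it is patched and pass that as `spec`, so the spec is never a `MagicMock`. `Frame` and `Namespace` are hypothetical stand-ins for `pandas.DataFrame` and the patched module; this illustrates the pattern and is not necessarily the fix that ends up in the test.

```python
from unittest import mock


class Frame:
    """Hypothetical stand-in for pandas.DataFrame."""

    def to_csv(self, path):
        raise NotImplementedError


class Namespace:
    """Hypothetical stand-in for the module whose attribute gets patched."""

    Frame = Frame


# Capture the real class before any patching replaces the attribute.
REAL_FRAME = Namespace.Frame


def build_spec_mock():
    """Patch Namespace.Frame, but spec the instance on the real class."""
    with mock.patch.object(Namespace, "Frame") as patched:
        # Inside the patch, Namespace.Frame is a MagicMock, so using it
        # as a spec is what Python 3.11 rejects.
        assert isinstance(Namespace.Frame, mock.MagicMock)
        instance = mock.Mock(spec=REAL_FRAME)
        patched.return_value = instance
        # Code under test that calls Namespace.Frame() now gets `instance`.
        wired = Namespace.Frame() is instance
    return instance, wired
```

The returned mock still enforces the real class's attribute set, so a typo in a method name fails with `AttributeError` instead of passing silently.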

taylorfturner commented 3 months ago

I see. You are welcome to propose a fix for this as part of this PR (instead of a separate PR). If you get something operational, we can include it in the 0.12.0 release; otherwise, I will need to deploy without it. Thanks, @gliptak!

gliptak commented 3 months ago

@taylorfturner I rewrote the test and it ran green locally. please review

also let me know if separate bump PRs would work better (and guide on how you would like to split)

taylorfturner commented 3 months ago

Thanks, @gliptak!

gliptak commented 3 months ago

python-snappy>=0.7.1 bump is required for Python 3.11