dragnet-org / dragnet

Just the facts -- web page content extraction
MIT License

Compatibility with Newer Python Versions #107

Open mbowiewilson opened 3 years ago

mbowiewilson commented 3 years ago

Hello,

I am having trouble using dragnet with Python 3.9. In particular, I get the following error when importing dragnet:

root@2e4bbb389174:/home# python3
Python 3.9.2 (default, Feb 19 2021, 17:23:45)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dragnet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/dragnet/__init__.py", line 1, in <module>
    from dragnet.blocks import Blockifier, PartialBlock, BlockifyError
  File "dragnet/blocks.pyx", line 32, in init dragnet.blocks
  File "/usr/local/lib/python3.9/site-packages/dragnet/compat.py", line 265, in <module>
    from sklearn import __version__ as sklearn_version
  File "/usr/local/lib/python3.9/site-packages/sklearn/__init__.py", line 64, in <module>
    from .base import clone
  File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 14, in <module>
    from .utils.fixes import signature
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/__init__.py", line 14, in <module>
    from . import _joblib
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_joblib.py", line 22, in <module>
    from ..externals import joblib
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/__init__.py", line 119, in <module>
    from .parallel import Parallel
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/parallel.py", line 28, in <module>
    from ._parallel_backends import (FallbackToBackend, MultiprocessingBackend,
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 22, in <module>
    from .executor import get_memmapping_executor
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/executor.py", line 14, in <module>
    from .externals.loky.reusable_executor import get_reusable_executor
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/loky/__init__.py", line 12, in <module>
    from .backend.reduction import set_loky_pickler
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/loky/backend/reduction.py", line 125, in <module>
    from sklearn.externals.joblib.externals import cloudpickle  # noqa: F401
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/__init__.py", line 3, in <module>
    from .cloudpickle import *
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 152, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/usr/local/lib/python3.9/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py", line 133, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

After some googling, it appears that this error is related to changes introduced in Python 3.8. If my understanding of the issue is correct, would it be possible for dragnet to support newer versions of Python? Thanks in advance.
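For context, here is a minimal sketch of the Python 3.8 change that appears to be involved. It assumes the cause is the `posonlyargcount` parameter that Python 3.8 added to the `types.CodeType` constructor: code that passes arguments in the pre-3.8 positional order ends up with a `bytes` value in a slot that now expects an integer. The helper `rename_code` and the function `sample` are hypothetical names used only for illustration. The sketch shows the keyword-based `code.replace()` method, which avoids relying on the constructor's argument order.

```python
import types


def rename_code(code, new_name):
    """Copy a code object under a new name without positional CodeType(...) args."""
    # code.replace() was added in Python 3.8 and takes keyword arguments only,
    # so it does not depend on the constructor's positional argument order.
    return code.replace(co_name=new_name)


def sample():
    return 42


renamed = rename_code(sample.__code__, "renamed_sample")
rebuilt = types.FunctionType(renamed, globals())

print(renamed.co_name)
print(rebuilt())
# Present on Python 3.8+ only; this is the field that shifted the argument order.
print(hasattr(renamed, "co_posonlyargcount"))
```

This does not patch the failing import. It only illustrates why a library written against the older constructor signature can raise a `TypeError` of this kind on Python 3.8 and later.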

alexcleu commented 3 years ago

+1

b4hand commented 3 years ago

Looks like this error is actually coming from a dependency and not from dragnet directly: the traceback ends in the joblib/cloudpickle code bundled with scikit-learn. Updating that dependency might make it work. I suspect it's the dependency that is incompatible with Python 3.8.

jdddog commented 2 years ago

I fixed this for Python 3.8 by upgrading scikit-learn to 0.21.3.
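A minimal sketch of a pre-import check based on that report. It assumes that 0.21.3 is the lowest scikit-learn version that works on Python 3.8+, which is only what this thread reports, not a documented minimum. The names `parse_version`, `sklearn_too_old`, and `REPORTED_WORKING_SKLEARN` are hypothetical helpers, not part of dragnet or scikit-learn.

```python
import re
import sys
from importlib import metadata  # standard library on Python 3.8+

# Assumption: 0.21.3 is the version reported to work in this thread,
# not an officially documented minimum.
REPORTED_WORKING_SKLEARN = (0, 21, 3)


def parse_version(text):
    """Turn '0.21.3' into (0, 21, 3); a missing patch number counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def sklearn_too_old(python_version, sklearn_version):
    """True when running Python 3.8+ with scikit-learn older than the reported fix."""
    on_new_python = tuple(python_version[:2]) >= (3, 8)
    return on_new_python and parse_version(sklearn_version) < REPORTED_WORKING_SKLEARN


if __name__ == "__main__":
    try:
        installed = metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        print("scikit-learn is not installed")
    else:
        if sklearn_too_old(sys.version_info, installed):
            # The fix reported above: pip install scikit-learn==0.21.3
            print(f"scikit-learn {installed} is likely too old for this Python")
        else:
            print(f"scikit-learn {installed} passes the check")
```

The check reads the installed version from package metadata rather than importing `sklearn`, because importing it is what raises the error in the first place.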

I couldn't upgrade scikit-learn any higher because several scikit-learn modules have changed, causing problems unpickling the existing models: