adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

Error installing trafilatura on playwright focal image #94

Closed jaekunchoi closed 1 year ago

jaekunchoi commented 1 year ago

I'm getting the error below when trying to install trafilatura on the mcr.microsoft.com/playwright/python:v1.32.1-focal Docker image.

I tried many versions with no luck. Is there a way to fix this without adding much to the image size?

Building wheels for collected packages: sentence-transformers, typing, uuid, backports-datetime-fromisoformat, lit
  Building wheel for sentence-transformers (setup.py): started
  Building wheel for sentence-transformers (setup.py): finished with status 'done'
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125926 sha256=85fbd76a2c8311631cab1cf9611cf0ef12e43e06c26bbaaca0a0ad9ab4323f63
  Stored in directory: /root/.cache/pip/wheels/5e/6f/8c/d88aec621f3f542d26fac0342bef5e693335d125f4e54aeffe
  Building wheel for typing (setup.py): started
  Building wheel for typing (setup.py): finished with status 'done'
  Created wheel for typing: filename=typing-3.7.4.3-py3-none-any.whl size=26305 sha256=ec7f26377d7304b784c9a15bf2152e785604f05a42b5e5467060b10f282f16d5
  Stored in directory: /root/.cache/pip/wheels/5e/5d/01/3083e091b57809dad979ea543def62d9d878950e3e74f0c930
  Building wheel for uuid (setup.py): started
  Building wheel for uuid (setup.py): finished with status 'done'
  Created wheel for uuid: filename=uuid-1.30-py3-none-any.whl size=6478 sha256=42f6b14e52efa4385e0e1d94a2aa9481407fa95875859e5090f7c7cc64dd5465
  Stored in directory: /root/.cache/pip/wheels/1b/6c/cb/f9aae2bc97333c3d6e060826c1ee9e44e46306a178e5783505
  Building wheel for backports-datetime-fromisoformat (setup.py): started
  Building wheel for backports-datetime-fromisoformat (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [16 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/backports
      copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
      creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      running build_ext
      building 'backports._datetime_fromisoformat' extension
      creating build/temp.linux-x86_64-cpython-38
      creating build/temp.linux-x86_64-cpython-38/backports
      creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for backports-datetime-fromisoformat
  Running setup.py clean for backports-datetime-fromisoformat
  Building wheel for lit (pyproject.toml): started
  Building wheel for lit (pyproject.toml): finished with status 'done'
  Created wheel for lit: filename=lit-16.0.6-py3-none-any.whl size=93584 sha256=7eb1709c8fb581da100e3f4309e4d214a3e1db491afcc2f3aa2d8e092360fa61
  Stored in directory: /root/.cache/pip/wheels/05/ab/f1/0102fea49a41c753f0e79a1a4012417d5d7ef0f93224694472
Successfully built sentence-transformers typing uuid lit
Failed to build backports-datetime-fromisoformat
Installing collected packages: uuid, tokenizers, sentencepiece, safetensors, pytz, playwright-stealth, mpmath, lit, lambda-warmer-py, cmake, backports-datetime-fromisoformat, asyncio, urllib3, typing-extensions, typing, tqdm, tld, threadpoolctl, tabulate, sympy, soupsieve, sniffio, six, simplejson, regex, pyyaml, python-json-logger, python-dotenv, pyspellchecker, pluggy, pillow, packaging, nvidia-nvtx-cu11, nvidia-nccl-cu11, nvidia-cusparse-cu11, nvidia-curand-cu11, nvidia-cufft-cu11, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-cupti-cu11, nvidia-cublas-cu11, numpy, networkx, MarkupSafe, lxml, langcodes, joblib, jmespath, jellyfish, idna, h11, fsspec, fastapi-events, exceptiongroup, click, charset-normalizer, certifi, backports.zoneinfo, uvicorn, tzlocal, segtok, scipy, requests, python-dateutil, pydantic, nvidia-cusolver-cu11, nvidia-cudnn-cu11, nltk, mangum, justext, jinja2, filelock, courlan, beautifulsoup4, awslambdaric, anyio, yake, starlette, scikit-learn, rake-nltk, pandas, huggingface-hub, dateparser, botocore, transformers, s3transfer, htmldate, fastapi, trafilatura, boto3, triton, torch, torchvision, sentence-transformers
  Running setup.py install for backports-datetime-fromisoformat: started
  Running setup.py install for backports-datetime-fromisoformat: finished with status 'error'
  error: subprocess-exited-with-error

  × Running setup.py install for backports-datetime-fromisoformat did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      running install
      /usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-cpython-38
      creating build/lib.linux-x86_64-cpython-38/backports
      copying backports/__init__.py -> build/lib.linux-x86_64-cpython-38/backports
      creating build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      copying backports/datetime_fromisoformat/__init__.py -> build/lib.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      running build_ext
      building 'backports._datetime_fromisoformat' extension
      creating build/temp.linux-x86_64-cpython-38
      creating build/temp.linux-x86_64-cpython-38/backports
      creating build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat
      x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/usr/include/python3.8 -c backports/datetime_fromisoformat/_datetimemodule.c -o build/temp.linux-x86_64-cpython-38/backports/datetime_fromisoformat/_datetimemodule.o
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> backports-datetime-fromisoformat

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python -m pip install --upgrade pip
The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1

adbar commented 1 year ago

Hi @jaekunchoi, the backports-datetime-fromisoformat dependency of the underlying htmldate package seems to fail to build because there is no C compiler in your image (the log shows that x86_64-linux-gnu-gcc is not found).

I introduced this dependency recently. Please try installing an older version of htmldate before installing trafilatura: pip install htmldate==1.4.3

Please keep me updated.
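
Note: the diagnosis above (no C compiler in the image) can be confirmed from inside the container before running pip. The following is only a minimal sketch, not part of htmldate or trafilatura. The function name `find_c_compiler` and the list of candidate compiler names are assumptions made for illustration; the log itself only mentions `x86_64-linux-gnu-gcc`.

```python
import shutil
from typing import Iterable, Optional

# Candidate names are an assumption; the build log only shows
# 'x86_64-linux-gnu-gcc' being invoked by setuptools.
DEFAULT_COMPILERS = ("x86_64-linux-gnu-gcc", "gcc", "cc")


def find_c_compiler(candidates: Iterable[str] = DEFAULT_COMPILERS) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    compiler = find_c_compiler()
    if compiler is None:
        # Packages that ship C extensions without a prebuilt wheel
        # cannot be built from source in this environment.
        print("No C compiler found on PATH.")
    else:
        print(f"C compiler found at {compiler}")
```

If this reports that no compiler is present, any dependency that has to be built from source with a C extension will hit the same "No such file or directory" error shown in the log.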

adbar commented 1 year ago

@jaekunchoi The issue is now solved. You can either update htmldate before reinstalling trafilatura or wait for the pending trafilatura release.
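
Note: after updating, it can help to check which versions actually ended up in the image. The thread does not say which htmldate release contains the fix, so this sketch only reports what is installed and makes no version comparison. The helper name `installed_versions` is hypothetical; `importlib.metadata` is part of the standard library from Python 3.8, the version used in the log.

```python
from importlib import metadata  # standard library since Python 3.8
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names are taken from the thread.
    names = ["htmldate", "trafilatura", "backports-datetime-fromisoformat"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```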