NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0

[BUG] ImportError: lxml.html.clean module is now a separate project lxml_html_clean. #26

Closed: chenrui17 closed this issue 5 months ago

chenrui17 commented 5 months ago

Describe the bug

>>> from nemo_curator.download import download_common_crawl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/NeMo-Curator/nemo_curator/download/__init__.py", line 16, in <module>
    from .commoncrawl import (
  File "/opt/NeMo-Curator/nemo_curator/download/commoncrawl.py", line 21, in <module>
    import justext
  File "/usr/local/lib/python3.10/dist-packages/justext/__init__.py", line 12, in <module>
    from .core import justext
  File "/usr/local/lib/python3.10/dist-packages/justext/core.py", line 21, in <module>
    from lxml.html.clean import Cleaner
  File "/usr/local/lib/python3.10/dist-packages/lxml/html/clean.py", line 18, in <module>
    raise ImportError(
ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.
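
The last two lines of the traceback name the missing piece: the lxml.html.clean module now ships separately as lxml_html_clean. Below is a minimal diagnostic sketch for checking this inside the container before importing nemo_curator. It assumes pip is available there, and can_import is a hypothetical helper written for illustration, not part of NeMo-Curator or jusText. It only follows the hint in the error message and is not necessarily the fix the maintainers will adopt.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if module_name imports cleanly, False on ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing package and the explicit ImportError that
        # lxml/html/clean.py raises in the traceback above.
        return False
    return True


if __name__ == "__main__":
    if can_import("lxml.html.clean"):
        print("lxml.html.clean is importable; justext should load.")
    else:
        # Package name taken from the ImportError message itself.
        print("lxml.html.clean is missing; try: pip install lxml_html_clean")
```

If the check reports the module as missing, running pip install lxml_html_clean (or pip install "lxml[html_clean]") in the container and then retrying the import should show whether the error goes away.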

Steps/Code to reproduce bug

I used the Dockerfile from https://github.com/NVIDIA/NeMo-Curator/issues/25 to build an image and created a container successfully, but when I run from nemo_curator.download import download_common_crawl, I get the error above.

ryantwolf commented 5 months ago

Yeah, we saw this too. It's a problem with jusText. I have found a workaround and will open a PR for it later.