Describe the bug
from nemo_curator.download import download_common_crawl
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/NeMo-Curator/nemo_curator/download/__init__.py", line 16, in <module>
    from .commoncrawl import (
  File "/opt/NeMo-Curator/nemo_curator/download/commoncrawl.py", line 21, in <module>
    import justext
  File "/usr/local/lib/python3.10/dist-packages/justext/__init__.py", line 12, in <module>
    from .core import justext
  File "/usr/local/lib/python3.10/dist-packages/justext/core.py", line 21, in <module>
    from lxml.html.clean import Cleaner
  File "/usr/local/lib/python3.10/dist-packages/lxml/html/clean.py", line 18, in <module>
    raise ImportError(
ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.
Steps/Code to reproduce bug
I used the Dockerfile from https://github.com/NVIDIA/NeMo-Curator/issues/25 to build the image and created a container successfully. But when I run from nemo_curator.download import download_common_crawl inside the container, it fails with the ImportError shown above.
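
For anyone hitting the same thing, here is a small diagnostic sketch. It is not verified against this image. The ImportError itself says to install lxml[html_clean] or lxml_html_clean, so the snippet only checks whether that module is visible to the interpreter inside the container. The helper name missing_modules is hypothetical and is not part of NeMo-Curator or lxml.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # lxml_html_clean is the package named in the ImportError message.
    missing = missing_modules(["lxml_html_clean"])
    if missing:
        print("Missing:", ", ".join(missing))
        # Both install targets are taken from the ImportError text itself.
        print("Try: pip install lxml_html_clean  (or: pip install 'lxml[html_clean]')")
    else:
        print("lxml_html_clean is installed.")
```

If the module is reported missing, installing it in the container (or adding the install to the Dockerfile) is what the error message suggests. Whether that is enough for the rest of the nemo_curator import to succeed is an assumption I have not tested.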