allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

PII didn't work #217

Closed: wannaphong closed this issue 1 week ago

wannaphong commented 1 week ago

Hello! Thank you for providing the pipeline. I tried using Dolma to remove PII, but it didn't work as expected. Could you help me troubleshoot this?

taggers: pii_regex_v1
python: 3.10

Traceback (most recent call last):
  File "/usr/local/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/workspace/python/dolma/cli/main.py", line 93, in main
    return cli.run_from_args(args=args, config=config)
  File "/workspace/python/dolma/cli/__init__.py", line 188, in run_from_args
    parsed_config = namespace_to_nested_omegaconf(
  File "/workspace/python/dolma/cli/__init__.py", line 150, in namespace_to_nested_omegaconf
    merged_config = om.merge(base_structured_config, untyped_config)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/omegaconf.py", line 274, in merge
    target.merge_with(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 507, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/base.py", line 229, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 826, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 804, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 502, in merge_with
    self._merge_with(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 530, in _merge_with
    BaseContainer._map_merge(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 401, in _map_merge
    dest.__setitem__(key, src_node)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 312, in __setitem__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/base.py", line 229, in _format_and_raise
    format_and_raise(
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 906, in format_and_raise
    _raise(ex, cause)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/_utils.py", line 804, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 306, in __setitem__
    self.__set_impl(key=key, value=value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/dictconfig.py", line 316, in __set_impl
    self._set_item_impl(key, value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/basecontainer.py", line 628, in _set_item_impl
    self.__dict__["_content"][key]._set_value(value)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 618, in _set_value
    raise e
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 614, in _set_value
    self._set_value_impl(value, flags)
  File "/usr/local/lib/python3.10/dist-packages/omegaconf/listconfig.py", line 646, in _set_value_impl
    raise ValidationError(msg)
omegaconf.errors.ValidationError: Invalid value assigned: AnyNode is not a ListConfig, list or tuple.
    full_key: taggers
    object_type=TaggerConfig

My config json:

{
    "documents": [
      "./dataset/documents/data.jsonl.gz"
    ],
    "taggers": "pii_regex_v1",
    "processes": 1
}
soldni commented 1 week ago

You are specifying the path to documents incorrectly. It should be dataset/documents/data.jsonl.gz, not ./dataset/documents/data.jsonl.gz.

wannaphong commented 1 week ago

@soldni I still get the same error.

Traceback (most recent call last):
  File "/workspace/dolmavenv/bin/dolma", line 8, in <module>
    sys.exit(main())
  File "/workspace/python/dolma/cli/main.py", line 93, in main
    return cli.run_from_args(args=args, config=config)
  File "/workspace/python/dolma/cli/__init__.py", line 188, in run_from_args
    parsed_config = namespace_to_nested_omegaconf(
  File "/workspace/python/dolma/cli/__init__.py", line 150, in namespace_to_nested_omegaconf
    merged_config = om.merge(base_structured_config, untyped_config)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/omegaconf.py", line 273, in merge
    target.merge_with(*configs[1:])
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 492, in merge_with
    self._format_and_raise(key=None, value=None, cause=e)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/_utils.py", line 819, in format_and_raise
    _raise(ex, cause)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 490, in merge_with
    self._merge_with(*others)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 514, in _merge_with
    BaseContainer._map_merge(self, other)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 401, in _map_merge
    dest.__setitem__(key, src_node)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 314, in __setitem__
    self._format_and_raise(key=key, value=value, cause=e)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 308, in __setitem__
    self.__set_impl(key=key, value=value)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/dictconfig.py", line 318, in __set_impl
    self._set_item_impl(key, value)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/basecontainer.py", line 604, in _set_item_impl
    self.__dict__["_content"][key]._set_value(value)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/listconfig.py", line 618, in _set_value
    raise e
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/listconfig.py", line 614, in _set_value
    self._set_value_impl(value, flags)
  File "/workspace/dolmavenv/lib/python3.10/site-packages/omegaconf/listconfig.py", line 646, in _set_value_impl
    raise ValidationError(msg)
omegaconf.errors.ValidationError: Invalid value assigned: AnyNode is not a ListConfig, list or tuple.
    full_key: taggers
    object_type=TaggerConfig

requirements.txt


aiobotocore==2.15.1
aiohappyeyeballs==2.4.0
aiohttp==3.10.6
aioitertools==0.12.0
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyascii==0.3.2
asttokens==2.4.1
async-timeout==4.0.3
attrs==24.2.0
babel==2.16.0
backports-datetime-fromisoformat==2.0.2
beautifulsoup4==4.12.3
black==24.10.0
blingfire==0.1.8
blis==0.7.11
boto3==1.35.23
botocore==1.35.23
Brotli==1.1.0
cached_path==1.6.3
cachetools==5.5.0
catalogue==2.0.10
cchardet==2.1.7
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.20.0
confection==0.1.5
courlan==1.3.1
cymem==2.0.8
dateparser==1.2.0
decorator==5.1.1
detect-secrets==1.4.0
dolma
exceptiongroup==1.2.2
executing==2.1.0
fasttext-wheel==0.9.2
FastWARC==0.14.9
faust-cchardet==2.1.19
filelock==3.13.4
flake8==7.1.1
flake8-pyi==24.9.0
Flake8-pyproject==1.2.3
frozenlist==1.4.1
fsspec==2024.9.0
google-api-core==2.20.0
google-auth==2.35.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
htmldate==1.9.1
huggingface-hub==0.23.5
idna==3.10
iniconfig==2.0.0
ipdb==0.13.13
ipython==8.28.0
isort==5.13.2
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
joblib==1.4.2
jusText==3.0.1
langcodes==3.4.1
langdetect==1.0.9
language_data==1.2.0
lingua-language-detector==2.0.2
LTpycld2==0.42
lxml==5.3.0
lxml_html_clean==0.3.1
marisa-trie==1.2.1
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mccabe==0.7.0
mdurl==0.1.2
msgspec==0.18.6
multidict==6.1.0
murmurhash==1.0.10
mypy==1.13.0
mypy-extensions==1.0.0
necessary==0.4.3
nltk==3.9.1
numpy==1.26.4
omegaconf==2.3.0
packaging==24.1
parso==0.8.4
patchelf==0.17.2.1
pathspec==0.12.1
pexpect==4.9.0
phonenumbers==8.13.48
platformdirs==4.3.6
pluggy==1.5.0
preshed==3.0.9
presidio-analyzer==2.2.32
prompt_toolkit==3.0.48
proto-plus==1.24.0
protobuf==5.28.3
ptyprocess==0.7.0
pure_eval==0.2.3
py3langid==0.2.2
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybind11==2.13.6
pycodestyle==2.12.1
pydantic==2.9.2
pydantic_core==2.23.4
pyflakes==3.2.0
Pygments==2.18.0
PyICU==2.13.1
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
requests-file==2.1.0
requirements-parser==0.11.0
Resiliparse==0.14.9
rich==13.8.1
rsa==4.9
s3fs==2024.9.0
s3transfer==0.10.2
shellingham==1.5.4
six==1.16.0
smart-open==7.0.4
soupsieve==2.6
spacy==3.7.5
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
stack-data==0.6.3
thinc==8.2.5
tld==0.13
tldextract==5.1.2
tokenizers==0.19.1
tomli==2.0.2
tqdm==4.66.5
trafilatura==1.12.2
traitlets==5.14.3
typer==0.12.5
types-setuptools==75.1.0.20240917
typing_extensions==4.12.2
tzdata==2024.2
tzlocal==5.2
uniseg==0.8.1
url-normalize==1.4.3
urllib3==2.2.3
w3lib==2.2.1
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
wrapt==1.16.0
yarl==1.12.1
zstandard==0.23.0
wannaphong commented 1 week ago

I can run bff_duplicate_docs, but dedupe_paragraphs and the PII tagger don't work.
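
Both tracebacks in this thread end in the same ValidationError, with `full_key: taggers` and the message "AnyNode is not a ListConfig, list or tuple", while the posted config sets `"taggers"` to the plain string `"pii_regex_v1"`. One reading of that message is that the field expects a list, for example `["pii_regex_v1"]`. The sketch below is a minimal pre-flight check under that assumption. It uses only the standard library and does not call dolma. The helper names and the `LIST_FIELDS` tuple are hypothetical, and nothing in this thread confirms that dolma accepts the resulting config.

```python
import json

# Assumption: these fields are list-valued. "taggers" is inferred from the
# error message ("full_key: taggers ... is not a ListConfig, list or tuple");
# "documents" is already a list in the posted config.
LIST_FIELDS = ("documents", "taggers")


def find_scalar_list_fields(config: dict) -> list[str]:
    """Return the names of assumed list fields that hold a non-list value."""
    return [
        name
        for name in LIST_FIELDS
        if name in config and not isinstance(config[name], (list, tuple))
    ]


def wrap_scalar_list_fields(config: dict) -> dict:
    """Return a copy of the config with each offending value wrapped in a list."""
    fixed = dict(config)
    for name in find_scalar_list_fields(config):
        fixed[name] = [config[name]]
    return fixed


# The config as posted in the issue.
posted = json.loads("""
{
    "documents": ["./dataset/documents/data.jsonl.gz"],
    "taggers": "pii_regex_v1",
    "processes": 1
}
""")

# Report which fields look wrong, then print a candidate corrected config.
print(find_scalar_list_fields(posted))
print(json.dumps(wrap_scalar_list_fields(posted), indent=4))
```

Running the check on a config file before invoking the CLI points at the offending key directly, which is quicker than reading it out of the OmegaConf merge traceback.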