NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License
8.45k stars 1.32k forks

BUG: LayoutLMv3 finetuning on FUNSD Notebook; data preprocessing features #369

Closed Davo00 closed 3 months ago

Davo00 commented 7 months ago

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb

ValueError: Arrow type extension<arrow.py_extension_type<pyarrow.lib.UnknownExtensionType>> does not have a datasets dtype equivalent.

Caused by:

from datasets import Array2D, Array3D, Features, Sequence, Value

# we need to define custom features for `set_format` (used later on) to work properly
features = Features({
    'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    'input_ids': Sequence(feature=Value(dtype='int64')),
    'attention_mask': Sequence(Value(dtype='int64')),
    'bbox': Array2D(dtype="int64", shape=(512, 4)),
    'labels': Sequence(feature=Value(dtype='int64')),
})

NielsRogge commented 7 months ago

Hi,

I just ran the notebook, but it's working for me. Maybe you need to update the Datasets version?

Davo00 commented 7 months ago

Hey @NielsRogge, thanks for the quick response. I started a fresh env and installed everything with pip. I tried Python 3.9 and 3.10. Which version do you use? Here is some info about my installed versions:

pip show datasets

Name: datasets
Version: 2.14.6
Summary: HuggingFace community-driven open-source library of datasets

pip list

Package Version
accelerate 0.24.1
aiohttp 3.8.6
aiosignal 1.3.1
anyio 3.5.0
appnope 0.1.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.5
async-timeout 4.0.3
attrs 23.1.0
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 4.1.0
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 3.3.2
comm 0.1.2
datasets 2.14.6
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
entrypoints 0.4
exceptiongroup 1.0.4
executing 0.8.3
fastjsonschema 2.16.2
filelock 3.13.1
frozenlist 1.4.0
fsspec 2023.10.0
huggingface-hub 0.17.3
idna 3.4
importlib-metadata 6.0.0
IProgress 0.4
ipykernel 6.25.0
ipython 8.15.0
ipython-genutils 0.2.0
jedi 0.18.1
Jinja2 3.1.2
joblib 1.3.2
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 7.4.9
jupyter_core 5.5.0
jupyter-server 1.23.4
jupyterlab-pygments 0.1.2
MarkupSafe 2.1.1
matplotlib-inline 0.1.6
mistune 2.0.4
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.15
nbclassic 1.0.0
nbclient 0.8.0
nbconvert 7.10.0
nbformat 5.9.2
nest-asyncio 1.5.6
networkx 3.2.1
notebook 6.5.4
notebook_shim 0.2.3
numpy 1.26.1
packaging 23.1
pandas 2.1.3
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.3
platformdirs 3.10.0
prometheus-client 0.14.1
prompt-toolkit 3.0.36
psutil 5.9.0
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.1
pycparser 2.21
Pygments 2.15.1
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 23.2.0
referencing 0.30.2
regex 2023.10.3
requests 2.31.0
rpds-py 0.10.6
safetensors 0.4.0
scikit-learn 1.3.2
scipy 1.11.3
Send2Trash 1.8.2
seqeval 1.2.2
setuptools 68.0.0
six 1.16.0
sniffio 1.2.0
soupsieve 2.5
stack-data 0.2.0
sympy 1.12
terminado 0.17.1
threadpoolctl 3.2.0
tinycss2 1.2.1
tokenizers 0.14.1
torch 2.1.0
tornado 6.3.3
tqdm 4.66.1
traitlets 5.7.1
transformers 4.36.0.dev0
typing_extensions 4.7.1
tzdata 2023.3
urllib3 2.0.7
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
wheel 0.41.2
xxhash 3.4.1
yarl 1.9.2
zipp 3.11.0

NielsRogge commented 7 months ago

I just used Google Colab :)

Davo00 commented 7 months ago

I just used Google Colab :)

You are right, it works in Colab. I really can't understand why the same code wouldn't work on my setup. Could it be an M1 issue? Does it try to use "MPS" as the device? Anyway, I guess I should move this issue to the datasets repo, right?
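
Since the notebook runs in Colab but fails locally, one way to narrow this down might be to print the versions of the relevant packages in both environments and compare them. Below is a minimal sketch, assuming the difference comes from package versions (nothing in this thread confirms that). The package list and the helper names `installed_versions` and `diff_versions` are hypothetical, chosen only for illustration.

```python
from importlib import metadata

# Packages that plausibly matter for this error.
# This list is an assumption, not a confirmed set of culprits.
PACKAGES = ("datasets", "pyarrow", "transformers", "torch", "numpy")


def installed_versions(packages=PACKAGES):
    """Return {package: version string}, with None for packages that are not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(local, other):
    """Return {package: (local_version, other_version)} for packages whose versions differ."""
    return {
        name: (local.get(name), other.get(name))
        for name in sorted(set(local) | set(other))
        if local.get(name) != other.get(name)
    }


if __name__ == "__main__":
    # Run this in both environments and compare the printed output.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once locally and once in Colab, then passing the two resulting dictionaries to `diff_versions`, would show which of these packages differ between the two setups.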