Closed Bing-su closed 1 year ago
Hello @Bing-su,
Thank you for the contribution. Looking at the PR, though, only __getstate__ is implemented, so pickle.dump may work but pickle.load looks like it will not. Also, the current __getstate__ saves only the Python-side attributes, so we need to check whether that produces a meaningful pickle dump. Could you load a pickled file back and confirm that SwTokenizer still works correctly?
https://docs.python.org/ko/3.11/library/pickle.html?highlight=pickle#object.__setstate__
If __getstate__ returns a dict, it seems that __setstate__ does not need to be defined.
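For reference, the default behavior described in the linked documentation can be checked with a toy class. This is a minimal sketch: `Config` and its `cache` attribute are hypothetical names used only for illustration and are not part of kiwipiepy.

```python
import pickle


class Config:
    """Toy class that defines __getstate__ but not __setstate__."""

    def __init__(self, name, cache=None):
        self.name = name
        self.cache = cache  # pretend this is something we do not want to save

    def __getstate__(self):
        # Return a dict. Without __setstate__, pickle applies this dict
        # to the __dict__ of the newly created instance on load.
        state = self.__dict__.copy()
        state["cache"] = None
        return state


restored = pickle.loads(pickle.dumps(Config("kiwi", cache=[1, 2, 3])))
print(restored.name, restored.cache)
```

Note that `__init__` is not called during unpickling, so anything that only `__init__` sets up (such as a native object) is not recreated by this default path.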
I added code to test/test_transformers_addon.py that pickles and unpickles the tokenizer and then runs the tests again.
(kiwi)
kiwipiepy on pickle via △ v3.27.0 via 🐍 v3.10.12 via 🅒 kiwi took 2s
❯ python .\test\test_transformers_addon.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(kiwi)
kiwipiepy on pickle via △ v3.27.0 via 🐍 v3.10.12 via 🅒 kiwi took 3s
❯
I also pickled the tokenizer with several pickle libraries and ran a test comparing the results.
import pickle
from itertools import permutations

import cloudpickle
import dill
import kiwipiepy.transformers_addon  # imported for its side effects
from transformers import AutoTokenizer

repo = "kiwi-farm/roberta-base-32k"
orig = AutoTokenizer.from_pretrained(repo)

# Dump the same tokenizer with each pickle library.
with open("pk1.pkl", "wb") as f:
    pickle.dump(orig, f)
with open("pk2.pkl", "wb") as f:
    dill.dump(orig, f)
with open("pk3.pkl", "wb") as f:
    cloudpickle.dump(orig, f)

# Load each dump back.
with open("pk1.pkl", "rb") as f:
    upk1 = pickle.load(f)
with open("pk2.pkl", "rb") as f:
    upk2 = dill.load(f)
with open("pk3.pkl", "rb") as f:
    upk3 = cloudpickle.load(f)

# Compare the attributes of every ordered pair of tokenizers.
for (tk1, tk2) in permutations([orig, upk1, upk2, upk3], 2):
    for (k, v1), (_, v2) in zip(tk1.__dict__.items(), tk2.__dict__.items()):
        if k != "_tokenizer":
            assert getattr(tk1, k) == getattr(tk2, k)
        else:
            # _tokenizer is compared through its own attributes.
            assert vars(getattr(tk1, k)) == vars(getattr(tk2, k))

print("ok!")
ok!
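@Bing-su Checking only the properties can make it look as though everything works, but I expect an error once the code path that calls into the object implemented in C++ is exercised. It seemed better for the test to call methods such as tokenizer.tokenize, so I added that to test_transformers_addon.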
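The difference between comparing attributes and comparing behavior can be sketched as below. This is not the test that was added to the repository; `WhitespaceTokenizer` and `check_roundtrip` are hypothetical stand-ins so that the example runs without kiwipiepy.

```python
import pickle


class WhitespaceTokenizer:
    """Hypothetical stand-in for the real tokenizer; it only splits on whitespace."""

    def tokenize(self, text):
        return text.split()


def check_roundtrip(obj, call):
    """Pickle and unpickle obj, then compare call(obj) with call(restored)."""
    restored = pickle.loads(pickle.dumps(obj))
    expected = call(obj)
    actual = call(restored)
    assert actual == expected, f"{actual!r} != {expected!r}"
    return restored


# Exercise a method on the restored object instead of only comparing attributes.
restored = check_roundtrip(WhitespaceTokenizer(), lambda tk: tk.tokenize("a b c"))
print(restored.tokenize("a b c"))
```

A crash inside native code would still abort the whole process rather than fail the assertion, but calling the method at least makes the problem visible in the test run.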
As expected, a segmentation fault occurs after unpickling, at the point where kiwi is used. It looks like either the C++ side needs to support pickling and unpickling the Kiwi object directly, or the SwTokenizer object needs to restore kiwi properly when it is unpickled.
Fatal Python error: Segmentation fault
Current thread 0x00007f5822561700 (most recent call first):
File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/sw_tokenizer.py", line 416 in kiwi
File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/sw_tokenizer.py", line 263 in encode
File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 303 in _make_encoded
File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 264 in _encode_plus
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2512 in encode_plus
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2176 in encode
File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 469 in tokenize
File "/__w/kiwipiepy/kiwipiepy/test/test_transformers_addon.py", line 99 in test_pickle
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/python.py", line 1788 in runtest
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 262 in <lambda>
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 341 in from_call
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 262 in call_runtest_hook
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 222 in call_and_report
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 133 in runtestprotocol
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 324 in _main
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 270 in wrap_session
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/config/__init__.py", line 167 in main
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/config/__init__.py", line 189 in console_main
File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pytest/__main__.py", line 5 in <module>
File "/opt/python/cp37-cp37m/lib/python3.7/runpy.py", line 85 in _run_code
File "/opt/python/cp37-cp37m/lib/python3.7/runpy.py", line 193 in _run_module_as_main
/__w/_temp/ff2c73ac-ace6-4011-964a-9897075bfa1d.sh: line 1: 927 Segmentation fault (core dumped) /opt/python/cp37-cp37m/bin/python -m pytest --verbose test/test_transformers_addon.py
test/test_transformers_addon.py::test_pickle
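The second option mentioned above, restoring the native object when unpickling, could look roughly like the sketch below. All names here (`NativeHandle`, `Tokenizer`, `model_path`) are hypothetical and stand in for the real C++-backed object; this is not how kiwipiepy implements it.

```python
import pickle


class NativeHandle:
    """Hypothetical stand-in for a C++-backed object that cannot be pickled."""

    def __init__(self, model_path):
        self.model_path = model_path

    def split(self, text):
        return text.split()

    def __reduce__(self):
        # Simulate a native object with no pickle support.
        raise TypeError("NativeHandle cannot be pickled")


class Tokenizer:
    def __init__(self, model_path):
        self.model_path = model_path
        self._handle = NativeHandle(model_path)

    def tokenize(self, text):
        return self._handle.split(text)

    def __getstate__(self):
        # Save only the Python-side state and drop the native handle.
        state = self.__dict__.copy()
        del state["_handle"]
        return state

    def __setstate__(self, state):
        # Restore the Python-side state, then rebuild the native handle from it.
        self.__dict__.update(state)
        self._handle = NativeHandle(self.model_path)


restored = pickle.loads(pickle.dumps(Tokenizer("model.bin")))
print(restored.tokenize("hello world"))
```

The design choice is that the pickle carries only what is needed to rebuild the native object (here a path), so the restored instance never holds a dangling reference to native memory.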
You are right. I will test further and come back. Thank you.
fixes: #135
https://docs.python.org/ko/3.11/library/pickle.html?highlight=pickle#pickling-class-instances
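Starting with Python 3.11, this appears to have been resolved by defining a default behavior for when __getstate__ is not defined.
The same error occurs on Python 3.11 as well.
It is still needed on Python 3.10 and below.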
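The Python 3.11 change referred to here is the default `object.__getstate__` described in the linked documentation. The sketch below only shows that difference between versions; `Plain` and `get_state` are hypothetical names, and it does not by itself say anything about whether the issue is fixed.

```python
import sys


class Plain:
    """Class with no __getstate__ of its own."""

    def __init__(self):
        self.x = 1


def get_state(obj):
    """Return the object's state, with a fallback for Python 3.10 and below."""
    getter = getattr(obj, "__getstate__", None)
    if getter is not None:
        return getter()
    return dict(vars(obj))


obj = Plain()
# True on Python 3.11+, where object provides a default __getstate__.
has_default = hasattr(obj, "__getstate__")
print(sys.version_info[:2], has_default, get_state(obj))
```

Code that calls `__getstate__` directly therefore needs a fallback like this, or an explicit `__getstate__` on the class, to work on older interpreters.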