Closed shimsan closed 4 years ago
Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug: YES
fastai2 0.0.25, fastcore 0.1.30, sentencepiece 0.1.86
Describe the bug
The functionality to override the tokenizer is missing in 0.0.25. Previously this was done with `tok_func`, like below:

```python
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, tok_func=SentencePieceTokenizer),
    get_items=get_imdb,
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
```
To Reproduce
Colab examples provided below.
Example of how we used to override the tokenizer and train in fastai2 0.0.20 (but it fails at inference, same as https://github.com/fastai/fastai2/issues/424):
https://colab.research.google.com/drive/1Typ_xZWg5Jds-WDP8v2lEwPAoB2EKbn2?usp=sharing
Failing example with fastai2 0.0.25: https://colab.research.google.com/drive/1m7eq3sC8pJBIi79j_hoe-8QOWOfGp1-9?usp=sharing
Expected behavior
Expected to be able to override the tokenizer function and run inference.
Error with full stack trace
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-cf7c257943cd> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok_func=SentencePieceTokenizer), CategoryBlock),
      2                  get_items=get_text_files,
      3                  get_y=parent_label,
      4                  splitter=GrandparentSplitter(valid_name='test'))

2 frames
/usr/local/lib/python3.6/dist-packages/fastai2/text/data.py in from_folder(cls, path, vocab, is_lm, seq_len, backwards, min_freq, max_vocab, **kwargs)
    210     def from_folder(cls, path, vocab=None, is_lm=False, seq_len=72, backwards=False, min_freq=3, max_vocab=60000, **kwargs):
    211         "Build a `TextBlock` from a `path`"
--> 212         return cls(Tokenizer.from_folder(path, **kwargs), vocab=vocab, is_lm=is_lm, seq_len=seq_len,
    213                    backwards=backwards, min_freq=min_freq, max_vocab=max_vocab)
    214

/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in from_folder(cls, path, tok, rules, **kwargs)
    274     path = Path(path)
    275     if tok is None: tok = WordTokenizer()
--> 276     output_dir = tokenize_folder(path, tok=tok, rules=rules, **kwargs)
    277     res = cls(tok, counter=(output_dir/fn_counter_pkl).load(),
    278               lengths=(output_dir/fn_lengths_pkl).load(), rules=rules, mode='folder')

/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in tokenize_folder(path, extensions, folders, output_dir, skip_if_exists, **kwargs)
    182     files = get_files(path, extensions=extensions, recurse=True, folders=folders)
    183     def _f(i,output_dir): return output_dir/files[i].relative_to(path)
--> 184     return _tokenize_files(_f, files, path, skip_if_exists=skip_if_exists, **kwargs)
    185
    186 # Cell

TypeError: _tokenize_files() got an unexpected keyword argument 'tok_func'
```
When using `tok`:

```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-31c08ee3d74f> in <module>()
----> 1 imdb = DataBlock(blocks=(TextBlock.from_folder(path, tok=SentencePieceTokenizer), CategoryBlock),
      2                  get_items=get_text_files,
      3                  get_y=parent_label,
      4                  splitter=GrandparentSplitter(valid_name='test'))

5 frames
/usr/local/lib/python3.6/dist-packages/fastai2/text/core.py in setup(self, items, rules)
    355         from sentencepiece import SentencePieceProcessor
    356         if rules is None: rules = []
--> 357         if self.tok is not None: return {'sp_model': self.sp_model}
    358         raw_text_path = self.cache_dir/'texts.out'
    359         with open(raw_text_path, 'w') as f:

AttributeError: 'L' object has no attribute 'tok'
```
Additional context https://github.com/fastai/fastai2/issues/424
Fixed in master. Note that it should be `tok=SentencePieceTokenizer()` (i.e. with parens) now, since you pass a tok, not a tok_func.
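For reference, the failing classification call from the traceback above would then look like this on master. This is only a sketch, assuming the standard IMDB folder layout used in the notebooks; I haven't run it:

```python
from fastai2.text.all import *

path = untar_data(URLs.IMDB)

# Pass a tokenizer *instance* (note the parens after SentencePieceTokenizer),
# not the class, via tok= rather than the old tok_func=.
imdb = DataBlock(
    blocks=(TextBlock.from_folder(path, tok=SentencePieceTokenizer()),
            CategoryBlock),
    get_items=get_text_files,
    get_y=parent_label,
    splitter=GrandparentSplitter(valid_name='test'))
```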