What model should AutoTokenizer be given? BERT?

lskunk commented 1 year ago

I cannot connect to Hugging Face, so I downloaded the BERT model to my machine and changed the code as follows:

    try:
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
    except:
        self.tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/bert-base-uncased-vocab.txt', use_fast=True)

and

    class ASTE(pl.LightningModule):
        def __init__(self, hparams, data_module):
            super().__init__()
            self.save_hyperparameters(hparams)
            self.data_module = data_module

            self.config = BertConfig.from_pretrained(self.hparams.model_name_or_path)
            self.config.table_num_labels = self.data_module.table_num_labels

Why does the following error occur?

    Original Traceback (most recent call last):
      File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
        return self.collate_fn(data)
      File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 82, in __call__
        batch = self.tokenizer_function(examples)
      File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 154, in tokenizer_function
        encoding = batch_encodings[i]
      File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 247, in __getitem__
        raise KeyError(
    KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

Which model should AutoTokenizer and AutoConfig be pointed at? I would appreciate an answer, thank you.

1140310118 commented 1 year ago

Hello, the argument you pass to BertTokenizer.from_pretrained(model_name_or_path) is not set correctly. model_name_or_path must be either bert-base-uncased or a directory that contains a file named vocab.txt.
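A minimal sketch of checking a local model directory before loading from it. The required file names (vocab.txt and config.json) are taken from this thread; the helper names `missing_model_files` and `load_local_tokenizer` are hypothetical and not part of the repository. Note that the file in the question is named bert-base-uncased-vocab.txt, whereas directory loading looks for vocab.txt.

```python
from pathlib import Path

# File names assumed from this thread: vocab.txt and config.json.
REQUIRED_FILES = ("vocab.txt", "config.json")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        # A path to a single file (e.g. .../vocab.txt) is not a model directory.
        return list(required)
    return [name for name in required if not (root / name).is_file()]


def load_local_tokenizer(model_dir):
    """Load a tokenizer from a local directory after checking its contents."""
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(f"{model_dir} is missing: {', '.join(missing)}")
    # Imported lazily so the directory check works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_dir, use_fast=True)
```

Calling `missing_model_files('/path/to/bert-base-uncased/')` before training should show at a glance whether a file needs to be renamed or downloaded.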

lskunk commented 1 year ago

If this argument were set incorrectly, the code should fail on that line rather than somewhere else. When I do it the way you describe, I get the following error:

      File "/home/lsk/BDTF/code/utils/aste_datamodule.py", line 208, in __init__
        self.tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/', use_fast=True)
      File "/home/lsk/miniconda3/envs/de/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1838, in from_pretrained
        raise EnvironmentError(
    OSError: Can't load tokenizer for '/home/lsk/python/code/bert-base-uncased/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/home/lsk/python/code/bert-base-uncased/' is the correct path to a directory containing all relevant files for a BertTokenizer tokenizer.

Moreover, the following code runs without problems:

    from transformers import AutoTokenizer, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('/home/lsk/python/code/bert-base-uncased/bert-base-uncased-vocab.txt')
    batch_sentences = ["Hello I'm a single sentence", "And another sentence", "And the very very last one"]

    batch = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")

    print(batch)

1140310118 commented 1 year ago

Hello, I ran a test, and the problem still appears to be in how the tokenizer is loaded.

Loading it as follows raises an error:

    tok = BertTokenizer.from_pretrained('/data10T/zhangyice/2023/pretrained_models/bert-base-uncased/', use_fast=True)
    tok(['1', '2'])[0]

The error message is:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/zhangyice/anaconda3/envs/pytorch191/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 247, in __getitem__
        raise KeyError(
    KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

Note that the directory /data10T/zhangyice/2023/pretrained_models/bert-base-uncased/ must contain config.json and vocab.txt. Also, replacing the path above with /data10T/zhangyice/2023/pretrained_models/bert-base-uncased/vocab.txt produces the same error.

However, loading it the following way does not raise an error:

    tok = AutoTokenizer.from_pretrained('/data10T/zhangyice/2023/pretrained_models/bert-base-uncased/', use_fast=True)
    tok(['1', '2'])[0]

evanyfyang commented 1 year ago

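    KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

This error occurs because BertTokenizer has no use_fast parameter. That parameter exists only in AutoTokenizer, where its purpose is to select BertTokenizerFast. BertTokenizer is the slow tokenizer, so the result it returns cannot be indexed with an integer such as batch_encodings[i]. Changing the call here to BertTokenizerFast.from_pretrained() fixes it.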
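The fix above could be sketched as follows, assuming the goal is to keep the AutoTokenizer attempt and fall back to BertTokenizerFast on the same path. The function names `require_fast_tokenizer` and `load_bert_tokenizer` are hypothetical; `is_fast` is the attribute Hugging Face tokenizers expose to tell fast and slow tokenizers apart. This is an illustration rather than the repository's actual code.

```python
def require_fast_tokenizer(tokenizer):
    """Raise early if the tokenizer is not a fast (Rust-backed) tokenizer."""
    # getattr keeps this helper safe for objects that do not define is_fast.
    if not getattr(tokenizer, "is_fast", False):
        raise TypeError(
            f"{type(tokenizer).__name__} is a slow tokenizer; "
            "batch_encodings[i] needs a fast one such as BertTokenizerFast."
        )
    return tokenizer


def load_bert_tokenizer(model_name_or_path):
    """Try AutoTokenizer first, then fall back to BertTokenizerFast."""
    # Imported lazily so require_fast_tokenizer can be used on its own.
    from transformers import AutoTokenizer, BertTokenizerFast

    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
    except (OSError, ValueError):
        # Fall back to the fast BERT class instead of the slow BertTokenizer.
        tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)
    return require_fast_tokenizer(tokenizer)
```

Checking `is_fast` at load time moves the failure from the data loader worker, where the traceback is harder to read, to the line that creates the tokenizer.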

HITSZ-HLT / BDTF-ABSA

What model should AutoTokenizer be given? BERT? #2