deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

How to load a local reader model in China (transformer) #1523

Closed: datalee closed this issue 3 years ago

datalee commented 3 years ago

A model like this: https://huggingface.co/wptoux/albert-chinese-large-qa


Edit (Timo): The problem was accessing the Hugging Face model hub from China. From there you must add a mirror:

AutoModel.from_pretrained('bert-base-uncased', mirror='tuna')

as datalee pointed out below
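
For completeness, a minimal sketch of that workaround. The mirror argument existed in transformers releases of that era and was later removed, so treat the exact signature as version-dependent:

from transformers import AutoModel, AutoTokenizer

# 'tuna' routes downloads through the Tsinghua University mirror;
# 'bfsu' was the other built-in option at the time.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', mirror='tuna')
model = AutoModel.from_pretrained('bert-base-uncased', mirror='tuna')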

Timoeller commented 3 years ago

Hey, sorry for the late reply; I actually had to dig into the code a bit. There are two ways to load models.

Loading model with internet access

The easiest is to load a QAInferencer with your model name supplied; it then downloads and converts everything you need. You can do that with:

infer = QAInferencer.load(model_name_or_path="wptoux/albert-chinese-large-qa", task_type="question_answering", gpu=True)

Then you can save and load that Inferencer with .save("foo/bar") and .load("foo/bar").
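
A hedged sketch of that round trip, where "my_local_model" is just a hypothetical directory name:

# Persist the downloaded/converted model, then reload it offline
# without touching the network again.
infer.save("my_local_model")
infer = QAInferencer.load("my_local_model", task_type="question_answering", gpu=True)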

Loading transformers model saved locally

If your transformers model happens to be saved locally (e.g. via git lfs cloning) and you do not have internet access on that machine, you have to construct the FARM Reader model manually. Here is the code for it:

from farm.modeling.adaptive_model import AdaptiveModel
from farm.utils import initialize_device_settings
from farm.infer import Inferencer, QAInferencer
from farm.data_handler.processor import SquadProcessor
from farm.modeling.tokenization import Tokenizer
from pathlib import Path

device, n_gpu = initialize_device_settings(use_cuda=True)

model_name = "local_models/xlm-roberta-base-squad2"

# Load a Tokenizer
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=model_name
)
# stick it into a Processor
label_list = ["start_token", "end_token"]
processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=256,
    label_list=label_list,
    data_dir=Path("../data/squad20"),
)

# Convert the local transformers model to FARM style
model = AdaptiveModel.convert_from_transformers(
    model_name, device=device, task_type="question_answering"
)

# Put everything into an Inferencer - at this point you have a Reader model
infer = Inferencer(
    model=model, processor=processor, task_type="question_answering"
)

# Testing the Reader Model
QA_input = [
    {
        "questions": ["Who counted the game among the best ever made?"],
        "text": "Twilight Princess was released to universal critical acclaim and commercial success. It received perfect scores from major publications such as 1UP.com, Computer and Video Games, Electronic Gaming Monthly, Game Informer, GamesRadar, and GameSpy. On the review aggregators GameRankings and Metacritic, Twilight Princess has average scores of 95% and 95 for the Wii version and scores of 95% and 96 for the GameCube version. GameTrailers in their review called it one of the greatest games ever created."
    }]
print(infer.inference_from_dicts(dicts=QA_input)[0])
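
In case it helps, a hedged sketch for inspecting the output: depending on the FARM version, inference_from_dicts returns either plain dicts or QAPred objects (the latter expose to_json()), so normalizing first is safest.

# Normalize the result before inspecting it. Older FARM releases return
# plain dicts; newer 0.x releases return QAPred objects.
for res in infer.inference_from_dicts(dicts=QA_input):
    if hasattr(res, "to_json"):
        res = res.to_json()
    print(res)  # contains the predicted answer span(s) and scores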

Btw, what are you using this model for? Are you using it in Haystack?

datalee commented 3 years ago

@Timoeller Using the 1st way, I get this error:

OSError: Can't load config for 'wptoux/albert-chinese-large-qa'. Make sure that:
- 'wptoux/albert-chinese-large-qa' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'wptoux/albert-chinese-large-qa' is the correct path to a directory containing a config.json file

datalee commented 3 years ago

@Timoeller Using the 2nd way, I also get an error:

01/28/2021 10:39:11 - INFO - farm.modeling.tokenization - Loading tokenizer of type 'AlbertTokenizer'
Traceback (most recent call last):
  File "../cov_demo.py", line 13, in <module>
    pretrained_model_name_or_path=model_name
  File ".\Anaconda3\lib\site-packages\farm\modeling\tokenization.py", line 83, in load
    ret = AlbertTokenizer.from_pretrained(pretrained_model_name_or_path, keep_accents=True, **kwargs)
  File ".\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 1428, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File ".\Anaconda3\lib\site-packages\transformers\tokenization_utils_base.py", line 1575, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File ".\Anaconda3\lib\site-packages\transformers\tokenization_albert.py", line 155, in __init__
    self.sp_model.Load(vocab_file)
  File ".\AppData\Roaming\Python\Python36\site-packages\sentencepiece.py", line 118, in Load
    return _sentencepiece.SentencePieceProcessor_Load(self, filename)
TypeError: not a string

My local transformers model directory looks like this: [screenshot of the model files]

Timoeller commented 3 years ago

Hey, I can replicate your problems with wptoux/albert-chinese-large-qa.

As they state in the model card, you have to use the BertTokenizer (the repo ships a BERT-style vocab.txt rather than the sentencepiece file that AlbertTokenizer expects, which is why sentencepiece failed with "not a string"):

Important: use BertTokenizer

So you can load the model like:

infer = QAInferencer.load(model_name_or_path="wptoux/albert-chinese-large-qa", tokenizer_class="BertTokenizer", task_type="question_answering", gpu=True)
datalee commented 3 years ago

> As they state in the model card, you have to use the BertTokenizer. So you can load the model like:
>
> infer = QAInferencer.load(model_name_or_path="wptoux/albert-chinese-large-qa", tokenizer_class="BertTokenizer", task_type="question_answering", gpu=True)

That does not work; you can try it yourself.

Timoeller commented 3 years ago

It works for me, that is why I posted it.

Which FARM version are you using? Do you have all requirements installed? What is your error?

datalee commented 3 years ago

> It works for me, that is why I posted it.
>
> Which FARM version are you using? Do you have all requirements installed? What is your error?

farm 0.5.0, farm-haystack 0.6.0

[screenshot of the error]

Timoeller commented 3 years ago

Please update to the latest version, haystack 0.7.0, and try again.

If you encounter a problem in haystack please also raise the issue there, it is easier to track progress, help you and let others find the solution to your problem as well.

datalee commented 3 years ago

> Please update to the latest version, haystack 0.7.0, and try again.

OK. In China, you must add a mirror:

AutoModel.from_pretrained('bert-base-uncased', mirror='tuna')

Timoeller commented 3 years ago

Nice, thanks for the update.

So the issue is fixed? Closing now. Feel free to reopen, or open an issue in haystack when it is related to the Reader models there.