deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Cannot load model from local dir #299

Closed RobKnop closed 4 years ago

RobKnop commented 4 years ago

Describe the bug
I want to do this with Haystack:

### Inference ############

# Load model
reader = FARMReader(model_name_or_path="../../saved_models/twmkn9/albert-base-v2-squad2", use_gpu=False)

I finetuned the model before and saved it to my local dir. Here the code:

### TRAINING #############
# Let's take a reader as a base model
reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", max_seq_len=512, use_gpu=False)

# and fine-tune it on your own custom dataset (should be in SQuAD like format)
train_data = "training_data"
reader.train(data_dir=train_data, train_filename="2020-02-23_answers.json", test_file_name='TEST_answers.json', use_gpu=False, n_epochs=1, dev_split=0.1)

Error message

03/28/2020 22:25:07 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/28/2020 22:25:07 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
03/28/2020 22:25:07 - WARNING - farm.modeling.prediction_head -   Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"training": true, "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "name": "QuestionAnsweringHead"}
03/28/2020 22:25:07 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
03/28/2020 22:25:07 - INFO - farm.modeling.prediction_head -   Loading prediction head from ../../saved_models/twmkn9/albert-base-v2-squad2/prediction_head_0.bin
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Documents/CodingProjects/NLPofTimFerrissShow/QnA_with_Tim_Haystack.py in <module>
      51 
      52 # Load model
----> 53 reader = FARMReader(model_name_or_path="../../saved_models/twmkn9/albert-base-v2-squad2", use_gpu=False)
      54 # A retriever identifies the k most promising chunks of text that might contain the answer for our question
      55 # Retrievers use some simple but fast algorithm, here: TF-IDF

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/haystack/reader/farm.py in __init__(self, model_name_or_path, context_window_size, batch_size, use_gpu, no_ans_boost, top_k_per_candidate, top_k_per_sample, max_processes, max_seq_len, doc_stride)
     79         self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu,
     80                                           task_type="question_answering", max_seq_len=max_seq_len,
---> 81                                           doc_stride=doc_stride)
     82         self.inferencer.model.prediction_heads[0].context_window_size = context_window_size
     83         self.inferencer.model.prediction_heads[0].no_ans_boost = no_ans_boost

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride)
    139                 processor = InferenceProcessor.load_from_dir(model_name_or_path)
    140             else:
--> 141                 processor = Processor.load_from_dir(model_name_or_path)
    142 
    143         # b) or from remote transformers model hub

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
    189         del config["tokenizer"]
    190 
--> 191         processor = cls.load(tokenizer=tokenizer, processor_name=config["processor"], **config)
    192 
    193         for task_name, task in config["tasks"].items():

TypeError: load() missing 1 required positional argument: 'data_dir'

Expected behavior: there is no error.

Additional context: I use Haystack.


Timoeller commented 4 years ago

Seems like the processor doesn't have the data_dir parameter set.

Did you fine-tune the model with FARM or Hugging Face Transformers?

Please also post the content of the processor config in here.
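If you are unsure where to find it: it is the processor_config.json inside the model save directory, and something like this prints it (a minimal sketch; the path is the save dir from your snippet above):

import json

with open("../../saved_models/twmkn9/albert-base-v2-squad2/processor_config.json") as f:
    print(json.dumps(json.load(f), indent=2))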

RobKnop commented 4 years ago

reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", max_seq_len=512, use_gpu=False)

I used this model: twmkn9/albert-base-v2-squad2

RobKnop commented 4 years ago

I don't have any processor config. :/

Timoeller commented 4 years ago

There are conversion scripts to convert from Transformers to FARM. Please try these. If that doesn't work, I will look into it tomorrow.
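For reference, the core of such a conversion uses FARM's AdaptiveModel.convert_from_transformers (a minimal sketch; paths and device are illustrative, and saving the processor is covered in the snippet further down this thread):

from farm.modeling.adaptive_model import AdaptiveModel

# Convert a Transformers QA model into FARM's AdaptiveModel format
model = AdaptiveModel.convert_from_transformers(
    "../../saved_models/twmkn9/albert-base-v2-squad2",
    device="cpu",
    task_type="question_answering",
)
model.save("../../saved_models/albert-farm-format")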

RobKnop commented 4 years ago

Did you mean this script?

I used Case 2: reader = Inferencer.load("../../saved_models/twmkn9/albert-base-v2-squad2", task_type="question_answering")

I get the same error:

03/29/2020 11:51:55 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/29/2020 11:51:56 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
03/29/2020 11:51:56 - WARNING - farm.modeling.prediction_head -   Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"training": true, "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "name": "QuestionAnsweringHead"}
03/29/2020 11:51:56 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
03/29/2020 11:51:56 - INFO - farm.modeling.prediction_head -   Loading prediction head from ../../saved_models/twmkn9/albert-base-v2-squad2/prediction_head_0.bin
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Documents/CodingProjects/NLPofTimFerrissShow/QnA_with_Tim_Haystack.py in <module>
----> 60 reader = Inferencer.load("../../saved_models/twmkn9/albert-base-v2-squad2", task_type="question_answering")

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride)
    139                 processor = InferenceProcessor.load_from_dir(model_name_or_path)
    140             else:
--> 141                 processor = Processor.load_from_dir(model_name_or_path)
    142 
    143         # b) or from remote transformers model hub

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
    189         del config["tokenizer"]
    190 
--> 191         processor = cls.load(tokenizer=tokenizer, processor_name=config["processor"], **config)
    192 
    193         for task_name, task in config["tasks"].items():

TypeError: load() missing 1 required positional argument: 'data_dir'

As you said, the load() method of the Processor class is missing a required argument.

Timoeller commented 4 years ago

Hey @RobKnop, thanks for trying out the conversion script. That might have been bad advice, though: the conversion should happen under the hood.

I could reproduce your bug with minimal code inside haystack by doing:

reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", use_gpu=False)
reader.save("data/albert-temp") 
reader2 = FARMReader(model_name_or_path="data/albert-temp", use_gpu=False)

So I opened an issue there (deepset-ai/haystack/issues/49) and we will fix it. Thanks for reporting!

As a quick workaround (for which you might need internet access) you could load the model using the Hugging Face model name "twmkn9/albert-base-v2-squad2" as before and continue training with this model. The model should already be cached if you downloaded it on the same machine before.
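A minimal sketch of this workaround, reusing the snippets from above:

# Load via the model hub name (served from the local cache once downloaded)
reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", max_seq_len=512, use_gpu=False)
# ...and continue fine-tuning as before
reader.train(data_dir="training_data", train_filename="2020-02-23_answers.json", use_gpu=False, n_epochs=1, dev_split=0.1)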

Hope that helps.

tholor commented 4 years ago

Fixed by #300

ahotrod commented 4 years ago

@tholor

Still having a related problem after installing the latest FARM & Haystack versions: I am unable to load my candidate models from local directories. The following is an example from your "playbook".

Loading from transformers' model storage works as always: reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", ...)

Loading the same model stored in a local directory does not: reader = FARMReader(model_name_or_path="/runs/roberta_base_squad2", ... )

Error message: FileNotFoundError: No such file or directory: processor_config.json

~/anaconda3/envs/nlp/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride, extraction_layer, extraction_strategy)
    162                 processor = InferenceProcessor.load_from_dir(model_name_or_path)
    163             else:
--> 164                 processor = Processor.load_from_dir(model_name_or_path)
    165 
    166         # b) or from remote transformers model hub

~/anaconda3/envs/nlp/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
    177         # read config
    178         processor_config_file = Path(load_dir) / "processor_config.json"
--> 179         config = json.load(open(processor_config_file))
    180         # init tokenizer
    181         if "lower_case" in config.keys():

FileNotFoundError: [Errno 2] No such file or directory: '/runs/roberta_base_squad2/processor_config.json'

Platform: Linux-4.15.0-91-generic-x86_64-with-debian-buster-sid
Python version: 3.7.7
PyTorch version (GPU?): 1.4.0 (True)
Tensorflow version (GPU?): 2.1.0 (True)
transformers version: 2.7.0
farm version: 0.4.2
farm-haystack: 0.1.3

tholor commented 4 years ago

Hey @ahotrod, right now we only support two options to load a FARMReader: a) a local FARM model, or b) a remote Transformers model.

I guess your error comes up when trying to load a local model in Transformers format? I think it makes total sense to support this and we will put it in the backlog. However, it might take some days as we are currently quite busy with a few other features.
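To illustrate the two currently supported options (the local path is illustrative):

# a) local FARM model directory (must contain processor_config.json etc.)
reader = FARMReader(model_name_or_path="path/to/local/farm-model", use_gpu=False)
# b) remote Transformers model from the model hub
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)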

RobKnop commented 4 years ago

I still have the issue. I updated to the latest FARM version (0.4.2) and also tried the code on the current master branch.

I get the same error as stated above.

RobKnop commented 4 years ago

one more addition:

reader.save() works

reader.train(save_dir='xxxx') doesn’t work

tholor commented 4 years ago

I am currently exploring this, but have trouble reproducing the issue. Running the following script with latest haystack (https://github.com/deepset-ai/haystack/commit/5932aa01c3f1d84e030c38849fa89e1f5f2770c2) and FARM == 0.4.2 saves the trained model correctly in mymodel/debug and loads it back again into the reader (Ubuntu 18.04, GPU).

from haystack.reader.farm import FARMReader

reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
train_data = "data/squad_small"
reader.train(data_dir=train_data, train_filename="train.json", batch_size=4, n_epochs=1, save_dir="mymodel/debug")
reader = FARMReader(model_name_or_path="mymodel/debug", use_gpu=True)

The resulting directory with the saved model looks like this: (screenshot of the directory contents in the original comment)
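For reference, a FARM save directory of this kind typically includes at least the following files (reconstructed from the logs and errors in this thread, not from the screenshot; tokenizer files vary by model):

language_model.bin
language_model_config.json
prediction_head_0.bin
prediction_head_0_config.json
processor_config.json
vocab.txt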

A few questions to narrow this down together:

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

benscottie commented 3 years ago

@tholor Hey! I am trying to load a locally saved Transformers model into the FARMReader. This is important to my project since the FARMReader seems to perform significantly better than the TransformersReader. Has there been any progress in allowing local Transformers models to be compatible with the FARMReader? Or are there any workarounds?

Timoeller commented 3 years ago

The only way I see is to load your model into FARM, then save it as a FARM model, then load it in Haystack as a FARMReader.

You should be able to do all of this with your local Transformers model as follows (please adjust the parameters accordingly):


    # Imports (in FARM these live in the following modules)
    from farm.data_handler.processor import SquadProcessor
    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.modeling.tokenization import Tokenizer

    # Placeholders: adjust these to your setup
    model_name_or_path = "path/to/local/transformers/model"
    do_lower_case = False
    device = "cpu"
    data_dir = "data"
    evaluation_filename = None
    save_dir = "saved_models/converted"

    # Convert the Transformers model into a FARM AdaptiveModel
    model = AdaptiveModel.convert_from_transformers(model_name_or_path, device=device, task_type="question_answering")

    tokenizer = Tokenizer.load(pretrained_model_name_or_path=model_name_or_path, do_lower_case=do_lower_case)
    processor = SquadProcessor(
        tokenizer=tokenizer,
        max_seq_len=256,
        label_list=["start_token", "end_token"],
        metric="squad",
        train_filename=None,
        dev_filename=None,
        dev_split=0,
        test_filename=evaluation_filename,
        data_dir=data_dir,
        doc_stride=128,
    )
    # Connect the QA head with the processor's tasks (processor.tasks; no DataSilo is needed here)
    model.connect_heads_with_processor(processor.tasks, require_labels=True)

    # Saving both writes the files a FARMReader needs, including processor_config.json
    model.save(save_dir)
    processor.save(save_dir)
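After that, the saved directory should load as a local FARM model in Haystack (a minimal sketch, reusing the save_dir placeholder from above):

    from haystack.reader.farm import FARMReader

    reader = FARMReader(model_name_or_path=save_dir, use_gpu=False)
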
benscottie commented 3 years ago

That did the trick, thanks!

Timoeller commented 3 years ago

Nice! Always happy to help, thanks for reporting back here.

cschloh commented 2 years ago

https://github.com/deepset-ai/FARM/issues/299#issuecomment-735745722

Hi @Timoeller, I am trying to verify whether I used the correct libraries/imports to use your solution code :)

Issue 1

from farm.data_handler.processor import SquadProcessor
from farm.data_handler.data_silo import DataSilo
or
from farm.data_handler import data_silo
from farm.modeling.adaptive_model import AdaptiveModel

Issue 2

Error: module 'farm.data_handler.data_silo' has no attribute 'processor'

Could you confirm what import you used?

gokul427 commented 2 years ago

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="all-MiniLM-L6-v2",
    use_gpu=False
)

I downloaded sentence-transformers/all-MiniLM-L6-v2 locally, but it does not have a processor_config.json file.

File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/nodes/retriever/_embedding_encoder.py", line 48, in init self.embedding_model = Inferencer.load( File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/modeling/infer.py", line 187, in load processor = InferenceProcessor.load_from_dir(model_name_or_path) File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/modeling/data_handler/processor.py", line 1948, in load_from_dir config = json.load(open(processor_config_file)) FileNotFoundError: [Errno 2] No such file or directory: 'all-MiniLM-L6-v2/processor_config.json'

Timoeller commented 2 years ago

Hey, this seems to be an issue related to our active project Haystack rather than FARM. Please ask there next time.

Looking at your code, you could try loading your model differently, like we show in our docs:

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers"
)

bala1802 commented 2 years ago

I don't have any processor config. :/

Hi, may I please know how this issue is resolved?

Timoeller commented 2 years ago

Hey, please create a new issue in https://github.com/deepset-ai/haystack/issues and describe your problem in more detail there. This issue was about FARMReader models; I believe you want to know about sentence-transformers models?

prathmeshgd commented 1 year ago
config = json.load(open(processor_config_file))

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/c/Venv_wsl/New folder/Fri_meeting/deberta-v3-large-squad2/processor_config.json'

ERROR:posthog:error uploading: HTTPSConnectionPool(host='tm.hs.deepset.ai', port=443): Max retries exceeded with url: /batch/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

I am also getting the same issue. I don't have any file named processor_config.json, but the code is looking for this file when I run it on a WSL system. The code works fine in a Jupyter notebook.