Seems like the processor doesn't have the data_dir parameter set.
Did you fine-tune the model with farm or huggingface transformers?
Please also post the content of the processor config here.
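For reference, something like this would print it (a sketch; the model directory path is a placeholder):

import json
from pathlib import Path

# Placeholder path - point this at the directory the model was saved to
config_path = Path("path/to/saved_model") / "processor_config.json"
if config_path.exists():
    print(json.dumps(json.load(config_path.open()), indent=2))
else:
    print("No processor_config.json found in", config_path.parent)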
FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2")
I used this model: twmkn9/albert-base-v2-squad2
I don't have any processor config. :/
There are conversion scripts to convert from transformers to farm. Please try these. If that doesn't work I will look into it tomorrow.
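For reference, the conversion can also be done programmatically; a minimal sketch, assuming FARM's AdaptiveModel.convert_from_transformers (the same helper that comes up later in this thread):

import torch
from farm.modeling.adaptive_model import AdaptiveModel

# Convert the Transformers QA model into a FARM AdaptiveModel and save it locally.
# Note: this saves the model weights and prediction head, but not a processor_config.json.
device = torch.device("cpu")
model = AdaptiveModel.convert_from_transformers(
    "twmkn9/albert-base-v2-squad2", device=device, task_type="question_answering"
)
model.save("saved_models/twmkn9/albert-base-v2-squad2")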
Did you mean this script?
I used Case 2:
reader = Inferencer.load("../../saved_models/twmkn9/albert-base-v2-squad2", task_type="question_answering")
I get the same error:
03/29/2020 11:51:55 - INFO - farm.utils - device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/29/2020 11:51:56 - INFO - farm.modeling.adaptive_model - Found files for loading 1 prediction heads
03/29/2020 11:51:56 - WARNING - farm.modeling.prediction_head - Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"training": true, "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "name": "QuestionAnsweringHead"}
03/29/2020 11:51:56 - INFO - farm.modeling.prediction_head - Prediction head initialized with size [768, 2]
03/29/2020 11:51:56 - INFO - farm.modeling.prediction_head - Loading prediction head from ../../saved_models/twmkn9/albert-base-v2-squad2/prediction_head_0.bin
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/Documents/CodingProjects/NLPofTimFerrissShow/QnA_with_Tim_Haystack.py in
----> 60 reader = Inferencer.load("../../saved_models/twmkn9/albert-base-v2-squad2", task_type="question_answering")
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride)
139 processor = InferenceProcessor.load_from_dir(model_name_or_path)
140 else:
--> 141 processor = Processor.load_from_dir(model_name_or_path)
142
143 # b) or from remote transformers model hub
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
189 del config["tokenizer"]
190
--> 191 processor = cls.load(tokenizer=tokenizer, processor_name=config["processor"], **config)
192
193 for task_name, task in config["tasks"].items():
TypeError: load() missing 1 required positional argument: 'data_dir'
As you said, the load() method of the Processor class is missing a required argument.
Hey @RobKnop thanks for trying out the conversion script. Though that might have been bad advice. The conversion should happen under the hood.
I could reproduce your bug with minimal code inside haystack by doing:
reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", use_gpu=False)
reader.save("data/albert-temp")
reader2 = FARMReader(model_name_or_path="data/albert-temp", use_gpu=False)
So I opened an issue there (deepset-ai/haystack/issues/49) and we will fix it. Thanks for reporting!
As a quick work-around (which may require internet access), you could load the model by its Hugging Face model name "twmkn9/albert-base-v2-squad2" as before and continue training with it. The model should already be cached if you downloaded it before on the same machine.
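In code, the work-around would look roughly like this (a sketch; the training data paths are placeholders):

from haystack.reader.farm import FARMReader

# Load by Hugging Face model name; the local cache is used if the model was downloaded before
reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", use_gpu=False)

# Continue training; data_dir and train_filename are placeholder values
reader.train(data_dir="data/squad", train_filename="train.json", n_epochs=1, save_dir="my_model")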
Hope that helps.
Fixed by #300
@tholor
Still having a related problem after installing the latest FARM & Haystack versions. Unable to load my candidate models from local directories. Following is an example from your "playbook".
Loading from transformers' model storage works as always:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", ...)
Loading the same model stored in a local directory does not:
reader = FARMReader(model_name_or_path="/runs/roberta_base_squad2", ... )
Error message: FileNotFoundError: No such file or directory: processor_config.json
~/anaconda3/envs/nlp/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride, extraction_layer, extraction_strategy)
162 processor = InferenceProcessor.load_from_dir(model_name_or_path)
163 else:
--> 164 processor = Processor.load_from_dir(model_name_or_path)
165
166 # b) or from remote transformers model hub
~/anaconda3/envs/nlp/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
177 # read config
178 processor_config_file = Path(load_dir) / "processor_config.json"
--> 179 config = json.load(open(processor_config_file))
180 # init tokenizer
181 if "lower_case" in config.keys():
FileNotFoundError: [Errno 2] No such file or directory: '/runs/roberta_base_squad2/processor_config.json'
Platform: Linux-4.15.0-91-generic-x86_64-with-debian-buster-sid
Python version: 3.7.7
PyTorch version (GPU?): 1.4.0 (True)
Tensorflow version (GPU?): 2.1.0 (True)
transformers version: 2.7.0
farm version: 0.4.2
farm-haystack: 0.1.3
Hey @ahotrod, right now we only support two options to load a FARMReader:
a) a local FARM model
b) a remote Transformers model
I guess your error comes up when trying to load a local model in Transformers format? I think it makes total sense to support this and we will put it on the backlog. However, it might take some days as we are currently quite busy with a few other features.
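To illustrate the two options (a sketch; the local path is a placeholder):

from haystack.reader.farm import FARMReader

# a) local FARM model: the directory must contain processor_config.json, prediction_head_0.bin, etc.
reader = FARMReader(model_name_or_path="/path/to/local_farm_model")

# b) remote Transformers model: fetched from the model hub and converted under the hood
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")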
I still have the issue. I updated to the latest FARM version (0.4.2) and also tried the code from the current master branch.
I get the same error as stated above.
One more addition:
reader.save() works, but
reader.train(save_dir='xxxx') doesn't.
I am currently exploring this, but have trouble reproducing the issue.
Running the following script with the latest haystack (https://github.com/deepset-ai/haystack/commit/5932aa01c3f1d84e030c38849fa89e1f5f2770c2) and FARM == 0.4.2 saves the trained model correctly in mymodel/debug and loads it back again into the reader (Ubuntu 18.04, GPU).
from haystack.reader.farm import FARMReader
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
train_data = "data/squad_small"
reader.train(data_dir=train_data, train_filename="train.json", batch_size=4, n_epochs=1, save_dir="mymodel/debug")
reader = FARMReader(model_name_or_path="mymodel/debug", use_gpu=True)
The resulting directory with the saved model looks like this:
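(The directory listing screenshot is not preserved here; for a model saved with FARM 0.4.x it typically contains files along these lines:)

language_model.bin
language_model_config.json
prediction_head_0.bin
prediction_head_0_config.json
processor_config.json
special_tokens_map.json
tokenizer_config.json
vocab.txt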
A few questions to narrow this down together:
1. Did you train the model via reader.train() with FARM 0.4.2, or is that an older model?
2. Do you get the exact same error as above (TypeError: load() missing 1 required positional argument: 'data_dir')?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.
@tholor Hey! I am trying to load a locally saved Transformers model into the FARMReader. This is important to my project since the FARMReader seems to perform significantly better than the TransformersReader. Has there been any progress in allowing local Transformers models to be compatible with the FARMReader? Or are there any workarounds?
The only way I see is to load your model into FARM, then save it as a FARM model, then load it in haystack as a FARMReader.
You should be able to do all of this with your local Transformers model as follows (please adjust the parameters accordingly):
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import SquadProcessor

# Convert the local Transformers model into a FARM AdaptiveModel
model = AdaptiveModel.convert_from_transformers(model_name_or_path, device=device, task_type="question_answering")
tokenizer = Tokenizer.load(pretrained_model_name_or_path=model_name_or_path, do_lower_case=do_lower_case)

# Build a processor so that a processor_config.json is saved alongside the model
processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=256,
    label_list=["start_token", "end_token"],
    metric="squad",
    train_filename=None,
    dev_filename=None,
    dev_split=0,
    test_filename=evaluation_filename,
    data_dir=data_dir,
    doc_stride=128,
)

# Connect the prediction head to the processor's tasks and save both.
# (Originally written as data_silo.processor.tasks, but data_silo is not defined
# in this snippet; the processor created above can be used directly.)
model.connect_heads_with_processor(processor.tasks, require_labels=True)
model.save(save_dir)
processor.save(save_dir)
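Once saved, the directory is a regular FARM model and should load in Haystack like any other local model (using the same save_dir as above):

from haystack.reader.farm import FARMReader
reader = FARMReader(model_name_or_path=save_dir)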
That did the trick, thanks!
Nice! Always happy to help, thanks for reporting back here.
Hi @Timoeller, I am trying to verify whether I used the correct libraries/imports for your solution code :)

from farm.data_handler.processor import SquadProcessor
from farm.data_handler.data_silo import DataSilo
# or
from farm.data_handler import data_silo
from farm.modeling.adaptive_model import AdaptiveModel
Error: module 'farm.data_handler.data_silo' has no attribute 'processor'
Could you confirm what import you used?
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="all-MiniLM-L6-v2",
    use_gpu=False
)
I downloaded sentence-transformers/all-MiniLM-L6-v2 locally, but it does not have a processor_config.json file.
File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/nodes/retriever/_embedding_encoder.py", line 48, in init self.embedding_model = Inferencer.load( File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/modeling/infer.py", line 187, in load processor = InferenceProcessor.load_from_dir(model_name_or_path) File "/Users/ghraj/myhaystack/lib/python3.8/site-packages/haystack/modeling/data_handler/processor.py", line 1948, in load_from_dir config = json.load(open(processor_config_file)) FileNotFoundError: [Errno 2] No such file or directory: 'all-MiniLM-L6-v2/processor_config.json'
Hey, this seems to be an issue related to our active project haystack rather than FARM; please ask there next time.
Looking at your code, you could try loading the model differently, as we show in our docs:
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    model_format="sentence_transformers"
)
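If you have the model on disk, pointing embedding_model at the local directory should presumably work as well, as long as model_format stays "sentence_transformers" (an assumption; the sentence-transformers loader does not require a processor_config.json):

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="path/to/all-MiniLM-L6-v2",  # local copy; path is a placeholder
    model_format="sentence_transformers"
)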
Hi, may I please know how this issue was resolved?
Hey, please create a new issue in https://github.com/deepset-ai/haystack/issues and describe your problem in more detail there. This issue was about FARMReader models; I believe you want to know about sentence-transformers models?
config = json.load(open(processor_config_file))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/c/Venv_wsl/New folder/Fri_meeting/deberta-v3-large-squad2/processor_config.json'
ERROR:posthog:error uploading: HTTPSConnectionPool(host='tm.hs.deepset.ai', port=443): Max retries exceeded with url: /batch/ (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
I am also getting the same issue. I don't have any file named processor_config.json, but the code is looking for it when I run on a WSL system. The code works fine in a Jupyter notebook.
Describe the bug
I want to do this with Haystack:
I fine-tuned the model before and saved it to my local dir. Here is the code:

Error message

Expected behavior
There is no error.

Additional context
I use Haystack.

To Reproduce
Steps to reproduce the behavior

System: