deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

How to convert the xlm_roberta (PyTorch) model to ONNX? #537

Closed laifuchicago closed 4 years ago

laifuchicago commented 4 years ago

To Author: (1) If I want to convert a PyTorch model (xlm-roberta) to ONNX, is there any reference? How do I set the parameters such as the dummy input (input_ids, token_type_ids, and attention_mask)? The figure shows my XLM-RoBERTa model trained with FARM. [screenshot: ask deepset1]

The following code is my sample (PyTorch to ONNX). The inputs "input_ids", "token_type_ids" and "attention_mask" are torch tensors of shape batch x seq_len:

```python
model_onnx_path = "model.onnx"

# dummy inputs: torch tensors of shape (batch, seq_len)
dummy_input = (input_ids, token_type_ids, attention_mask)
input_names = ["input_ids", "token_type_ids", "attention_mask"]
output_names = ["output"]

# convert model to onnx
torch.onnx.export(model, dummy_input, model_onnx_path,
                  input_names=input_names,
                  output_names=output_names,
                  verbose=False)
```

(2) If we use ONNX, can your Haystack still do retrieval with Elasticsearch?

Thank you Jonathan Sung

Timoeller commented 4 years ago

Hey @laifuchicago, how about converting the model to transformers and using their ONNX conversion notebook?

In transformers, the conversion works for xlm-roberta. If the model is larger than 2 GB, you have to set `use_external_format=True` and be sure to update to PyTorch 1.6.0.

I am not sure about the solution you proposed. I see you also use "token_type_ids", which are not used in xlm-roberta, so it most likely will not work.
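The transformers route suggested here can be sketched as below. It uses `convert_graph_to_onnx.convert`, the conversion entry point in transformers versions current at the time of this issue; the wrapper function name and the model directory are illustrative, and exact argument names may differ in your transformers version:

```python
from pathlib import Path

def export_xlmr_to_onnx(model_dir: str, onnx_out: str = "onnx/xlmr.onnx"):
    """Export a transformers XLM-R model to ONNX.

    Uses transformers' convert_graph_to_onnx helper. The import is done
    lazily so this sketch can be defined without transformers installed.
    """
    from transformers.convert_graph_to_onnx import convert

    convert(
        framework="pt",              # export the PyTorch graph
        model=model_dir,             # local dir or hub name of the converted model
        output=Path(onnx_out),
        opset=11,
        use_external_format=True,    # needed for models > 2 GB (requires PyTorch >= 1.6)
        pipeline_name="question-answering",
    )

if __name__ == "__main__":
    # export_xlmr_to_onnx("my-xlm-roberta-qa")  # heavy: downloads/loads the model
    pass
```

The heavy call is commented out; pass the directory of your converted model to actually run the export.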

laifuchicago commented 4 years ago

To Author: Now the question is: can I convert my own model (already trained on my own data), or only their pretrained models?

Jonathan Sung

Timoeller commented 4 years ago

You can convert FARM models to transformers ones and then continue.

We have conversion scripts between transformers and FARM models; your use case is covered here.
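The FARM-to-transformers step can be sketched roughly as follows. `AdaptiveModel.convert_to_transformers()` is FARM's conversion entry point; the wrapper function, paths, and the tokenizer save step are illustrative assumptions, and return types vary across FARM versions:

```python
from pathlib import Path

def farm_to_transformers(farm_model_dir: str, out_dir: str = "transformers-export"):
    """Convert a trained FARM model directory to a transformers model.

    Imports are lazy so this sketch can be defined without FARM installed.
    """
    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.modeling.tokenization import Tokenizer

    model = AdaptiveModel.load(farm_model_dir, device="cpu")
    converted = model.convert_to_transformers()
    # in some FARM versions this returns a list (one model per prediction head)
    transformers_model = converted[0] if isinstance(converted, list) else converted

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    transformers_model.save_pretrained(out_dir)

    # save the tokenizer alongside so the exported directory is self-contained
    tokenizer = Tokenizer.load(farm_model_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir
```

The exported directory can then be fed to the transformers ONNX conversion.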

laifuchicago commented 4 years ago

To Author: After I converted the model, I got this TypeError. Do you have any idea? Thank you. [screenshot: ask deepset4]

[screenshot: ask deepset2]

Jonathan Sung

Timoeller commented 4 years ago

I cannot reproduce that error. Did you install the latest FARM + dependencies? Can you post the full code you used? The error says it cannot find the vocab file. XLM-R does not have a vocab file but uses a sentencepiece.bpe.model to tokenize. Possibly the model name string does not include "xlm-r"?

laifuchicago commented 4 years ago

To Author: This is the code that I use. [screenshot: ask deepset3]

And these are what I imported:

```python
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.tokenization import Tokenizer
from farm.infer import Inferencer
import pprint
from transformers.pipelines import pipeline
import os
from pathlib import Path
```

I think it loads XLMTokenizer, not the RoBERTa one. How can I fix it? Jonathan Sung

Timoeller commented 4 years ago

It does not load the correct tokenizer.

Check out https://github.com/deepset-ai/FARM/blob/master/farm/modeling/tokenization.py#L73 — your model name is missing the "-" symbol.
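The failure mode can be illustrated with a simplified stand-in for FARM's substring-based tokenizer lookup. This is a hypothetical mock for illustration, not FARM's actual code or branch order (that lives in tokenization.py), but it reproduces the behavior reported above:

```python
def guess_tokenizer_class(model_name: str) -> str:
    """Simplified mock of string-based tokenizer selection.

    The tokenizer class is picked by substring-matching the model name,
    so a custom name like "xlmroberta-qa" (missing the hyphen) falls
    through to the wrong branch.
    """
    name = model_name.lower()
    if "xlm-roberta" in name:
        return "XLMRobertaTokenizer"
    elif "xlm" in name:          # also matches "xlmroberta..." -> wrong class
        return "XLMTokenizer"
    elif "roberta" in name:
        return "RobertaTokenizer"
    return "BertTokenizer"

# keeping "xlm-roberta" in the directory name picks the right tokenizer
print(guess_tokenizer_class("my-xlm-roberta-large-qa"))  # XLMRobertaTokenizer
print(guess_tokenizer_class("xlmroberta-qa"))            # XLMTokenizer (the bug)
```

Renaming the saved model directory so it contains "xlm-roberta" sidesteps the problem.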


btw, is it an xlm-roberta large model that you are training on 4x V100 GPUs?

laifuchicago commented 4 years ago

To Author: This is a model that I trained on my own data, so I changed the name. Jonathan Sung

Timoeller commented 4 years ago

Agreed :) This very strict string-based model loading is not optimal and should be changed towards the auto functionality present in transformers.

Did the conversion work with "xlm-roberta" present in your model name string?

laifuchicago commented 4 years ago

To Author: After I converted my model to ONNX, how can I use Haystack with Elasticsearch to do the inference? Or does Haystack currently not support this?

Jonathan Sung

laifuchicago commented 4 years ago

> Hey @laifuchicago how about converting the model to transformers and using their onnx conversion notebook?
>
> In transformers the conversion works for xlm-roberta. If it is larger than 2GB you have to set use_external_format=True and be sure to update to pytorch 1.6.0
>
> I am not sure about the solution you proposed. I see you also use "token_type_ids" which are not used in xlm-roberta. So it most likely will not work.

But in HuggingFace, why do they still have token type IDs? [screenshot: ask deepset6] Jonathan Sung

Timoeller commented 4 years ago

In general, roberta models are not trained on the NSP task and therefore do not need token type IDs. When you run the transformers conversion from PyTorch to ONNX, you get an output saying that token type IDs are unused for xlm-r. Do you agree, and do you also see this output?
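Since RoBERTa-family models ignore segment IDs, one pragmatic option before building the dummy inputs for an ONNX export is simply dropping them from the tokenizer output. A minimal helper (hypothetical, not part of FARM or transformers):

```python
def strip_token_type_ids(encoded: dict) -> dict:
    """Drop token_type_ids from a tokenizer's output dict.

    RoBERTa-family models (incl. XLM-R) are not trained with the NSP
    objective and ignore segment IDs; some tokenizers still emit an
    all-zero token_type_ids field for API uniformity.
    """
    return {k: v for k, v in encoded.items() if k != "token_type_ids"}

# usage with a tokenizer-style output dict
encoded = {"input_ids": [0, 5, 2], "token_type_ids": [0, 0, 0], "attention_mask": [1, 1, 1]}
print(strip_token_type_ids(encoded))  # {'input_ids': [0, 5, 2], 'attention_mask': [1, 1, 1]}
```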

If you need to know the details, please ask questions about ONNX conversion and special inputs related to transformers models in Hugging Face's transformers repository directly.

tanaysoni commented 4 years ago

Hi @laifuchicago, with #557, XLM-RoBERTa models can now be converted to ONNX format for inference.

You can refer to onnx_question_answering.py for an example.
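Based on that example, the flow looks roughly like this. `AdaptiveModel.convert_to_onnx` and `Inferencer.load` are the entry points used by FARM's ONNX QA example; the wrapper function and argument names are an approximation and may differ across FARM versions:

```python
from pathlib import Path

def run_onnx_qa(model_name: str, question: str, text: str):
    """Convert a QA model to ONNX with FARM and run inference on it.

    Imports are lazy so this sketch can be defined without FARM installed.
    """
    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.infer import Inferencer

    onnx_dir = Path("onnx-export")
    AdaptiveModel.convert_to_onnx(
        model_name=model_name,            # e.g. your converted xlm-roberta QA model
        output_path=onnx_dir,
        task_type="question_answering",
    )

    inferencer = Inferencer.load(str(onnx_dir))
    qa_input = [{"questions": [question], "text": text}]
    return inferencer.inference_from_dicts(dicts=qa_input)

if __name__ == "__main__":
    # run_onnx_qa("my-xlm-roberta-qa", "Who ...?", "Some context ...")  # heavy
    pass
```

The heavy call is commented out; it downloads/loads the model and writes the ONNX export to disk.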