Error in making prediction on CPU after training the model on GPU

deo4kyo commented 5 years ago

Hi, I trained the model on GPU according to tutorial.

reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=256,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                verbose_logging=True,
                output_dir='./temp')

reader.fit(X=(train_examples, train_features))

Before dumping the model, I sent it to the CPU.

reader.model.to('cpu')
reader.device = torch.device('cpu')

But when I try to make a prediction on CPU, the following error occurs:

query = 'some sample query...'
prediction = cdqa_pipeline.predict(X=query)

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-79-c881b3585457> in <module>
      1 query = 'some sample query...'
----> 2 prediction = cdqa_pipeline.predict(X=query)

~/anaconda3/lib/python3.7/site-packages/cdqa/pipeline/cdqa_sklearn.py in predict(self, X, return_logit)
    158                                                      metadata=self.metadata)
    159             examples, features = self.processor_predict.fit_transform(X=squad_examples)
--> 160             prediction = self.reader.predict((examples, features), return_logit)
    161             return prediction
    162 

~/anaconda3/lib/python3.7/site-packages/cdqa/reader/bertqa_sklearn.py in predict(self, X, return_logit)
   1220             with torch.no_grad():
   1221                 batch_start_logits, batch_end_logits = self.model(
-> 1222                     input_ids, segment_ids, input_mask)
   1223             for i, example_index in enumerate(example_indices):
   1224                 start_logits = batch_start_logits[i].detach().cpu().tolist()

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    491             result = self._slow_forward(*input, **kwargs)
    492         else:
--> 493             result = self.forward(*input, **kwargs)
    494         for hook in self._forward_hooks.values():
    495             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    144                 raise RuntimeError("module must have its parameters and buffers "
    145                                    "on device {} (device_ids[0]) but found one of "
--> 146                                    "them on device: {}".format(self.src_device_obj, t.device))
    147 
    148         inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

Is there something else I need to do?

andrelmfarias commented 5 years ago

Hi @deo4kyo ,

After sending the model to CPU, did you dump it using joblib.dump and then load the dumped model into the QAPipeline?

Can you please paste here the whole code?

Thanks

deo4kyo commented 5 years ago

Thanks, @andrelmfarias.

I followed this tutorial for training:

reader = BertQA(bert_model='bert-base-multilingual-cased',
                train_batch_size=256,
                learning_rate=3e-5,
                num_train_epochs=2,
                do_lower_case=False,
                verbose_logging=True,
                output_dir='./temp')

reader.fit(X=(train_examples, train_features))

reader.model.to('cpu')
reader.device = torch.device('cpu')

joblib.dump(reader, os.path.join(reader.output_dir, 'bert_qa_vCPU.joblib'))

I downloaded the fine-tuned model to my local machine and ran inference like this:

from cdqa.utils.download import download_model, download_bnpp_data

download_bnpp_data(dir='./data/bnpp_newsroom_v1.1/')
download_model(model='bert-squad_1.1', dir='./models')

df = pd.read_csv('./data/bnpp_newsroom_v1.1/bnpp_newsroom-v1.1.csv', converters={'paragraphs': literal_eval})
df = filter_paragraphs(df)
df.head()

cdqa_pipeline = QAPipeline(reader='./models/bert_qa_vCPU.joblib')
cdqa_pipeline.fit_retriever(X=df)

query = 'How long BNP Paribas and IBM Services have a partnership?'
prediction = cdqa_pipeline.predict(X=query)  # <-- the error occurs here

... RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

But it worked fine when I ran this code:

cdqa_pipeline.cuda()

query = 'How long BNP Paribas and IBM Services have a partnership?'
prediction = cdqa_pipeline.predict(X=query)

print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))

query: How long BNP Paribas and IBM Services have a partnership?
answer: 8 years
title: BNP Paribas Signs an Agreement with IBM Services to further deploy its Cloud Strategy
paragraph: In this context, BNP Paribas and IBM Services have announced today the renewal, for a duration of 8 years, of their partnership which has enabled to create in 2003 the IT services company BNP Paribas Partners for Innovation (BP2I), a joint venture held equally by BNP Paribas and IBM. This agreement will allow BNP Paribas to continue to deploy its Cloud approach thanks to IBM Services solutions.

What is the problem...?

deo4kyo commented 5 years ago

I found something strange.

It works well with the model downloaded using the download_model function from cdqa.utils.download (bert_qa_vCPU-sklearn.joblib).

However, it doesn't work with the models I've trained by following the tutorials provided here.

So I ran some tests:

from sklearn.externals import joblib
model1 = joblib.load('./models/bert_qa_vCPU-sklearn.joblib')  # downloaded model
model2 = joblib.load('./models/bert_my_qa_vCPU-sklean.joblib')  # my fine-tuned model

type(model1.model), type(model2.model)

(pytorch_pretrained_bert.modeling.BertForQuestionAnswering, torch.nn.parallel.data_parallel.DataParallel)

The two types are different, so I added the line of code below.

cdqa_pipeline.reader.model = cdqa_pipeline.reader.model.module

It works fine now. If you don't mind, could you explain why?

andrelmfarias commented 5 years ago

Maybe it's related to the latest updates by Hugging Face...

I will investigate further and make any necessary changes.

andrelmfarias commented 5 years ago

I just tried training a new model, and when I print type(model.model) I get

pytorch_pretrained_bert.modeling.BertForQuestionAnswering

Not torch.nn.parallel.data_parallel.DataParallel...

Did you train the model with multiple GPUs with distributed training?

Thanks

deo4kyo commented 5 years ago

Hi, andrelmfarias. I used 8 GPUs (distributed training).

But now I understand what was wrong. Thanks for your help, andrelmfarias.
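
For anyone landing here later, a minimal sketch of the workaround from this thread. torch.nn.DataParallel keeps the wrapped model in its .module attribute, and the error above comes from the wrapper still expecting its parameters on cuda:0, so unwrapping before use (or before dumping) avoids it. The two classes and the unwrap_model helper below are hypothetical stand-ins written for illustration so the snippet runs without PyTorch or a GPU; they are not part of cdQA or PyTorch.

```python
# Sketch only: both classes are hypothetical stand-ins, not cdQA or PyTorch code.
class FakeBertForQA:
    """Stands in for the bare model (e.g. BertForQuestionAnswering)."""


class FakeDataParallel:
    """Stands in for torch.nn.DataParallel, which stores the wrapped model in `.module`."""

    def __init__(self, module):
        self.module = module


def unwrap_model(model):
    """Return the underlying model if `model` is a DataParallel-style wrapper."""
    # A bare model has no `.module` attribute, so it is returned unchanged.
    return model.module if hasattr(model, "module") else model


bare = FakeBertForQA()
wrapped = FakeDataParallel(bare)

print(type(unwrap_model(wrapped)).__name__)  # FakeBertForQA
print(unwrap_model(bare) is bare)            # True
```

With a real reader, the same idea would presumably be reader.model = unwrap_model(reader.model), followed by reader.model.to('cpu') and joblib.dump(...), so that the saved object is the plain model rather than the multi-GPU wrapper. I have only checked the stand-in version above, so treat that as an assumption.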

cdqa-suite / cdQA

Error in making prediction on CPU after training the model on GPU #238