Open rucha3 opened 3 years ago
@rucha3 A model trained with torch 1.5 will not work with the 1.3.1-eia image. The 1.5.1-eia container is under development and will be out around mid-January :)
@tracycxw Thanks for your response. I also tried saving it as a traced model with torch.jit in PyTorch 1.3.1, but that didn't work either. Is that expected?
@rucha3 The following links might help; can you check these first and see if anything is missing? https://aws.amazon.com/blogs/machine-learning/fine-tuning-a-pytorch-bert-model-and-deploying-it-with-amazon-elastic-inference-on-amazon-sagemaker/
https://github.com/aws-samples/amazon-sagemaker-bert-pytorch/blob/master/bert-sm-python-SDK.ipynb
I have had some trouble reproducing the error. If you still see it, would you be willing to provide more details of your setup and the code for how you train, save, and load the model? You can send the info to "amazon-ei-feedback@amazon.com".
A few questions:
1. What's your workflow? Training the model on SageMaker -> deploying it to a SageMaker endpoint using EI?
2. What's your instance type?
3. Are you training the model on SageMaker? If so, which instance type did you choose?
4. What's your EIA type?
@tracycxw Here are the details of my setup that I had:
So, to ensure that I was doing everything right, this time I followed the exact notebook that you shared: https://github.com/aws-samples/amazon-sagemaker-bert-pytorch/blob/master/bert-sm-python-SDK.ipynb. I used an ml.m5.2xlarge notebook instance with an ml.eia2.medium EIA. The only change was that instead of training and deploying on a separate instance, I trained and deployed on the local instance with a local SageMaker session. I was able to reproduce the same error: the worker dies while loading the model. It works fine when the accelerator is not used for deployment.
Here is my notebook: https://github.com/rucha3/pytorch-bert-eia-demo/blob/main/bert_example.ipynb I have retained outputs of some cells to show failure logs.
@rucha3
This might be due to a transformers version issue. There are some requirements here: https://github.com/aws-samples/amazon-sagemaker-bert-pytorch/blob/master/code/requirements.txt
Can you try `!pip install -r code/requirements.txt` at the beginning? I've tried training and deploying locally but didn't see the worker die in that case.
Hope this solves your model-loading issue.
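For illustration, pinning the library versions in `code/requirements.txt` is one way to make sure the endpoint container installs a compatible transformers release. The fragment below is a hypothetical example, not the contents of the linked repo's file:

```text
# Illustrative pins only; use the versions from the repo's requirements.txt
transformers==4.1.1
```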
Further, if you have issues with inference using the accelerator, you might want to check your torch version. For now, you need to make sure you trace the model using torch 1.3.1.
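For readers following along, the trace/save/load round trip looks roughly like this minimal sketch. It uses a tiny stand-in module rather than the actual BERT model, so it is an assumption about the workflow, not the poster's code:

```python
import torch

class TinyModel(torch.nn.Module):
    """Stand-in for the real model (e.g. a fine-tuned BERT)."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
example_input = torch.randn(1, 4)

# torch.jit.trace records the operations executed for the example input
# and produces a TorchScript module that no longer needs the Python class.
traced = torch.jit.trace(model, example_input)
torch.jit.save(traced, "model.pt")

# Later (e.g. inside model_fn on the endpoint), the traced model is loaded back:
loaded = torch.jit.load("model.pt")
```

The torch version used for tracing matters because the serialized TorchScript must be loadable by the torch runtime inside the serving container.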
@tracycxw Thanks a lot, using the latest transformers version worked! Initially I was using version 2.11.0 because that's what the model was trained with. What finally worked was using the latest transformers version, 4.1.1, for tracing as well as in my inference script inside the container. I didn't have to trace it with PyTorch 1.3.1 either; 1.5.0 worked just fine. Thanks again.
I have a custom Python file for inference in which I have implemented the functions `model_fn`, `input_fn`, `predict_fn`, and `output_fn`. I have saved the model as TorchScript using `torch.jit.trace` and `torch.jit.save`, and my `model_fn` loads it using `torch.jit.load`. This implementation works perfectly for the container with PyTorch 1.5. But for the container with torch 1.3.1, it exits abruptly when loading the pretrained model, without any useful logs: the worker dies and tries to restart, and the process repeats until I stop the container.
The model I am using is trained with PyTorch 1.5, but since EI support only goes up to 1.3.1, I am using the 1.3.1-eia container.
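For context, a minimal sketch of what such SageMaker inference handlers can look like. This is a hypothetical reconstruction, not the poster's actual code; the `model.pt` filename and JSON content type are assumptions. The torch import is deliberately deferred into `model_fn` so the JSON handlers can be exercised without a PyTorch installation:

```python
import json
import os

def model_fn(model_dir):
    # Hypothetical: load a TorchScript model previously saved with torch.jit.save.
    import torch  # deferred import; only needed when the model is actually loaded
    model = torch.jit.load(os.path.join(model_dir, "model.pt"))
    model.eval()
    return model

def input_fn(request_body, content_type):
    # Deserialize a JSON request body into a plain dict.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(request_body)

def output_fn(prediction, accept):
    # Serialize the prediction back to JSON for the response.
    return json.dumps(prediction)
```

A `predict_fn(data, model)` would sit between these two, but its body depends entirely on how the model was traced, so it is omitted here.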
Things I have tried:
- Setting the `debug` and `notset` levels for logs. This didn't give any more info as to why model loading fails.
- `PyTorchModel`'s `deploy()` function with `framework_version` set to 1.3.1. I also tried it using the 1.3.1 container without EIA. The behaviour is the same everywhere.

Am I doing something wrong or missing something crucial from the documentation? Any help would be much appreciated.
Logs for the container with torch 1.3.1-eia: