Closed. codecraf8 closed this issue 4 years ago.
@codecraf8 thanks for reaching out! Yes, I agree; in fact we just recently decided that we should update our tutorial to use Hugging Face's GPT-2 (https://github.com/cortexlabs/cortex/issues/1256). We should have that ready for the next release, which is expected to come late next week.
In the meantime, the existing tutorial still gives a good sense for how to use Cortex, and we have a few examples which use transformers that should serve as a good starting point:
In addition, on the master branch, we have pytorch/text-generator, which is what we'll use for the tutorial. The example should be able to run on v0.18 with a minor modification (removing `kind: SyncAPI` from `cortex.yaml`).
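For a sense of what the model code in such an example boils down to, here is a minimal sketch of GPT-2 text generation with Hugging Face transformers. This is only an illustration, not the actual code from the pytorch/text-generator example, and it assumes the `transformers` and `torch` packages are installed:

```python
# Minimal sketch of GPT-2 text generation with Hugging Face transformers.
# Not the pytorch/text-generator example itself, just the core idea.
from transformers import pipeline

# Downloads the GPT-2 weights from the Hugging Face hub on first use.
generator = pipeline("text-generation", model="gpt2")

def predict(prompt: str) -> str:
    # Generate a single continuation of up to 50 tokens (prompt included).
    output = generator(prompt, max_length=50, num_return_sequences=1)
    return output[0]["generated_text"]

if __name__ == "__main__":
    print(predict("machine learning is"))
```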
Let us know if you have any questions!
sounds good, thanks!
@deliahu I am unable to find these examples any more. Is there a tutorial on configuring/deploying a Cortex predictor for Hugging Face transformers to an AWS GPU instance for inference?
@g-karthik you can find an example of deploying a Hugging Face transformer to a GPU instance here: https://github.com/cortexlabs/cortex/tree/master/test/apis/realtime/text-generator.
Since you mentioned the predictor, I wanted to bring to your attention that the latest version of Cortex (v0.36) expects your application to be packaged in a Docker container, as opposed to a Python project with a `predictor.py`. The example linked above complies with the latest version of Cortex. You can find more documentation at https://docs.cortex.dev/.
@vishalbollu thanks for sharing this! I was trying to run some existing code based on Cortex 0.25.0 with its own predictor implemented, and I ran into an error like `supervisor not listening`, which we weren't facing before.
I'm going to try setting up your example above with the latest Cortex. Looking further though, I would need to package specific portions of a private repo I'm working with into the app in the first URL you linked. Unlike in your self-contained `main.py`, some of my dependencies will need to come from local code paths in my private repo, not just from a `requirements.txt`. I'll try to get this working and reach out if I face any roadblocks.
@vishalbollu So I was able to get your example working! I have the following customizations to make now and wanted your help with:

- my `main.py` should be able to read in a configuration file that primarily specifies which model class (GPT-2, BERT, etc.) to load and the path to the model checkpoint in S3 to pull the checkpoint from during API startup, and secondarily specifies some inference hyper-parameters
- my Dockerfile should be common to all APIs I want to deploy (I will want to deploy multiple APIs for multiple checkpoints, as you can see above, but use the exact same code) -- in that sense, at the very least, uvicorn should not be tied to / executed within my Dockerfile itself

How do I go about making this happen? Suppose I want two APIs, one for GPT-2 small and one for GPT-2 XL, and both their checkpoints are stored in my S3 bucket.

It's great to hear that you've got the example working. You can use the `env` field to pass environment variables into your container. You can specify the path to the model in S3 (`S3_MODEL_PATH`), `MODEL_TYPE`, `HYPER_PARAMETER_1`, etc. in your API spec. These variables can be read in your `main.py`, and your behaviour can change accordingly.
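To make that concrete, here is a rough sketch of what an environment-variable-driven `main.py` could look like. It is only an illustration under a few assumptions: the variable names (`S3_MODEL_PATH`, `MODEL_TYPE`, `HYPER_PARAMETER_1`) are the ones mentioned above rather than anything Cortex defines, the app is served with FastAPI and uvicorn as suggested by the discussion above, and `download_checkpoint` is a made-up helper:

```python
# Hypothetical main.py sketch: one container image shared by many APIs,
# each configured purely through environment variables set in its API spec.
# The variable names below mirror the ones discussed above; they are
# assumptions for this sketch, not an official Cortex contract.
import os

import boto3
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

S3_MODEL_PATH = os.environ["S3_MODEL_PATH"]        # e.g. s3://my-bucket/gpt2-xl/
MODEL_TYPE = os.environ.get("MODEL_TYPE", "gpt2")  # could branch between model classes; elided here
MAX_NEW_TOKENS = int(os.environ.get("HYPER_PARAMETER_1", "50"))

LOCAL_DIR = "/tmp/model"


def download_checkpoint(s3_path: str, local_dir: str) -> None:
    """Download every object under an s3://bucket/prefix path to a local directory."""
    bucket, _, prefix = s3_path.replace("s3://", "", 1).partition("/")
    s3 = boto3.resource("s3")
    for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
        if obj.key.endswith("/"):  # skip "directory" placeholder keys
            continue
        target = os.path.join(local_dir, os.path.relpath(obj.key, prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        s3.Bucket(bucket).download_file(obj.key, target)


# Runs once at container startup, so the model is ready before traffic arrives.
download_checkpoint(S3_MODEL_PATH, LOCAL_DIR)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_DIR)
model = AutoModelForCausalLM.from_pretrained(LOCAL_DIR)
# On a GPU instance you would typically also move the model: model.to("cuda")

app = FastAPI()


class Prompt(BaseModel):
    text: str


@app.post("/")
def generate(prompt: Prompt) -> dict:
    inputs = tokenizer(prompt.text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
    return {"generated_text": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```

With a layout like this, the GPT-2 small and GPT-2 XL APIs can share the exact same image and `main.py`; only the `env` values in their respective API specs differ.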
@vishalbollu I was able to set it up successfully!
However, the API is really unstable and keeps failing with a `no healthy upstream` error.
When I run `cortex get`:
| env | realtime api | status | up-to-date | requested | last update | avg request | 2XX | 4XX | 5XX |
|---|---|---|---|---|---|---|---|---|---|
| cortex | response-generator | live | 1 | 1 | 3d15h | 959.402 ms | 509 | 2 | 194 |
| cortex | response-generator-2 | updating | 0 | 1 | 19h27m | 959.402 ms | 509 | 2 | 194 |
There's a whole bunch of 5xx errors from testing. How do I resolve this? I need this API to be exceptionally stable.
There could be a few reasons why you may be getting 500s and `no healthy upstream`.
At this point it seems to be an implementation issue and not a Cortex issue. Perhaps your container is crashing and/or it hasn't been given enough resources.
I would begin by investigating the logs for your API (https://docs.cortex.dev/clusters/observability/logging). You may also find useful information in https://docs.cortex.dev/workloads/realtime/troubleshooting.
@vishalbollu I'd already seen the logs; there wasn't anything useful in there. It isn't a code implementation error; the code is well-tested for all edge cases.
It seems that the issue has to do with the fact that I was earlier trying GPT-2 XL on g4 instances. I'd likely need to upgrade to something better than g4 for GPT-2 XL. Do you have any recommendations for inference instance types (supported by cortex) with models like GPT-2 XL (1.5 billion), T5 (11 billion), etc.?
I tried deploying GPT-2 small on g4 and load tested it, it does seem to be failing with 5xx whenever the API is "updating" (to fetch more resources to account for my request load) and runs fine with 2xx when the API is "live". Is this expected behavior? Shouldn't an API in "updating" state never fail with 5xx, i.e., use the existing resources while more is being provisioned by the auto-scaling group?
> It seems that the issue has to do with the fact that I was earlier trying GPT-2 XL on g4 instances. I'd likely need to upgrade to something better than g4 for GPT-2 XL. Do you have any recommendations for inference instance types (supported by cortex) with models like GPT-2 XL (1.5 billion), T5 (11 billion), etc.?
Here is a blog https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86 describing the GPU instance types available on AWS. Cortex should be able to support all of them.
> I tried deploying GPT-2 small on g4 and load tested it, it does seem to be failing with 5xx whenever the API is "updating" (to fetch more resources to account for my request load) and runs fine with 2xx when the API is "live". Is this expected behavior? Shouldn't an API in "updating" state never fail with 5xx, i.e., use the existing resources while more is being provisioned by the auto-scaling group?
When you initially deploy the API, it will automatically be in an "updating" state because the "requested" number of workers does not match the "up-to-date" number of workers. While the status is "updating" during the initial deployment, the API will return 503s because no workers have been provisioned for it yet. Once the API becomes live, your requests will be satisfied. The API will continue to return 2XX during any subsequent transitions to the "updating" state, because requests will still be routed to the previous version of the replica until the new version is ready.
It is understandable that the "updating" status can be confusing. Thanks for bringing this to our attention; improving the status reporting is a top priority.
@vishalbollu also, how can I ensure that there are at least N (configurable) instances running at any given time for a real-time API endpoint?
I tried scaling up my cluster for that API by running `cortex cluster scale --node-group ng-gpu --min-instances 10 --max-instances 10`. However, I do not see the API getting more nodes when I run `cortex get`. It seems that the API will only "update" to request more nodes if I bombard the API with a lot of traffic via locust-based load testing.
Can you please point me to the relevant doc for this? I basically want to ensure that even if 1 container crashes for whatever reason, there is another container ready upstream to take the traffic coming in.
It looks like there might be some confusion between workers and node groups. Workers represent the number of copies of your web server running on the Cortex cluster, and node groups represent the EC2 instances the cluster is using to run your workers. They are different because you can schedule multiple workers onto a single EC2 instance. For example, if you run a p2.8xlarge instance, which has 8 GPUs, you can schedule 8 workers, each requiring 1 GPU, onto a single instance. Cortex automatically provisions EC2 instances based on the min/max settings of your node groups and the resource requirements of your API's workers. The node group min/max settings are typically used to reduce cold starts by eliminating EC2 instance provisioning time and to set an upper bound on spend.
To increase the minimum number of workers, you need to scale the API, which is why you were able to get more workers when you ran a load test. To do it manually, you can change `autoscaling: min_replicas` and `autoscaling: max_replicas` in your API configuration (https://docs.cortex.dev/workloads/realtime/configuration; double-check that the docs version matches your Cortex version) and run `cortex deploy`.
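For reference, a load test like the one mentioned above can be driven with locust. This is only a hypothetical sketch: it assumes the API accepts a JSON POST at the root path (adjust the path and payload to match your `main.py`), and the host is whatever endpoint `cortex get <api-name>` reports:

```python
# locustfile.py: a minimal load test against the deployed realtime API.
# Run with: locust -f locustfile.py --host https://<your-api-endpoint>
from locust import HttpUser, between, task


class ResponseGeneratorUser(HttpUser):
    # Each simulated user waits 0.5-2 seconds between requests.
    wait_time = between(0.5, 2)

    @task
    def generate(self):
        # The path and payload are assumptions; match them to your API.
        self.client.post("/", json={"text": "machine learning is"})
```

Watching the 2XX/5XX columns of `cortex get` while the test ramps up is a quick way to see how the API behaves as replicas scale.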
Description
Please add documentation on Hugging Face transformers.
Motivation
HF Transformers offer a state-of-the-art selection of models. A guide on how to use them directly is essential for any DL practitioner.