Open · deliahu opened this issue 4 years ago
Motivation
- Reduce latency when multiple requests are required
- Stream output from the predictor as it's generated
When will this feature become available?
@da-source we haven't scheduled this one yet; we usually plan about two weeks at a time.
Would it be possible to change your API implementation so that you can make a single HTTP request to the API (or multiple distinct requests if necessary), rather than relying on streaming the results?
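To illustrate what that could look like client-side, here is a minimal sketch of the multiple-requests approach; the endpoint URL and the request/response fields are placeholders, not Cortex's actual API:

```python
# A minimal client-side sketch of the "multiple distinct requests"
# workaround: each call returns a short continuation, and the
# accumulated text is fed back in as the next prompt.
import requests

ENDPOINT = "https://example.com/text-generator"  # placeholder, not a real API

def generate_in_chunks(prompt, n_chunks=5):
    text = prompt
    for _ in range(n_chunks):
        resp = requests.post(ENDPOINT, json={"text": text})
        resp.raise_for_status()
        chunk = resp.json()["generated"]  # hypothetical response field
        yield chunk
        text += chunk  # feed the accumulated text back for the next chunk

# The client sees partial output after every request instead of
# waiting for one long-running call to finish.
for chunk in generate_in_chunks("Once upon a time"):
    print(chunk, end="", flush=True)
```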
I would like to deploy a large fine-tuned GPT-2 model. Since it is so large, it takes a while to get the whole output, and I would like to stream partial outputs instead of waiting for the whole thing. Something like AI Dungeon 2.
I'm using a *compressed* large GPT-2 model: https://bellard.org/nncp/gpt2tc.html
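To make the goal concrete, here is a minimal sketch of the kind of token-by-token streaming I'm after, written against Hugging Face transformers rather than gpt2tc (so the loading calls are illustrative, not my actual setup):

```python
# A minimal sketch of streaming partial GPT-2 output token by token
# (greedy decoding with Hugging Face transformers v4+; illustrative
# only -- not gpt2tc and not Cortex's API).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def stream_tokens(prompt, max_new_tokens=50):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        yield tokenizer.decode(next_id[0])  # emit each token as it's chosen

# Each yielded piece could be pushed over a websocket or a chunked
# HTTP response instead of printed.
for piece in stream_tokens("The dungeon door creaks open and"):
    print(piece, end="", flush=True)
```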
Hi! Are there any updates on when this will be coming out?
@mutal we haven't come up with a timeline for it yet. We'll keep this ticket updated as we go along. Is this urgent for you? And to reiterate what @deliahu has mentioned before, we usually plan about two weeks at a time.
I was hoping to implement a project with scalable infrastructure and websockets this month, so it would be nice if you could add this feature as soon as possible.
It would be very helpful for me if this feature became available. When you say two weeks at a time, does that mean you plan to add it the week after next?
@mutal @da-source It appears that you have some urgency with regards to this feature.
Unfortunately, this feature is not a priority for Cortex for the next few weeks.
If I were in your position and wanted to ship something in the next month or so, I would try the workaround suggested here to use Cortex for your project.
Feel free to watch for notifications on this ticket. When the team has decided to prioritize this ticket, it will be moved from the *to prioritize* column to the *current sprint* column. If it remains in the *to prioritize* column, it means that the team has decided that other features are a higher priority than this feature.
The workaround that you have suggested doesn't work for me, because it means restarting the process (something I'm trying to avoid) on each call. In the meantime, I'll try to find a way to create a websocket server on the Cortex instances myself. It shouldn't be too hard. Based on this, I'll have to replace `localhost` with the IP of Cortex's AWS instance. Any ideas on how to get the IP of the instances which Cortex spins up?
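Here is the direction I'm thinking of for looking up the IPs, assuming the cluster's EC2 instances carry an identifying tag (the tag key and value in this sketch are guesses on my part; check the actual tags in the EC2 console first):

```python
# A hedged sketch for listing a Cortex cluster's instance IPs with
# boto3. Cortex clusters run on EKS, so the nodes are likely tagged
# by eksctl -- but the exact tag key/value below are assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # your cluster's region

resp = ec2.describe_instances(
    Filters=[
        # hypothetical tag filter -- replace with your cluster's real tags
        {"Name": "tag:alpha.eksctl.io/cluster-name", "Values": ["cortex"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

# Prefer the public IP when the instance has one, else the private IP.
ips = [
    inst.get("PublicIpAddress") or inst["PrivateIpAddress"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]
print(ips)
```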
+1