ankitsmt211 closed this 6 months ago
Yeah, we don't want to restrict and limit the model from every side, rendering it useless, starving GPT of oxygen until it coughs up a few sentences for us and dies. We want to just gently steer it at its full power.
If it produces a thorough, well-structured guide that explains every step clearly, with code examples, that's awesome!
We should obviously optimize it; the simplest answers don't need those bloated responses. But quality should be our priority first, and optimizing for UI/UX second.
These were the tests I used to benchmark and optimize responses.
```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import java.util.Optional;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class ChatGptServiceTest {
    private static final Logger logger = LoggerFactory.getLogger(ChatGptServiceTest.class);

    private Config config;
    private ChatGptService chatGptService;

    @BeforeEach
    void setUp() {
        config = mock(Config.class);
        when(config.getOpenaiApiKey()).thenReturn("your-api-key");
        chatGptService = new ChatGptService(config);
    }

    @Test
    void askToGenerateLongPoem() {
        Optional<String> response = chatGptService.ask("generate a very long poem");
        response.ifPresent(logger::warn);
    }

    @Test
    void askHowToSetupJacksonLibraryWithExamples() {
        Optional<String> response = chatGptService.ask("How to setup Jackson library with examples");
        response.ifPresent(logger::warn);
    }

    @Test
    void askDockerReverseProxyWithNginxGuide() {
        Optional<String> response = chatGptService.ask("Docker reverse proxy with nginx guide");
        response.ifPresent(logger::warn);
    }

    @Test
    void askWhyDoesItTakeYouMoreThan10SecondsToAnswer() {
        Optional<String> response = chatGptService.ask(
                "Working example of Command pattern in java, with all the classes required, explained in detail. Bonus points for UML diagrams.");
        response.ifPresent(logger::warn);
    }
}
```
Can you run these and post how long they took, along with the results? Just curious how it would all look with the current UI/UX. (Since this tests the service directly, it's best to just ask the bot these questions.) Also curious whether a user would have to wait two minutes for an answer, and whether that would look unintuitive/unfriendly for the user, because it's not properly communicated what is happening.
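A minimal, self-contained way to capture those timings (the `Supplier` stub in `main` stands in for `chatGptService.ask`, which isn't available here; this is a sketch, not part of the PR):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.function.Supplier;

public class Timed {
    // Pairs a call's result with the wall-clock time it took.
    record Result<T>(T value, Duration elapsed) {}

    static <T> Result<T> time(Supplier<T> call) {
        Instant start = Instant.now();
        T value = call.get();
        return new Result<>(value, Duration.between(start, Instant.now()));
    }

    public static void main(String[] args) {
        // Stub standing in for chatGptService.ask(...) so the sketch runs on its own.
        Result<Optional<String>> r = time(() -> Optional.of("stubbed response"));
        System.out.println("took " + r.elapsed().toMillis() + " ms");
    }
}
```

Wrapping each `ask` call like this would let the timings be logged right next to the responses in the test output.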
Regarding the added context based on the #question channel, so GPT knows it's Java: I'm curious whether it would backfire in other categories (for whatever reason), especially in the 'other' category.
Because of that 'on a Java Q&A discord server' phrase, what happens if someone asks a question and writes 'answer in python'? Or what if the question is obviously Python, because there is Python code attached, and GPT tries to rewrite it as Java or bastardizes it? What if the mentioned libraries and frameworks are clearly from the Python ecosystem; would it answer within that context, or would it try to Javthon it?
Make sure to test some edge cases in different categories, and use some previous real-world failures from #questions in your test suite. Also include some successful answers by GPT, to check whether you notice any regressions. Just to be sure that this new prompt is objectively better, and that it won't accidentally make other aspects worse. :relaxed:
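To probe that Python-context risk systematically, a tiny heuristic like the one below (purely hypothetical, not anything the bot currently does) could flag questions that clearly target Python before the Java context hint is applied:

```java
public class ContextGuard {
    // Hypothetical markers of a Python-targeted question; these are guesses
    // for a test harness, not the bot's actual logic.
    static boolean looksLikePython(String question) {
        String q = question.toLowerCase();
        return q.contains("```python")
                || q.contains("answer in python")
                || q.contains("pip install")
                || q.matches("(?s).*\\bdef \\w+\\(.*");
    }

    public static void main(String[] args) {
        System.out.println(looksLikePython("Fix this: ```python\nimport pandas```")); // true
        System.out.println(looksLikePython("How to setup Jackson library with examples")); // false
    }
}
```

Even if such a check never ships, the same markers make good inputs for the edge-case tests above.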
Length can be improved with a good prompt, but it's not super consistent; I will get back to this.
The response is not really compact. It needs a bit of playing with different prompt lengths, and I don't really feel like doing that atm. I'm going to undo the length-related changes and only keep the context-related changes, because the earlier length seems better than what I did here.
These tests were only run a couple of times, but the results seem to be considerably better than the original.
Character count based on the tests given by marko:
- with the new prompt (3k token limit)
- with the new prompt plus the temperature changes from @surajkumar (2k token limit)
- with the new prompt plus the temperature changes from @surajkumar (3k token limit): shorter responses, and the context is pretty solid.
- with the earlier one (3k token limit): relatively longer responses, and the context totally depends on the user's question.
Can you add this to your PR please:
```java
/** The maximum number of tokens allowed for the generated answer. */
private static final int MAX_TOKENS = 2_000;

/**
 * This parameter reduces the likelihood of the AI repeating itself. A higher frequency penalty
 * makes the model less likely to repeat the same lines verbatim. It helps in generating more
 * diverse and varied responses.
 */
private static final double FREQUENCY_PENALTY = 0.5;

/**
 * This parameter controls the randomness of the AI's responses. A higher temperature results in
 * more varied, unpredictable, and creative responses. Conversely, a lower temperature makes the
 * model's responses more deterministic and conservative.
 */
private static final double TEMPERATURE = 0.8;

/**
 * This parameter ("n") specifies the number of responses to generate for each prompt. If n is
 * more than 1, the AI will generate multiple different responses to the same prompt, each one
 * being a separate iteration based on the input.
 */
private static final int MAX_NUMBER_OF_RESPONSES = 1;
```
Keen eyes will notice some changes to the values.
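For reference, these constants correspond to the raw Chat Completions request fields like so (a sketch of the JSON body using the official API field names, not the wrapper library's builder calls):

```java
import java.util.Map;

public class RequestParams {
    static final int MAX_TOKENS = 2_000;
    static final double FREQUENCY_PENALTY = 0.5;
    static final double TEMPERATURE = 0.8;
    static final int MAX_NUMBER_OF_RESPONSES = 1;

    // Maps each documented constant to the OpenAI Chat Completions
    // parameter it controls.
    static Map<String, Object> asRequestBody() {
        return Map.of(
                "max_tokens", MAX_TOKENS,
                "frequency_penalty", FREQUENCY_PENALTY,
                "temperature", TEMPERATURE,
                "n", MAX_NUMBER_OF_RESPONSES);
    }

    public static void main(String[] args) {
        System.out.println(asRequestBody().get("temperature")); // 0.8
    }
}
```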
Token, freq, and temperature are already set in the code. Do you want me to give them separate var names?
Yeah, only because there are no Javadocs on the openai lib and looking them up is a bother imo. Doing this also removes the whole "magic number" aspect, but mostly it's for the docs. I was gonna do it in another PR, but since you're already here...
I also upped the TEMPERATURE; I think that might be interesting.
Merging on the basis of one review approving the changes and more than 7 days of inactivity afterwards. Thanks :heart:
resolves #920
Note: There's no way to generate shorter responses. I could get them really short using BRIEF as a keyword, but that's very, very short. Imo the char limit shouldn't be a priority; we can always paginate the response in embeds. If we cross the 2k char limit atm, the AIResponseParser class will automatically cut the response into multiple short messages, as mentioned in #928. Reducing MAX_TOKENS would just lead to lost responses at times. Bottom line: when implementing embeds for the rare responses that go over the 4k limit, we can either paginate or run GPT again on the generated response to drop more filler, but otherwise most responses should fall well under the 4k limit.
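The chunking idea can be sketched like this (a minimal standalone splitter, not the actual AIResponseParser implementation); it prefers breaking at a newline so messages are less likely to be cut mid-line:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // Splits a long response into chunks of at most `limit` characters,
    // backing up to the last newline before the limit when one exists.
    static List<String> split(String text, int limit) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + limit, text.length());
            if (end < text.length()) {
                int nl = text.lastIndexOf('\n', end);
                if (nl > i) {
                    end = nl; // break at the newline instead of mid-line
                }
            }
            chunks.add(text.substring(i, end));
            i = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 4500 chars with a 2000-char limit yields 3 messages.
        System.out.println(split("a".repeat(4500), 2000).size()); // 3
    }
}
```

With Discord's 2k message limit (or the 4k embed-description limit) passed in as `limit`, the same routine covers both cases.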