anaclumos / signalkite

🪁 notifies you based on your algo
http://hn.cho.sh
MIT License

Add zh-TW locale in translate.py #18

Closed PeterDaveHello closed 1 year ago

PeterDaveHello commented 1 year ago

It'd be great to have Traditional Chinese here, though I'm not sure if this is enough for OpenAI to handle the phrases or if it needs more prompts to make sure it can recognize the difference.

vercel[bot] commented 1 year ago

Someone is attempting to deploy a commit to a Personal Account owned by @anaclumos on Vercel.

@anaclumos first needs to authorize it.

anaclumos commented 1 year ago

Thank you for your PR! Unfortunately, the biggest reason Taiwanese is not supported is that the DeepL translator does not support it. I asked them, but so far there is no clear indication of when it will be added. I am planning to use either Google Translate or GPT-4 to handle Taiwanese plus many other languages.
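
For illustration, a minimal sketch of what that routing could look like, assuming the official `deepl` Python package; `fallback_translate` is a hypothetical placeholder for whichever engine (Google Translate, GPT-4, ...) ends up covering the gap:

```python
# Sketch: route locales DeepL cannot handle to a different engine.
import os

import deepl

translator = deepl.Translator(os.environ["DEEPL_AUTH_KEY"])


def deepl_supports(target_code: str) -> bool:
    """True if DeepL lists the target language code (e.g. 'ZH')."""
    supported = {lang.code.upper() for lang in translator.get_target_languages()}
    return target_code.upper() in supported


def fallback_translate(text: str, target_code: str) -> str:
    # Hypothetical placeholder for whichever engine (Google Translate,
    # GPT-4, ...) ends up covering languages DeepL lacks.
    raise NotImplementedError(f"no fallback engine wired up for {target_code}")


def translate(text: str, target_code: str) -> str:
    if deepl_supports(target_code):
        return translator.translate_text(text, target_lang=target_code).text
    # e.g. zh-TW / Traditional Chinese falls through to another engine
    return fallback_translate(text, target_code)
```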

PeterDaveHello commented 1 year ago

Oh, got it. I thought OpenAI and DeepL were being used at the same time. DeepL indeed only supports Simplified Chinese, so we need to wait for #12, right?

anaclumos commented 1 year ago

Certainly. However, my internal testing has shown GPT-4 to be excessively costly. At present, I am spending approximately $50 per day to cover the existing regions, and the expenses associated with GPT-4 would be several times higher than that. Therefore, we must first come up with a financially feasible way to sustain operations.

PeterDaveHello commented 1 year ago

Any chance you could share a breakdown of the costs? Just wondering which part costs the most and whether it can be optimized.

anaclumos commented 1 year ago

Translations account for around 90% of the expenses. While the summarization feature, powered by GPT-3.5, performs exceptionally well for English, it tends to produce inaccurate results when dealing with other languages, especially non-Latin ones. This is particularly challenging for Korean (my language) and Taiwanese (your language).

I conducted an internal benchmark test using GPT-4, but the costs remained unchanged. As someone who holds Taiwan dear, I have considered adding individual Taiwanese support. However, my ultimate goal is to have a finely tuned LLM for translation that can run locally and eliminate the cost.
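
As a rough sketch of that direction (not the model actually planned for Heimdall), here is how an open translation model could run locally with Hugging Face `transformers`; the NLLB checkpoint and language codes are assumptions for illustration:

```python
# Sketch: local translation with an open model, so per-request API costs
# disappear. The checkpoint and language codes below are illustrative;
# the actual fine-tuned model is still to be decided.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="zho_Hant",  # Traditional Chinese in NLLB's code scheme
)

result = translator("Show HN: a tiny self-hosted RSS reader", max_length=200)
print(result[0]["translation_text"])
```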

PeterDaveHello commented 1 year ago

Thanks for sharing! The CJK issue is always challenging 😅 I wasn't aware that the cost of translation would be so significant here, though!

anaclumos commented 1 year ago

@PeterDaveHello Hey. Do you happen to know the ISO 639-3 code for Taiwanese?

anaclumos commented 1 year ago

@PeterDaveHello Seems like Min Nan Chinese or Hokkien "nan". Can you verify this?

anaclumos commented 1 year ago

Ref: Project Linguine @ Sunghyun Cho

anaclumos commented 1 year ago

https://github.com/anaclumos/heimdall/blob/84f75ef43d7fba2bd78c33f1a68816542a08c8d3/web/src/i18n.ts#L57

PeterDaveHello commented 1 year ago

Hmmm... that'd be a different thing. Traditional Chinese is still Chinese 😅

anaclumos commented 1 year ago

@PeterDaveHello Oh? Does zh-TW refer to multiple things? Could you kindly provide the ISO 639-3 code for the locale you want?

anaclumos commented 1 year ago

Also, how different is zh-HK from zh-TW?

PeterDaveHello commented 1 year ago

I'm not familiar with ISO 639-3, but it looks like it doesn't distinguish different variations of the Chinese language?

PeterDaveHello commented 1 year ago

The zh in zh-HK and zh-TW means Chinese, which seems to be defined in ISO 639-1. HK and TW are country codes, which seem to be defined in ISO 3166-1, and mean Hong Kong and Taiwan. Just like en-US and en-UK, they are both English, but the language can be very different in different regions.
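
To make the structure concrete, a tiny standard-library sketch that splits a tag like `zh-TW` into its ISO 639-1 language subtag and ISO 3166-1 region subtag (the lookup tables are just illustrative samples):

```python
# Sketch: split a BCP 47-style tag such as "zh-TW" into its parts.
# The language subtag comes from ISO 639-1, the region subtag from
# ISO 3166-1; the lookup dicts below are tiny illustrative samples.
LANGUAGES = {"zh": "Chinese", "en": "English"}
REGIONS = {"TW": "Taiwan", "HK": "Hong Kong", "US": "United States"}


def describe(tag: str) -> str:
    language, _, region = tag.partition("-")
    lang_name = LANGUAGES.get(language, language)
    region_name = REGIONS.get(region.upper(), region) if region else "any region"
    return f"{tag}: {lang_name} as written in {region_name}"


print(describe("zh-TW"))  # zh-TW: Chinese as written in Taiwan
print(describe("zh-HK"))  # zh-HK: Chinese as written in Hong Kong
```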

anaclumos commented 1 year ago

I see your concern. I am fairly sure that zh-TW is the Traditional Chinese mainly used in Taiwan.

anaclumos commented 1 year ago

https://cho.sh/r/C1CF90

Also, I have come up with the complete list of locales I will support, so I hope everyone is happy now☺️

PeterDaveHello commented 1 year ago

@anaclumos will zh-TW support also be listed on https://hn.cho.sh/?

anaclumos commented 1 year ago

@PeterDaveHello In the end, yes. I'm working on v2 of this service, with much more delicate language support.

anaclumos commented 1 year ago

I expect other languages to arrive by the end of July.

PeterDaveHello commented 1 year ago

Got it, thanks again!

anaclumos commented 1 year ago

Is this the correct locale that you wanted?

Linguine Engine 結合 Google Translate、Azure Translate、DeepL 翻譯 150+ 種語言,覆蓋全球 99.9% 的受眾。 (English: "The Linguine Engine combines Google Translate, Azure Translate, and DeepL to translate 150+ languages, covering 99.9% of the global audience.")

https://cho.sh/r/6AF7F7

PeterDaveHello commented 1 year ago

Looks so. It may need a longer source string to be more precise, but so far so good 👍

anaclumos commented 1 year ago

Heimdall Engine is nigh!

@PeterDaveHello Would you want to alpha-test zh-TW? If so, please let me know with the options:

PeterDaveHello commented 1 year ago

@anaclumos sure, thanks for the invitation! Daily 2 PM would be good for me, please use heimdall [at] peterhsu.tw 🤩

PeterDaveHello commented 1 year ago

@anaclumos just to let you know that I did receive that mail about 3 days ago; the title is about Rust/JVM, but I received it twice, and no other mails since.

About the result of the translation: it looks like at least the characters are all in Traditional Chinese, though some of the terms are still in Simplified Chinese form, but it's at least something. From what I know, even GPT-4 can't handle it properly with just basic prompts 😅

anaclumos commented 1 year ago

It's here. I'm working on a new website & subscription form, but you can sign up right now here: https://newsletters.cho.sh/subscription/form

PeterDaveHello commented 1 year ago

Cool! Not sure if it's intended to select all lists by default? I'm certainly not capable of speaking all of those languages, and that might waste your resources/money even more?

anaclumos commented 1 year ago

That’s because it’s a temporary form. The real one is at https://hn.cho.sh 😃

And no, it doesn't cost me more money even if you subscribe to them all! Maybe a couple more pennies? ☺️

PeterDaveHello commented 1 year ago

Oh, got it, will use the real one then!

PeterDaveHello commented 1 year ago

@anaclumos, the GPT-4 Turbo model is no longer that expensive! Is that still something you'd like to use? https://openai.com/pricing

anaclumos commented 1 year ago

@PeterDaveHello I tried to use it, but GPT-4 Turbo is rate-limited at the moment. I'll continue to use GPT-4 until the rate limit loosens. Thanks for the suggestion though!

PeterDaveHello commented 1 year ago

@anaclumos just to confirm, it's not yet applied for Traditional Chinese, right?

BTW, what's your user tier?

anaclumos commented 1 year ago

@PeterDaveHello I've supported Traditional Chinese since July! https://hn.cho.sh/zh-Hant/

My user tier is 4, I think.

PeterDaveHello commented 1 year ago

@PeterDaveHello I've supported Traditional Chinese since July! hn.cho.sh/zh-Hant

I mean, it's not powered by GPT-4 though, is it?

PeterDaveHello commented 1 year ago

My user tier is 4, I think.

Tier 4 looks good 👍 because there's no TPD, only TPM, and it's basically the highest tier for GPT-series models:

https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-four

Looking forward to the increased limits :)

  • The models gpt-4-1106-preview and gpt-4-vision-preview are currently under preview with restrictive rate limits that make them suitable for testing and evaluations, but not for production usage. We plan to increase these limits gradually in the coming weeks with an intention to match current gpt-4 rate limits once the models graduate from preview. As these models are adopted for production workloads we expect latency to increase modestly compared to this preview phase.

anaclumos commented 1 year ago

@PeterDaveHello I've supported Traditional Chinese since July! hn.cho.sh/zh-Hant

I mean, it's not powered by GPT-4 though, is it?

No, none of the translations are GPT-4. They're either DeepL or Bing. GPT-4 Turbo Preview currently has a rate limit of 20 requests per minute and 100 requests per day, which is way too little for any production-level app.

Maybe I'll look into switching to GPT-4 Turbo soon, but even that is still financially unsustainable. I should either find a long-term sponsor or add a paid tier, because I am burning almost $1K a month.
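
For reference, a minimal sketch of working within those limits by backing off on rate-limit errors, assuming the official `openai` Python SDK; the model name and retry settings are illustrative only:

```python
# Sketch: retry with exponential backoff when the preview model's tight
# rate limits are hit.
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize(prompt: str, retries: int = 5) -> str:
    delay = 2.0
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off: 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```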

PeterDaveHello commented 1 year ago

Is GPT-4 still the most costly part of the entire project? I'm curious about the current cost breakdown of the whole project and how I might be able to help. However, for the summarization and analysis part of the articles, switching to GPT-4 Turbo as soon as possible should help, as there will be a significant price difference.

PeterDaveHello commented 12 months ago

GPT-4 Turbo Preview currently has a rate limit of 20 requests per minute and 100 requests per day, which is way too little for any production-level app.

@anaclumos the TPD limit seems to be gone, and the TPM (450,000) is even higher than the original GPT-4's (300,000) now 🎉

https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-four

anaclumos commented 12 months ago

Will work on it, it's now under the name of Project Naroo.

PeterDaveHello commented 12 months ago

It seems to still be a work in progress? Looking forward to trying it when it's ready 👍

PeterDaveHello commented 6 months ago

@anaclumos would it be possible to use gpt-4o before the replacement is ready?

anaclumos commented 6 months ago

Already using it!

https://github.com/anaclumos/heimdall/commit/e2dcb57b800e6ce3e729776e46cb22df08975a7c

PeterDaveHello commented 6 months ago

oh, that's awesome 😍

PeterDaveHello commented 6 months ago

Is that change already deployed? I'm not sure whether the prompts need more fine-tuning or it's just not deployed yet; the most recent translation still looks very similar to the old ones 😅

anaclumos commented 6 months ago

Ohhhh... I get what you mean. DeepL, not GPT-4o, handles the translation part; GPT-4o only handles the initial English context generation. I do think it'd be beneficial to move the translation to GPT-4o, though.
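
A minimal sketch of what that move could look like, assuming the official `openai` SDK; the system prompt wording is an illustration, not Heimdall's actual prompt:

```python
# Sketch: replace the DeepL call with a GPT-4o chat completion for the
# translation step.
from openai import OpenAI

client = OpenAI()


def translate_to_zh_hant(english_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the user's text into Traditional Chinese as "
                    "used in Taiwan (zh-Hant-TW). Use Taiwanese terminology; "
                    "never output Simplified Chinese characters or mainland "
                    "terms."
                ),
            },
            {"role": "user", "content": english_text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content
```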

PeterDaveHello commented 6 months ago

Oh, okay, that makes sense. It's just that DeepL can't handle Traditional Chinese; the GPT-4 series may still be the most powerful tool to help with Traditional Chinese 😆

anaclumos commented 5 months ago

@PeterDaveHello Heimdall now translates using GPT-4o, starting tomorrow. Please check the quality.

PeterDaveHello commented 5 months ago

@anaclumos I checked the new translated context, and it's obviously improved.

Actually, I'm not sure if I fully understand the new approach. Is it using different prompts in different languages to generate the context in each language? From what I know, a translated prompt might not lead to a better result, while using relatively precise English prompts to (or at least to try to) control the LLM could. Also, I'd like to help improve the prompt; a single version of it might be more efficient. We could certainly polish the Traditional Chinese version very precisely, but it would also be great if all languages could benefit from it.
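
As one possible starting point, a single precise English template parameterized by the target locale might look like this; the wording and per-locale notes are only suggestions, not Heimdall's current prompts:

```python
# Sketch: one precise English prompt, parameterized by target locale, instead
# of maintaining a separately translated prompt per language. The wording and
# the per-locale notes are suggestions only.
PROMPT_TEMPLATE = (
    "You are a professional news translator. Translate the article below "
    "into {language_name} ({locale}). Preserve proper nouns, code, and URLs. "
    "{locale_notes}\n\n"
    "Article:\n{article}"
)

LOCALE_NOTES = {
    "zh-Hant-TW": (
        "Write Traditional Chinese as used in Taiwan; use Taiwanese "
        "terminology and never emit Simplified Chinese characters."
    ),
    "ko-KR": "Use standard South Korean spelling and a neutral register.",
}


def build_prompt(article: str, locale: str, language_name: str) -> str:
    return PROMPT_TEMPLATE.format(
        language_name=language_name,
        locale=locale,
        locale_notes=LOCALE_NOTES.get(locale, ""),
        article=article,
    )


print(build_prompt("Hello HN", "zh-Hant-TW", "Traditional Chinese"))
```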