dreadnode / rigging

Lightweight LLM Interaction Framework
https://rigging.dreadnode.io
MIT License

Multilanguage peculiarity for Model.to_pretty_xml() #10

Closed L3G5 closed 2 months ago

L3G5 commented 3 months ago

I have the following toy example:

import rigging as rg

class Phrase(rg.Model):
    phrase: str

    @classmethod
    def xml_example(cls) -> str:
        return Phrase(
            phrase=""
        ).to_pretty_xml()

class Rephrase(rg.Model):
    rephrase: str

    @classmethod
    def xml_example(cls) -> str:
        return Rephrase(
            rephrase=""
        ).to_pretty_xml()

# Build a prompt that embeds a Japanese example phrase as XML
example_phrase = Phrase(phrase="こんにちは、お元気にお過ごしでしょうか。")
prompt = f"""Rephrase the phrase {Phrase.xml_tags()} in ten different ways. Each way should be between {Rephrase.xml_tags()}. {example_phrase.to_pretty_xml()}"""

Unfortunately, the current implementation of .to_pretty_xml() escapes non-ASCII characters and results in '<phrase>&#12371;&#12435;&#12395;&#12385;&#12399;&#12289;&#12362;&#20803;&#27671;&#12395;&#12362;&#36942;&#12372;&#12375;&#12391;&#12375;&#12423;&#12358;&#12363;&#12290;</phrase>', which degrades the performance of some LLMs. This seems to force using something like Phrase.xml_start_tag()+example_phrase.phrase+Phrase.xml_end_tag() instead of example_phrase.to_pretty_xml() for simple examples in a multilanguage setting, and it discourages complex queries (where rigging absolutely shines, in my experience with English models).

Did I miss the right way to do this? If not, are there any plans to add multilanguage support?
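
The escaping described above matches what Python's standard xml.etree.ElementTree serializer does when it produces ASCII output, which is its default. The sketch below does not use rigging at all. It only assumes, without confirmation from this thread, that to_pretty_xml() goes through a comparable serializer, and it shows how the encoding argument alone changes the result.

```python
import xml.etree.ElementTree as ET

# Standalone illustration using only the standard library (no rigging involved).
elem = ET.Element("phrase")
elem.text = "こんにちは"

# Default encoding is "us-ascii": non-ASCII characters become numeric
# character references such as &#12371;
ascii_xml = ET.tostring(elem).decode()

# encoding="unicode" returns a str and leaves the characters untouched
unicode_xml = ET.tostring(elem, encoding="unicode")

print(ascii_xml)
print(unicode_xml)
```

Both strings are equivalent XML. The difference only matters when the text is read by something that does not decode character references, such as an LLM reading the prompt as plain text.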

Anyway, thank you for this amazing project!

L3G5 commented 3 months ago

Oops, I honestly spent about half an hour trying to make it work, but it turns out that for now just using html.unescape works for my purposes.

To be clear,

import html

# Phrase and Rephrase are the models defined in the first comment
example_phrase = Phrase(phrase="こんにちは、お元気にお過ごしでしょうか。")
prompt = html.unescape(f"""Rephrase the phrase {Phrase.xml_tags()} in ten different ways. Each way should be between {Rephrase.xml_tags()}. {example_phrase.to_pretty_xml()}""")

produces just what I want to send to the model 🫠
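
One caveat with this workaround: html.unescape decodes every entity, including &lt; and &amp;, so a phrase that itself contains markup characters would no longer be well-formed XML afterwards. A narrower variant is sketched below. The helper name unescape_non_ascii is hypothetical (it is not part of rigging or the standard library), and it assumes the serializer emits decimal references like the ones shown in the output above.

```python
import html
import re

# Matches decimal numeric character references such as &#12371;
_NUMERIC_REF = re.compile(r"&#(\d+);")

def unescape_non_ascii(xml_text: str) -> str:
    """Decode decimal character references above ASCII; leave everything else escaped.

    Hypothetical helper for illustration only. Hexadecimal references
    (&#x...;) and named entities are deliberately left untouched.
    """
    def _replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        if 127 < code_point <= 0x10FFFF:
            return chr(code_point)
        return match.group(0)  # keep ASCII-range or invalid references as-is

    return _NUMERIC_REF.sub(_replace, xml_text)

escaped = "<phrase>&#12371;&#12435; &lt;b&gt; &amp; &#38;</phrase>"

everything_decoded = html.unescape(escaped)       # also turns &lt;b&gt; into <b>
only_non_ascii = unescape_non_ascii(escaped)      # keeps XML-significant escapes

print(everything_decoded)
print(only_non_ascii)
```

For prompts whose text never contains <, >, or &, the two approaches give the same result, so plain html.unescape is enough there.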

monoxgas commented 3 months ago

I haven't spent much time looking at multilanguage support, so this seems like a good opportunity to track down any issues. I'll dig in soon, thanks for the report!

monoxgas commented 2 months ago

I haven't dug too deep into this, but I was able to fix the Unicode handling in to_pretty_xml in v2.0.2.

Going to close as complete for now, and we can re-open if more issues are found.