[Feature Request] Convert `BaseMessage` to alpaca format

Wendong-Fan commented 4 days ago

Required prerequisites

[X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[ ] Consider asking first in a Discussion.

Motivation

alpaca format: {"instruction": "...", "input": "...", "output": "..."}

Solution

No response

Alternatives

No response

Additional context

No response

CaelumF commented 23 hours ago

It may not be appropriate to convert BaseMessages directly to Alpaca format, since BaseMessages are part of a conversation structure, with roles and other information that is irrelevant to the Alpaca format (and how would we convert an Alpaca message back to a BaseMessage?). Typing it this way probably adds unhelpful coupling.

I think converting to/from strings is more appropriate. See https://github.com/camel-ai/camel/compare/master...alpaca_conversion_temp

(cc @lightaime)

Wendong-Fan commented 21 hours ago

Thanks @CaelumF. The scope of this issue is just to convert BaseMessage to alpaca format; I didn't consider converting alpaca content back to BaseMessage, so additional information like the role name could just be ignored. Do we have a requirement to convert alpaca to BaseMessage?

Converting from a string using regex has 2 limitations:

1. We need to produce a string in the defined format. It's not natively supported by CAMEL, so we would still need to add some further implementation.
2. Using regex instead of extracting from a structured object is riskier and less reliable.

CaelumF commented 2 hours ago

Yeah, when it comes to the generation of alpaca items, it makes way more sense to do things in JSON, especially when structured output and JSON proficiency are available in the inference model. (I assume JSON is the textual representation you had in mind.) The linked class can be converted to/from JSON since it's a pydantic class.
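To make the JSON round trip concrete, here is a minimal sketch. It uses a stdlib dataclass as a stand-in for the pydantic class in the linked branch, so `AlpacaItem`, `to_json`, and `from_json` are hypothetical names, not CAMEL's API. With pydantic v2 the same round trip would go through `model_dump_json` and `model_validate_json`.

```python
import json
from dataclasses import asdict, dataclass

# The three fields of the alpaca format quoted in the issue description.
ALPACA_FIELDS = {"instruction", "input", "output"}


@dataclass
class AlpacaItem:
    """Hypothetical stand-in for the pydantic class in the linked branch."""

    instruction: str
    input: str
    output: str

    def to_json(self) -> str:
        # Serialize to the {"instruction", "input", "output"} JSON layout.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "AlpacaItem":
        # Parse JSON text and check that all three fields are present.
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        missing = ALPACA_FIELDS - set(data)
        if missing:
            raise ValueError(f"missing Alpaca fields: {sorted(missing)}")
        return cls(
            instruction=str(data["instruction"]),
            input=str(data["input"]),
            output=str(data["output"]),
        )


item = AlpacaItem.from_json(
    '{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}'
)
round_tripped = AlpacaItem.from_json(item.to_json())
```

The conversion only ever sees a string, so it works the same whether the text came from a message's content or from some other source.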

But I also assume the plan/expectation is to have the alpaca entries just inside of the text portion of the messages as textual representations, rather than adding any specific fields to the BaseMessage? So the source information is always in one place in the form of text, and no type or contextual information is constraining the content of those messages (like we won't have an AlpacaBaseMessage or something)

The other textual representation, the one that starts with ### Instruction ..., is used for inference and training on base models. It's not found in datasets because it's awkward for other purposes (though maybe sometimes data will be saved in that representation). I added it to the pydantic class because pydantic already makes JSON easy, and it is convenient for training and inference to have the ### representation available as well.
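For reference, a rough sketch of rendering that representation from the three fields. The exact section headers (`### Instruction:`, `### Input:`, `### Response:`) and the name `render_alpaca_prompt` are assumptions for illustration; the linked branch may lay the text out differently.

```python
def render_alpaca_prompt(instruction: str, input_text: str = "", output: str = "") -> str:
    """Render Alpaca fields as '### ...' sections (assumed layout)."""
    sections = [f"### Instruction:\n{instruction}"]
    if input_text:
        # The input section is left out entirely when there is no input.
        sections.append(f"### Input:\n{input_text}")
    # For inference the output is left empty so the model continues from here.
    sections.append(f"### Response:\n{output}")
    return "\n\n".join(sections)


prompt = render_alpaca_prompt("Translate to French", "Hello")
```

Leaving `output` empty gives an inference prompt, and passing it gives a full training example.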

Since in this conversion all of the information will be coming from one field of BaseMessage (content) which is always a String, and sometimes it will be useful to come from strings from other sources, it feels more versatile and less confusing to make the conversion just to work in terms of strings.

I can imagine some scenarios with multiple stages of data generation where it can be useful to go back from a textual representation to a validated object form too. In general, I like what is communicated by the directions in which things can be converted. The same applies if we want to parse Alpaca items that were generated by a base model trained on that format, which it seems Alpaca was. (I'm not sure exactly why JSON wasn't just always used; maybe it's because of newline handling or something.)
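Going back from the ### text to fields could look roughly like the sketch below. The section headers and the name `parse_alpaca_text` are assumptions, and as noted earlier in the thread, regex parsing like this is fragile: it will misread items whose content itself contains a `### Response:` line.

```python
import re
from typing import Dict

# Assumed layout: "### Instruction:", an optional "### Input:", then "### Response:".
_ALPACA_PATTERN = re.compile(
    r"### Instruction:\n(?P<instruction>.*?)"
    r"(?:\n\n### Input:\n(?P<input>.*?))?"
    r"\n\n### Response:\n(?P<output>.*)",
    re.DOTALL,
)


def parse_alpaca_text(text: str) -> Dict[str, str]:
    """Parse the '### ...' layout back into the three Alpaca fields."""
    match = _ALPACA_PATTERN.search(text)
    if match is None:
        raise ValueError("text is not in the expected '### Instruction' layout")
    return {
        "instruction": match.group("instruction").strip(),
        # A missing input section is treated as an empty input.
        "input": (match.group("input") or "").strip(),
        "output": match.group("output").strip(),
    }


parsed = parse_alpaca_text(
    "### Instruction:\nTranslate to French\n\n### Input:\nHello\n\n### Response:\nBonjour"
)
```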

If we want to make the conversion easily discoverable, we can add a to_alpaca function inside BaseMessage that is a single line calling the publicly available string-based conversion function with the message's content, to make it clear that only the content is coming from the BaseMessage.
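A sketch of what that could look like. `BaseMessageSketch` is a simplified stand-in and not CAMEL's actual `BaseMessage`, and `alpaca_from_string` is a hypothetical name for the public string-based conversion function.

```python
import json
from dataclasses import dataclass
from typing import Dict


def alpaca_from_string(text: str) -> Dict[str, str]:
    """Hypothetical public conversion function: JSON text to Alpaca fields."""
    data = json.loads(text)
    # A KeyError here means the text is missing one of the three fields.
    return {key: str(data[key]) for key in ("instruction", "input", "output")}


@dataclass
class BaseMessageSketch:
    """Simplified stand-in for illustration, not CAMEL's BaseMessage."""

    role_name: str
    content: str

    def to_alpaca(self) -> Dict[str, str]:
        # Single-line delegation: only `content` is used, the role is ignored.
        return alpaca_from_string(self.content)


message = BaseMessageSketch(
    role_name="assistant",
    content='{"instruction": "Add the numbers", "input": "1 + 1", "output": "2"}',
)
alpaca = message.to_alpaca()
```

Keeping the method as pure delegation means there is no Alpaca-specific state on the message, and strings from other sources can call `alpaca_from_string` directly.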
