PygmalionAI / data-toolbox

Our data munging code.
GNU Affero General Public License v3.0

Refactor training example formats and add Alpaca format #33

Closed: TearGosling closed this 1 year ago

TearGosling commented 1 year ago

Long blogpost inbound.

TL;DR: New Alpaca formats. Refactored code to support more than two formats.

Many users of the Pyg models have found our Metharme format, while a bit better than the old Pygmalion format, to be somewhat confusing and obtuse. In addition, front-end developers for UIs handling LLMs have noted that this unique format puts a strain on their development efforts, since it's yet another format they have to keep up with. With these factors in mind, it's rather likely that our next Pygmalion models will be trained with the Alpaca format. And so I set out to add the Alpaca format. Except... there's a problem. As it turns out, there are about nine million different variations on the Alpaca format (and instruct formats in general). For now, I've decided to add three: the standard Alpaca format, a "minimal Alpaca" made by Henk from KoboldAI, and "Henkpaca", another format he made which (in my own words) acts as a kind of hybrid between instruct and chat formats. It might genuinely be worth trying that last one out.
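For reference, the standard Alpaca template (the only one of the three reproduced here; the minimal Alpaca and Henkpaca variants are Henk's own definitions) looks roughly like the sketch below. This is the commonly published Alpaca prompt template written out as a Python snippet for illustration, not code taken from this PR:

```python
# Rough sketch of the standard Alpaca prompt template (illustrative, not lifted
# from this PR; the "minimal Alpaca" and "Henkpaca" variants aren't shown here).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{response}"
)

# Example usage with made-up data:
print(ALPACA_TEMPLATE.format(
    instruction="Summarize the following passage.",
    input="The quick brown fox jumps over the lazy dog.",
    response="A fox jumps over a dog.",
))
```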

In the beginning, this repo only needed to support one format: Metharme's. Then, way later on, I decided to add the old Pygmalion format. While it was quite different from the Metharme format, it was nevertheless rather simple to implement, and since there were only two formats, I only needed to write a few if-else statements and that was that. Adding three more formats, however, means that approach no longer works without piling up a bunch of elifs. A different approach was needed.

That approach comes in the form of the TurnWrapper. Instead of having a bunch of different methods on the Turn class that try to handle all the different formats, we have an abstract TurnWrapper class which takes a Turn as a parameter and has methods both for representing the turn as a string and for grabbing just the "model marker" segment of the format (if that makes any sense). Each format then inherits from this base class. The attributes of the Turn fed into a wrapper are accessible through the TurnWrapper itself, which means the turn-processing code doesn't need to change at all beyond wrapping each turn in the appropriate TurnWrapper subclass. A WRAPPER_MAP mapping is provided so that selecting which wrapper belongs to which format doesn't need any if-else statements either. All of this also means that it's pretty damn easy to add a new format: just make a subclass of TurnWrapper, add it to WRAPPER_MAP, specify it as a valid format in the new VALID_FORMATS constant in toolbox/core/training_example.py, and you're good to go.
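To make the shape of this concrete, here's a minimal sketch of the pattern. TurnWrapper, WRAPPER_MAP and VALID_FORMATS are the names used in this PR; the Turn fields, the method names and the Alpaca rendering below are illustrative assumptions, not the actual code:

```python
# Minimal sketch of the wrapper pattern described above. TurnWrapper, WRAPPER_MAP
# and VALID_FORMATS come from the PR; the Turn fields and method signatures here
# are assumptions made for illustration.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str    # e.g. "user" or "model"
    utterance: str


class TurnWrapper(ABC):
    """Wraps a Turn and knows how to render it in one specific prompt format."""

    def __init__(self, turn: Turn) -> None:
        self.turn = turn

    @abstractmethod
    def as_str(self) -> str:
        """Render the whole turn as a formatted string."""

    @abstractmethod
    def get_model_marker(self) -> str:
        """Return just the marker that precedes the model's reply."""


class AlpacaTurnWrapper(TurnWrapper):
    def as_str(self) -> str:
        header = "### Response:" if self.turn.speaker == "model" else "### Instruction:"
        return f"{header}\n{self.turn.utterance}\n\n"

    def get_model_marker(self) -> str:
        return "### Response:"


# Selecting a wrapper is a dict lookup instead of an if-else chain.
WRAPPER_MAP: dict[str, type[TurnWrapper]] = {
    "alpaca": AlpacaTurnWrapper,
    # "metharme": MetharmeTurnWrapper, "pygmalion": PygmalionTurnWrapper, ...
}

VALID_FORMATS = tuple(WRAPPER_MAP.keys())


def render_turn(turn: Turn, fmt: str) -> str:
    """Wrap a turn in the format's wrapper and render it."""
    wrapper = WRAPPER_MAP[fmt](turn)
    return wrapper.as_str()
```

With this shape, adding a format is just one more subclass plus one more WRAPPER_MAP entry; the turn-processing loop never has to care which format it's rendering.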

Honestly, I really have no idea why I wrote so much. It's like 12 AM lol, I'm just gonna merge this and go to bed