e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets! Makes: QA, RP, Classifiers.
MIT License

Random Data in Outputs #73

Open Coastline-3102 opened 1 month ago

Coastline-3102 commented 1 month ago

Hello,

I have been testing out both the role play and original pipelines, and I am seeing random undesired data in the outputs.

To use a recent RP pipeline run as an example, I deleted everything in raw_txt_input except for a single custom .txt file of a book, and ran the pipeline (with a subset size of 10) using a local 70b model loaded via Ollama.
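For reference, the parts of the config that control that kind of subset run (the key names match the full config I paste in a later comment; only the values differ) looked roughly like this:

PATH:
  INPUT: ./raw_txt_input
SYSTEM:
  USE_SUBSET: True
  SUBSET_SIZE: 10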

Looking at the full_stories_list_complete_format.json that results, I see passages like

scene_card": "As you enter Ada's meticulously organized suite, the aroma of chemical substances wafts through the air, a subtle testament to her relentless passion for chemistry.

"scene_card": "As you knock on the door to Ada Clarke's suite in the bustling heart of Victorian London, you hear some muffled noises and what sounds like a shifting of objects from inside.

"story": "In order to add a custom email template in Salesforce, you can follow these steps: 1. Log in to your Salesforce account and navigate to the Setup page (gear icon at the top right corner).

"story": "In order to add a custom email template in Salesforce, you can follow these steps: 1. Log in to your Salesforce account and navigate to the Setup page (gear icon at the top right corner).

despite there being no references to Salesforce, Ada Clarke or Victorian London in the source I provided.

I poked around in the other files, and found that Ada and Victorian London are mentioned in files in the rp pipeline prompts folder (no idea where Salesforce is coming from...).

Considering these results, I am wondering if there is an issue with augmentoolkit or the way I am running it, or if I simply need a better or bigger model that is smart enough to know not to include this content. If the latter is the case, can anyone recommend an uncensored model that can run locally? With my hardware, a quantized 70b is likely the largest I can go.

Thanks!

Coastline-3102 commented 4 weeks ago

Following up... I have continued to play around with this to try and fix it. Working within my hardware limits, I have tested

both of which are larger than the previous 70b I was working with. While the previous model did output some (mangled) stories, both of these attempts seem unable to find emotions. Here is an example of what I am seeing in the terminal:

Response:

The primary emotion of the scene is a warm, buoyant joy that fills the air like the scent of fresh blooms, an almost giddy delight that bubbles up from within and spills over in smiles and laughter. It's the kind of happiness that comes from shared moments of simple pleasure, like the gentle nose nudge of a horse or the anticipation of a hearty meal after a long day. It's a feeling that lights up the room more surely than any lamp, casting out shadows and drawing people together, making even the mundane tasks of carrying bags and ordering food feel charged with a special, shared energy. This joy is not the loud, boisterous type but rather a quiet, contented warmth that wraps around you like a familiar blanket, making the world feel just a bit cozier and more manageable. It's the joy of connection, of inside jokes, and of knowing glances, a feeling that makes the challenges of the day fade into the background, replaced by the simple pleasure of being present with someone who makes your heart feel light.

ERROR - Above prompt resulted in error, probably the model's fault: Emotion failed validations

I'm not sure how exactly to fix this, as the output does not look obviously broken or miswritten. There is some overlap with this issue, but thus far changing models has not fixed it, and I am unsure how to implement the context fix mentioned there.

e-p-armstrong commented 3 weeks ago

@Coastline-3102 Thanks for creating this issue! Sorry for not seeing it until now. I believe your analysis is on point; both problems look like a model issue. However, it's very strange that a 70b is failing the RP datagen so severely -- most of the demo dataset was generated with llama 3 70b, and even smaller models should at least get the format right (and not talk about Salesforce!)

The second issue is also a model/output format problem: emotions should begin with the emotion name in ALL CAPS followed by a colon. What's odd is that I have personally used Mistral Large to make data with RPTK successfully -- so maybe this is an issue with your inference engine, or sampling parameters? Maybe the inference engine has somehow overridden those of the pipeline? You also mentioned a custom input text; that could be the cause if it is a difficult one in some way. If you are able to share that or the config, I might be able to help diagnose the problem.
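For example, a response that matches that format starts with the label and a colon, roughly like this (the wording after the label is purely illustrative):

ANTICIPATION: The primary emotion of the scene is a quiet, building anticipation that...

Your example above jumps straight into the prose with no leading label, which is presumably why it fails the emotion validation.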

Coastline-3102 commented 3 weeks ago

Thanks for getting back to me!

However, it's very strange that a 70b is failing the RP datagen so severely -- most of the demo dataset was generated with llama 3 70b, and even smaller models should at least get the format right (and not talk about Salesforce!)

Good to know. I was also surprised that the 70b I used (midnight-miqu70b) failed so miserably. One of the reasons I tested that model is because I have used it in the past for similar workflows, where it performed well and did not ramble about things like Salesforce. If I could fix the issue such that I can get away with running something around 70b instead of a bigger model, that would be nice. I can run larger models, but that really pushes the limits of my system and slows generation to a crawl.

The second issue is also a model/output format problem: emotions should begin with the emotion name in ALL CAPS followed by a colon. What's odd is that I have personally used Mistral Large to make data with RPTK successfully -- so maybe this is an issue with your inference engine, or sampling parameters?

Strange. My initial thought is that maybe something is wrong with the context size of my model? Since the RP pipeline prompt is so large, I wonder if it is somehow "choking" on it and thus returning bad results. I will admit I am somewhat new to this, so I am unsure of the best way to troubleshoot.
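If the prompt really is overflowing the context window, one thing I can try is raising num_ctx with a custom Ollama Modelfile (8192 below is just a guess at a large-enough value, not something taken from this project's docs):

FROM datacrystals/midnight-miqu103b-v1
PARAMETER num_ctx 8192

and then creating and loading that variant:

ollama create midnight-miqu103b-8k -f Modelfile
ollama run midnight-miqu103b-8k

with LOGICAL_MODEL_A/B in the config pointed at the new name.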

if you are able to share that or the config I might be able to help diagnose the problem?

Sure. I will say I have made fairly minimal changes to the base config:

API:
  API_KEY_A: key
  API_KEY_B: key2
  BASE_URL_A: http://localhost:5001/v1/
  BASE_URL_B: http://localhost:5001/v1/
  LOGICAL_MODEL_A: datacrystals/midnight-miqu103b-v1
  LOGICAL_MODEL_B: datacrystals/midnight-miqu103b-v1
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ./raw_txt_input
  OUTPUT: ./output
  PROMPTS: ./prompts
PHASES:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SYSTEM:
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 20
  EMOTIONS: ['DOMINANCE', 'FEARLESSNESS', 'EMBARASSMENT', 'NIHILISM',
    'DETERMINATION', 'DESPERATION', 'LOSS', 'NOSTALGIA', 'ANTICIPATION',
    'TRUST', 'FEAR', 'DISORIENTATION', 'DEGRADATION']
  INCLUDE_CHUNK_IN_PROMPT: True
  MODE_A: api
  MODE_B: api
  PICK_EMOTION: True
  RP_PROMPT_END: ''
  RP_PROMPT_START: ''
  STOP: True
  SUBSET_SIZE: 30
  USE_MIN_P: True
  USE_SUBSET: True
  CHUNK_SIZE: 2000
SCRAPING:
  USE_LIGHTNOVELCO: False
  LNCO_BASE_URL: https://www.lightnovelworld.co
  LNCO_RANKING_URL: https://www.lightnovelworld.co/ranking
  LNCO_CHAPTER_COUNT: 5
  LNCO_NOVEL_COUNT: 5
  LNCO_WAIT_TIME: 10
  LNCO_MAX_WORKERS: 5

I have been using Ollama as my inference engine, and have not made any major changes to Ollama itself. In most cases I just directly run the model from the command line. For example, with the above config I would just run ollama run datacrystals/midnight-miqu103b-v1 followed by python run_augmentoolkit.py in another terminal.
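In other words, the whole workflow is just two terminals:

# terminal 1: load the model in Ollama
ollama run datacrystals/midnight-miqu103b-v1

# terminal 2: start the pipeline
python run_augmentoolkit.py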

You also mentioned a custom input text; that could be the cause if it is a difficult one in some way.

I doubt that is causing the issue. While it is true that I'm not using one of the provided sample texts, it is just a normal novel that I converted into a .txt file (and verified that it is not corrupted or anything), which should not be any more difficult than the moby_dick_sample.txt.