LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0
5.35k stars 364 forks source link

Em dash in DRY crashes koboldcpp #1118

Closed fuhrriel closed 2 months ago

fuhrriel commented 2 months ago

Describe the Issue Adding em dash "—" to the list of seq breakers for DRY causes a crash on first generation attempt. Issue occurs for me 100% of the time, regardless of model or preset used.

Additional Information: "dry_multiplier": 0.8, "dry_base": 1.75, "dry_allowed_length": 2, "dry_penalty_last_n": 320, "dry_sequence_breakers": ["\n", ":", "\"", "*", "\u2014"]

terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted (core dumped)
LostRuins commented 2 months ago

I cannot repro this issue. It works perfectly fine for me.

Can you send me a .json save with this issue?

fuhrriel commented 2 months ago

Sure: saved_story.json

I've reset settings to default and changed only DRY mult. to 0.8 and added "—" to sequence breaks. Crash happens any time I try to generate more/retry.

sais-github commented 2 months ago

I get hit with "[WinError -529697949] Windows Error 0xe06d7363" if i add the emdash in the kobold lite ui while using cublas Or "[WinError -1073741569] Windows Error 0xc00000ff" when using openblas. As soon as it's removed it functions again. Microsoft has no information relating to the cublas crash but the open blash crash is an ntstatus thing "STATUS_BAD_FUNCTION_TABLE" In debug it doesn't get past processing? "Processing 5 dry break strings...[WinError -1073741569] Windows Error 0xc00000ff" To be sure it wasn't the ui, it also runs into the same error in ST

LostRuins commented 2 months ago

It seems like DRY does not handle unicode strings correctly, it's calculating a substring of a unicode sequence (e.g. em dash is 226,128,148, it finds a partial match and tries to tokenize the substring of the sequence 128,148 alone during the GetOverlappingTokenSequences which is not a valid unicode character and causes a segfault.

I'm going to set it to skip overlap matches if the current sequence breaker is only extended unicode characters. Meanwhile @pi6am might want to take a look at this.

The hotfix is in https://github.com/LostRuins/koboldcpp/commit/c78690737c5ed92b59e06b81eb663d91ac55eb43

LostRuins commented 2 months ago

Should be fixed.