MrAshRhodes opened 5 months ago

Hi,
I've been trying to scrape the OpenAI API docs for testing and I'm constantly getting the following error. Would anyone know how to resolve it?
I'm using the Docker image, if that makes any difference.
+1 to this, after crawling 5000+ pages sad to have it fail!
@ryanspice I think I've got a workaround.
I made changes to two files, GptEncoding.js and specialTokens.js, in node_modules\gpt-tokenizer\esm\.
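For context, the error this works around is the tokenizer refusing input that contains a special token as literal text. A minimal repro sketch, assuming the top-level encode export and the tiktoken-style default of disallowing all special tokens (the exact message may vary by version):

    import { encode } from 'gpt-tokenizer';

    // Scraped pages can contain a special token as literal text.
    const scraped = 'The docs mention the <|endoftext|> marker.';

    // With default options this throws something like:
    // Error: Disallowed special token found: <|endoftext|>
    const tokens = encode(scraped);

One caveat: edits under node_modules are lost on the next npm install, so you'll need to reapply them (or persist them with something like patch-package).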
Change the following function in GptEncoding.js:
    encodeGenerator(lineToEncode, {
      allowedSpecial = new Set(),
      disallowedSpecial = new Set(),
    } = {}) {
      // Assuming ALL_SPECIAL_TOKENS is a placeholder for all special tokens
      if (disallowedSpecial.has(ALL_SPECIAL_TOKENS)) {
        disallowedSpecial = new Set(this.specialTokenMapping.keys());
        allowedSpecial.forEach((token) => disallowedSpecial.delete(token));
      }
      // Check for disallowed tokens in the input
      disallowedSpecial.forEach((token) => {
        if (lineToEncode.includes(token)) {
          throw new Error(`Disallowed special token found: ${token}`);
        }
      });
      return this.bytePairEncodingCoreProcessor.encodeNative(lineToEncode, allowedSpecial);
    }
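With that change, nothing is disallowed unless the caller opts in, so the default call stops throwing on scraped text. A quick sketch of the behaviour I'd expect after patching (assuming encode forwards its options to encodeGenerator):

    import { encode } from 'gpt-tokenizer';

    // Default call: empty disallowed set, so special-token text
    // is encoded as ordinary text instead of throwing.
    encode('The docs mention the <|endoftext|> marker.');

    // Strict checking is still available by opting in explicitly:
    encode('The docs mention the <|endoftext|> marker.', {
      disallowedSpecial: new Set(['<|endoftext|>']),
    }); // throws: Disallowed special token found: <|endoftext|>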
Then in specialTokens.js, replace the special token string constants (by default these are the literal token strings such as "<|endoftext|>") with placeholder values that won't appear in scraped pages:
    export const EndOfText = "<EOT>";
    export const FimPrefix = "<FimPrefix>";
    export const FimMiddle = "<FimMiddle>";
    export const FimSuffix = "<FimSuffix>";
    export const ImStart = "<ImStart>";
    export const ImEnd = "<ImEnd>";
    export const ImSep = "<ImSep>";
    export const EndOfPrompt = "<EndOfPrompt>";
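Bear in mind that renaming the constants means real <|endoftext|> markers in input are no longer recognised as special. If your version of gpt-tokenizer accepts encode options, a less invasive alternative (a sketch I haven't verified against every release) is to leave the package untouched and pass an empty disallowed set at the call site:

    import { encode } from 'gpt-tokenizer';

    // Treat special-token text in scraped pages as ordinary text.
    const tokens = encode(scrapedPage, { disallowedSpecial: new Set() });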