BuilderIO / gpt-crawler

Crawl a site to generate knowledge files to create your own custom GPT from a URL
https://www.builder.io/blog/custom-gpt
ISC License
18.14k stars 1.88k forks source link

Disallowed Special Token #130

Open MrAshRhodes opened 5 months ago

MrAshRhodes commented 5 months ago

Hi,

Ive been trying to scrape the openai api docs for testing and im constantly getting the following error. Would anyone know how to resolve?

Im using the docker image if it makes any difference.

Found 634 files to combine...
file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78
                throw new Error(`Disallowed special token found: ${match[0]}`);
                      ^

Error: Disallowed special token found: <|endoftext|>
    at GptEncoding.encodeGenerator (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78:23)
    at GptEncoding.isWithinTokenLimit (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:147:20)
    at addContentOrSplit (file:///home/gpt-crawler/dist/src/core.js:131:28)
    at write (file:///home/gpt-crawler/dist/src/core.js:156:15)
    at async file:///home/gpt-crawler/dist/src/main.js:4:1

Node.js v20.10.0
Crawling complete..
ryanspice commented 5 months ago

+1 to this, after crawling 5000+ pages sad to have it fail!

MrAshRhodes commented 5 months ago

@ryanspice I think I've got a workaround.

I made changes to the following files GptEncoding.js and specialTokens.jsin node_modules\gpt-tokenizer\esm\

Change the following function in GptEncoding.js

    encodeGenerator(lineToEncode, { allowedSpecial = new Set(), disallowedSpecial = new Set(), } = {}) {
        // Assuming ALL_SPECIAL_TOKENS is a placeholder for all special tokens
        if (disallowedSpecial.has(ALL_SPECIAL_TOKENS)) {
            disallowedSpecial = new Set(this.specialTokenMapping.keys());
            allowedSpecial.forEach(token => disallowedSpecial.delete(token));
        }

        // Check for disallowed tokens in the input
        disallowedSpecial.forEach(token => {
            if (lineToEncode.includes(token)) {
                throw new Error(`Disallowed special token found: ${token}`);
            }
        });

        return this.bytePairEncodingCoreProcessor.encodeNative(lineToEncode, allowedSpecial);
    }

Then in specialTokens.js replace these bits.

export const EndOfText = "<EOT>";
export const FimPrefix = "<FimPrefix>";
export const FimMiddle = "<FimMiddle>";
export const FimSuffix = "<FimSuffix>";
export const ImStart = "<ImStart>";
export const ImEnd = "<ImEnd>";
export const ImSep = "<ImSep>";
export const EndOfPrompt = "<EndOfPrompt>";