axolotl-ai-cloud / axolotl


[Feat] Large corpus splitting #297

Closed: maximegmd closed this issue 1 year ago

maximegmd commented 1 year ago

Hello,

As we discussed on Discord, pretraining is usually done on large chunks of text (books, entire web pages, articles) that are larger than the context size. With this in mind, I propose that we introduce an automatic split mechanism for such datasets.

The reasoning behind doing this within Axolotl rather than ahead of time is that pre-splitting locks the dataset to a specific context size and would require a new generation pass to target another context size.

I also propose the following settings:

ashercn97 commented 1 year ago

Yess

theobjectivedad commented 1 year ago

+1

winglian commented 1 year ago

I remember seeing how LLaMA (v1) did pretraining; I think they included metadata on each set of 2048 tokens. If we can track down the appropriate algorithm for this, and how the metadata for each row should be formatted, I think we can make some progress on this soon.
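For context, the standard pretraining preprocessing (as in the Hugging Face language-modeling examples) is to tokenize documents, concatenate them with an EOS separator, and slice the stream into fixed-length blocks. A minimal sketch, where the 2048 block size and the EOS id are illustrative assumptions rather than anything LLaMA-specific:

```python
from typing import Iterable, Iterator


def pack_into_blocks(
    token_streams: Iterable[list[int]],
    block_size: int = 2048,  # illustrative; matches the 2048 mentioned above
    eos_token_id: int = 2,   # assumption: a LLaMA-style EOS id
) -> Iterator[list[int]]:
    """Concatenate tokenized documents (EOS-separated) and emit
    fixed-length blocks; a trailing partial block is dropped."""
    buffer: list[int] = []
    for ids in token_streams:
        buffer.extend(ids)
        buffer.append(eos_token_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
```

Any per-row metadata (source document id, offset) could be attached at the point each block is emitted.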

NanoCode012 commented 1 year ago

> The reasoning behind doing this within Axolotl rather than ahead of time is that pre-splitting locks the dataset to a specific context size and would require a new generation pass to target another context size.

Regarding this, there is a `pretraining_dataset` config option that enables streaming for completion-style pre-training. Also, thanks to a PR, tokenization should be much faster than before.
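For readers unfamiliar with that option: streaming means examples are pulled and tokenized lazily instead of preprocessing the whole corpus up front, via the `datasets` library's streaming mode. A minimal sketch of the pattern (the dataset name is just a placeholder):

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset: examples are fetched and
# processed lazily, so no full download or tokenization pass is needed.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
for example in ds.take(2):
    print(example["text"][:80])
```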

maximegmd commented 1 year ago

I don't see how that's relevant. Is there a sliding window applied during tokenization when streaming completion data?
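For what it's worth, Hugging Face fast tokenizers do support a sliding window natively: `return_overflowing_tokens=True` makes the tokenizer emit every `max_length`-sized chunk of a long text rather than truncating, and `stride` sets the token overlap between consecutive chunks. A minimal sketch (the model and numbers are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model
long_text = " ".join(["lorem ipsum"] * 5000)

enc = tokenizer(
    long_text,
    max_length=1024,
    truncation=True,
    return_overflowing_tokens=True,  # emit all chunks, not just the first
    stride=64,                       # tokens shared between adjacent chunks
)
print(len(enc["input_ids"]))  # number of overlapping chunks covering the text
```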

mhenrichsen commented 1 year ago

How about something like this?

```python
from typing import Generator


class CompletionPrompter:
    """
    Prompter for completion. Note that splitting operates on characters,
    so context_size and overlap_tokens are character budgets that serve
    as a rough proxy for token counts.
    """

    def __init__(self, context_size: int, context_frac: float = 0.8, overlap_tokens: int = 0):
        self.context_size = context_size
        self.context_frac = context_frac
        self.overlap_tokens = overlap_tokens
        # Guard against a window that never advances (infinite loop).
        assert overlap_tokens < int(context_frac * context_size), \
            "overlap must be smaller than the chunk size"

    def _split_text(self, text: str) -> Generator[str, None, None]:
        # Aim below the full context size so that tokenizing a chunk
        # is unlikely to overflow the model's context window.
        target_size = int(self.context_frac * self.context_size)
        start_idx = 0

        while start_idx < len(text):
            end_idx = min(start_idx + target_size, len(text))
            yield text[start_idx:end_idx]

            # Step back by the overlap so consecutive chunks share context;
            # once the end of the text is reached, terminate the loop.
            start_idx = end_idx - self.overlap_tokens if end_idx < len(text) else len(text)

    def build_prompt(
        self,
        instruction: str,
        input=None,  # pylint: disable=redefined-builtin, unused-argument
        output=None,  # pylint: disable=unused-argument
    ) -> Generator[str, None, None]:
        if len(instruction) > self.context_size:
            yield from self._split_text(instruction)
        else:
            yield instruction
```
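A quick usage sketch of the above (values are illustrative):

```python
prompter = CompletionPrompter(context_size=2048, context_frac=0.8, overlap_tokens=128)
long_document = "lorem ipsum " * 5000  # stand-in for a book or web page

for chunk in prompter.build_prompt(long_document):
    print(len(chunk))  # each chunk is at most int(0.8 * 2048) characters
```

One caveat: despite the parameter names, the splitting counts characters, not tokens, so `context_size` and `overlap_tokens` only approximate token counts.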
kmn1024 commented 1 year ago

mhenrichsen's code snippet looks perfect for the feature request. Is anything preventing it from being committed?

NanoCode012 commented 1 year ago

Uhm, I think this may already have been implemented. https://github.com/OpenAccess-AI-Collective/axolotl/blob/a21935f07af9d825d7730fe944d29cfdef3a5337/src/axolotl/prompt_strategies/completion.py#L52-L57
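For readers who don't follow the link: the linked lines appear to take the same splitting approach, tokenizing the full text and slicing the resulting ids into max-length chunks. A rough sketch of that idea (names are assumptions, not the actual source):

```python
def split_tokenized(input_ids: list[int], max_length: int) -> list[list[int]]:
    # Slice a tokenized completion into consecutive max_length-sized chunks.
    return [input_ids[i : i + max_length] for i in range(0, len(input_ids), max_length)]
```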