Yess
+1
I remember seeing how LLaMA (v1) did pretraining, and I think they included metadata on each set of 2048 tokens, maybe? If we can track down the appropriate algorithm for this and how the metadata for each row should be formatted, I think we can make some progress on this soon.
Regarding this, there is a pretraining_dataset config which allows streaming for completion-type pre-training. Also, thanks to a PR, tokenization should be much faster than before.
I don't understand how that's relevant? Is there a sliding window with regards to tokenization when streaming completion data?
How about something like this?
from typing import Generator


class CompletionPrompter:
    """
    Prompter for completion-style data that splits texts longer than the
    context size into overlapping chunks.
    """

    def __init__(self, context_size: int, context_frac: float = 0.8, overlap_tokens: int = 0):
        self.context_size = context_size
        self.context_frac = context_frac
        self.overlap_tokens = overlap_tokens

    def _split_text(self, text: str) -> Generator[str, None, None]:
        # Note: lengths here are measured in characters, not tokens.
        target_size = int(self.context_frac * self.context_size)
        start_idx = 0
        while start_idx < len(text):
            end_idx = min(start_idx + target_size, len(text))
            yield text[start_idx:end_idx]
            # Step back by the overlap so consecutive chunks share some context.
            start_idx = end_idx - self.overlap_tokens if end_idx < len(text) else len(text)

    def build_prompt(
        self,
        instruction: str,
        input=None,  # pylint: disable=redefined-builtin, unused-argument
        output=None,  # pylint: disable=unused-argument
    ) -> Generator[str, None, None]:
        # Split only when the text exceeds the context size.
        if len(instruction) > self.context_size:
            yield from self._split_text(instruction)
        else:
            yield instruction
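For illustration, here is a quick usage sketch of that snippet (the numbers are arbitrary; note that the splitting above measures lengths in characters rather than tokens):

prompter = CompletionPrompter(context_size=2048, context_frac=0.8, overlap_tokens=64)
long_text = "lorem ipsum dolor sit amet " * 500  # stand-in for a long article or book
chunks = list(prompter.build_prompt(long_text))
# Each chunk is at most int(0.8 * 2048) = 1638 characters long, and with
# overlap_tokens=64 the last 64 characters of a chunk reappear at the start of the next.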
mhenrichsen's code snippet looks perfect for the feature request. Is there something that's preventing this from being committed?
Uhm, I think this may already have been implemented. https://github.com/OpenAccess-AI-Collective/axolotl/blob/a21935f07af9d825d7730fe944d29cfdef3a5337/src/axolotl/prompt_strategies/completion.py#L52-L57
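If those lines do what is suggested, the idea is roughly token-level chunking after tokenization. Purely as an illustration of that approach (this is not the actual axolotl code, and the function name and dict keys are my own assumptions):

from typing import Dict, Generator, List


def chunk_tokenized_example(
    input_ids: List[int], sequence_len: int
) -> Generator[Dict[str, List[int]], None, None]:
    # Split an already-tokenized example into fixed-size windows so each
    # training row fits within the model's context length.
    for i in range(0, len(input_ids), sequence_len):
        chunk = input_ids[i : i + sequence_len]
        yield {
            "input_ids": chunk,
            "attention_mask": [1] * len(chunk),
            "labels": list(chunk),
        }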
Hello,
As we discussed on Discord, pretraining is usually done on large chunks of text (books, entire web pages, articles) that are larger than the context size. With this in mind, I propose that we introduce an automatic split mechanism for such datasets.
The reasoning behind doing this within Axolotl and not doing it ahead of time is that doing it ahead of time locks the dataset to a specific context size and would require a new generation pass to target another context size.
I also propose the following settings:
context_frac: 0.8
: Splits the text into chunks of up to context_frac * context size.
split_regexes
: A list of regexes used to split the text; this can help improve the split points.
overlap_tokens
: How many tokens overlap between two split chunks.
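To make the split_regexes idea concrete, here is a rough sketch of how a splitter could prefer regex boundaries near the target size. The function name, the default patterns, and the character-based length accounting are illustrative assumptions, not an existing API:

import re
from typing import Generator, Sequence


def split_with_regexes(
    text: str,
    context_size: int,
    context_frac: float = 0.8,
    split_regexes: Sequence[str] = (r"\n\n", r"(?<=\.)\s"),
    overlap_tokens: int = 0,
) -> Generator[str, None, None]:
    # Hypothetical splitter: aim for context_frac * context_size per chunk,
    # but prefer cutting at a paragraph or sentence boundary when one exists.
    target_size = int(context_frac * context_size)
    start_idx = 0
    while start_idx < len(text):
        end_idx = min(start_idx + target_size, len(text))
        if end_idx < len(text):
            window = text[start_idx:end_idx]
            for pattern in split_regexes:
                matches = list(re.finditer(pattern, window))
                if matches:
                    # Cut at the last boundary found inside the window.
                    end_idx = start_idx + matches[-1].end()
                    break
        yield text[start_idx:end_idx]
        if end_idx >= len(text):
            break
        # Step back by the overlap, but always make forward progress.
        start_idx = max(end_idx - overlap_tokens, start_idx + 1)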