Right now, we have:

- streaming_language_modeling (which we use mainly for pre-training; it requires data to be streamed in as text and tokenized on the fly, rather than tokenized ahead of time)
- language_modeling (which was previously used for pre-training but has since been replaced by streaming_language_modeling)
- streaming_finetune_language_modeling (which another team has used for fine-tuning)
- language_modeling_inference_for_models_trained_with_streaming (originally added for evals; unclear if it is still needed)

We should remove all but one of these to simplify the codebase.