epfLLM / Megatron-LLM

distributed trainer for LLMs

Feature Request: Can we directly use a HuggingFace dataset for training? #65

Closed dumpmemory closed 9 months ago

dumpmemory commented 10 months ago

Can we use a HuggingFace dataset instead of the Megatron-style dataset for training?

dumpmemory commented 10 months ago

These might be useful: https://github.com/huggingface/Megatron-LM and https://github.com/huggingface/accelerate/blob/69e4c3c54da3201eda288b500d138761e7a5221c/src/accelerate/utils/megatron_lm.py

kylematoba commented 10 months ago

Hi, did you see this: https://epfllm.github.io/Megatron-LLM/guide/weights_conversion.html?

dumpmemory commented 10 months ago

> Hi, did you see this: https://epfllm.github.io/Megatron-LLM/guide/weights_conversion.html?

Yes, I did. Your post is about the weights, not the dataset.

martinjaggi commented 9 months ago

For the weights, conversion to and from HF is already well supported.

For dataset loaders, we currently stick to the Megatron-LM one and the pipeline also used by Open Assistant.

If people agree that the other data loader would be beneficial and it is proven to work well in the distributed setting, feel free to re-open and/or file a PR.
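
In the meantime, a rough workaround is to export the HF dataset to JSONL and feed it through the Megatron-style preprocessor. The sketch below assumes the HuggingFace `datasets` library and the upstream Megatron-LM `tools/preprocess_data.py` interface; the dataset name, file names, and flags are illustrative and may differ slightly in this repo.

```python
# A minimal sketch, not an official recipe: dump a HuggingFace dataset to JSONL so the
# existing Megatron-style preprocessing/indexing pipeline can consume it.
from datasets import load_dataset

# Any text dataset works; wikitext is used here only as an example.
ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Drop empty rows; the Megatron preprocessor expects one JSON object per line
# with a "text" field (the default key).
ds = ds.filter(lambda ex: len(ex["text"].strip()) > 0)
ds.to_json("train.jsonl")  # datasets writes JSON Lines by default

# Then build the indexed binary dataset with the Megatron-style preprocessor, e.g.:
#   python tools/preprocess_data.py \
#       --input train.jsonl \
#       --output-prefix my_corpus \
#       --tokenizer-type SentencePieceTokenizer \
#       --vocab-file tokenizer.model \
#       --workers 8 --append-eod
# (flag names follow upstream Megatron-LM conventions; check this repo's script)
# and point the training data path at the resulting my_corpus_text_document files.
```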