Closed sunilitggu closed 1 year ago
@sunilitggu it seems you do not have enough disk space. Could you try increasing the disk space allocated for this Docker container?
@sunilitggu if you are not able to increase the disk size, you can consider:
@sunilitggu is your problem solved?
Not yet.
@sunilitggu Have you communicated with @aurickq about the suggested solution?
The root cause is that you're loading the entire training dataset into the Ray object store; once 60% of the object store is in use, Ray starts spilling some of its contents to disk. Meanwhile, your Docker container does not have enough disk space allocated to hold the spilled data.
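To check whether the container actually has room for spilled objects, a minimal sketch along these lines may help. It assumes objects are spilled somewhere under `/tmp` (Ray's default temporary directory is `/tmp/ray`; adjust the path if your setup differs). The helper names `free_gib` and `can_hold_spill` and the 20% headroom are illustrative choices, not part of Ray or Alpa. The expected spill size is taken from the final raylet log line reported later in this thread.

```python
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def can_hold_spill(path, expected_spill_gib, headroom=1.2):
    """True if `path` has room for the expected spill plus a safety margin."""
    return free_gib(path) >= expected_spill_gib * headroom


if __name__ == "__main__":
    # Assumption: spilled objects land under /tmp inside the container.
    spill_dir = "/tmp"
    # 131207 MiB is the last cumulative "Spilled" value in the raylet log.
    expected_gib = 131207 / 1024
    print(f"free space at {spill_dir}: {free_gib(spill_dir):.1f} GiB")
    print(f"enough for ~{expected_gib:.0f} GiB of spill: "
          f"{can_hold_spill(spill_dir, expected_gib)}")
```

Running this inside the container (rather than on the host) matters, because the container may see a much smaller filesystem than the node itself.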
Could you try:
Looking at the errors below, the amount of data spilled is around 131 GB in total, roughly 500 MB per object. Could you figure out what these objects are? That might give some hints.
(raylet) Spilled 6144 MiB, 12 objects, write throughput 1257 MiB/s.
(raylet) Spilled 11265 MiB, 21 objects, write throughput 1304 MiB/s.
(raylet) Spilled 17410 MiB, 35 objects, write throughput 1431 MiB/s.
(raylet) Spilled 35844 MiB, 69 objects, write throughput 1606 MiB/s.
(raylet) Spilled 65800 MiB, 146 objects, write throughput 1562 MiB/s.
(raylet) Spilled 131207 MiB, 636 objects, write throughput 1740 MiB/s.
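As a rough way to see how large the spilled objects are, the log lines above can be parsed with a few lines of Python. This is only a sketch: it assumes the "Spilled" counters are cumulative totals, and the helper names `parse_spill_log` and `mean_object_sizes` are made up for illustration.

```python
import re

# Log lines copied from the raylet output above (ANSI color codes removed).
LOG = """\
(raylet) Spilled 6144 MiB, 12 objects, write throughput 1257 MiB/s.
(raylet) Spilled 11265 MiB, 21 objects, write throughput 1304 MiB/s.
(raylet) Spilled 17410 MiB, 35 objects, write throughput 1431 MiB/s.
(raylet) Spilled 35844 MiB, 69 objects, write throughput 1606 MiB/s.
(raylet) Spilled 65800 MiB, 146 objects, write throughput 1562 MiB/s.
(raylet) Spilled 131207 MiB, 636 objects, write throughput 1740 MiB/s.
"""

PATTERN = re.compile(r"Spilled (\d+) MiB, (\d+) objects")


def parse_spill_log(text):
    """Return a list of (total_mib, total_objects) tuples, one per log line."""
    return [(int(m.group(1)), int(m.group(2))) for m in PATTERN.finditer(text)]


def mean_object_sizes(samples):
    """Average object size in MiB for each (total_mib, total_objects) sample."""
    return [mib / count for mib, count in samples]


if __name__ == "__main__":
    samples = parse_spill_log(LOG)
    for (mib, count), avg in zip(samples, mean_object_sizes(samples)):
        print(f"{count:4d} objects, {mib:7d} MiB total, {avg:6.1f} MiB per object")
```

On these lines the average starts at about 512 MiB per object and falls to about 206 MiB per object by the last line, so the objects spilled later appear to be smaller than the first ones.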
I am not familiar with SIF so cannot help further in this direction.
@sunilitggu is your problem solved?
@zhisbug Not yet. I haven't got time to explore it further.
Closed due to inactivity.
Please describe the bug
Trying to train a GPT-3 model with 6.7B parameters using the code at https://github.com/alpa-projects/alpa/blob/main/examples/gpt2/run_clm_flax.py on 2 nodes, each with 8 V100 GPUs, clustered using Ray.
Please describe the expected behavior
System information and environment
To Reproduce
Steps to reproduce the behavior:
Command used to run: sbatch call_alpa.sh
call_alpa.sh
module load singularity
export ALPA_IMG="/home/pub/singularity/general/alpa-v0.1.6.sif"
export SINGULARITYENV_PREPEND_PATH="/opt/conda/envs/alpa/bin"
export HEAD="$(scontrol show hostname ${SLURM_NODELIST} | head -n1)"
export RAY_PORT="6379"
srun singularity exec --nv --bind /home/ss1/project/language_model/model_dumps/alpa_models:/data ${ALPA_IMG} bash alpa_cmd.sh
config.json
Screenshots
Docker file used to create an image
Additional information
The Docker image is converted to a Singularity image, and the job is run from the resulting SIF file.