bytedance / byteps

A high performance and generic framework for distributed DNN training

Sparse Model Support #168

Open xiongji opened 4 years ago

xiongji commented 4 years ago

Is your feature request related to a problem? Please describe.
I trained a DeepFM model using BytePS and Horovod. BytePS runs at 78 seconds/100 steps, while Horovod runs at 18 seconds/100 steps. The key performance cost is in the embedding, which has over 30,000,000 rows.

Describe the solution you'd like

When will BytePS support sparse models? If I want to implement it myself, can you give me some advice? Otherwise I will have to abandon my evaluation of BytePS and use Horovod instead.

bobzhuyb commented 4 years ago

Properly supporting sparse models requires some work, and it won't happen in 2019. Right now our priority is still to push the limit of training dense models.

If possible, would you provide some reference training scripts (like your DFM model)? We'll see what we can do.

We will merge our new server implementation soon. https://github.com/bytedance/byteps/pull/151 I believe it will be much easier to implement sparse model support then.

xiongji commented 4 years ago

> Properly supporting sparse models requires some work, and it won't happen in 2019. Right now our priority is still to push the limit of training dense models.
>
> If possible, would you provide some reference training scripts (like your DFM model)? We'll see what we can do.

I'm afraid I cannot put the code here... It's basically like this:

self.weights["id_dense_emb"] = tf.Variable(
tf.random_uniform([30000000, 8], 0.0, 1.0),
name="id_dense_emb") 
self.user_embed = tf.nn.embedding_lookup(self.weights[id_dense_emb"], self.id_denses)
and then some concat and math calculate transforms ...
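For scale (a rough back-of-the-envelope estimate, not a measurement from this issue): with float32, the default dtype of tf.random_uniform, densely synchronizing this embedding's gradient moves about

30,000,000 rows × 8 columns × 4 bytes ≈ 0.96 GB

per step, even though each batch only touches a small fraction of the rows. That is why the embedding dominates the step time reported above.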

> We will merge our new server implementation soon. #151 I believe it will be much easier to implement sparse model support then.

Looking forward to your new server...

bobzhuyb commented 4 years ago

Is self.id_denses static during the whole training?

xiongji commented 4 years ago

> Is self.id_denses static during the whole training?

No, it's read from the input samples:

self.batch_data = self.iterator.get_next()
self.id_denses = self.batch_data['id_dense']

Can I compile your server branch now if I attempt to implement sparse model support? Also, should I use your bytescheduler branch for the scheduler if I use your own server?

ymjiang commented 4 years ago

@xiongji Sure, you can use the server branch right now; the core code is ready and you don't need to change any other Python script in order to run it. I think https://github.com/bytedance/byteps/pull/151 will be merged in a week or two.

You won't need the bytescheduler branch for the scheduler. I know it might be a little bit confusing though..

xiongji commented 4 years ago

> @xiongji Sure, you can use the server branch right now; the core code is ready and you don't need to change any other Python script in order to run it.
>
> You won't need the bytescheduler branch for the scheduler. I know it might be a little bit confusing though...

OK, I will read through the BytePS server source code again.

bobzhuyb commented 4 years ago

Okay, so there are basically two ways:

  1. Enable BytePS to do allgather (or something equivalent) on the gradients; this is the same as what Horovod does.
  2. Push and pull the parameters of the huge embedding layer based on self.id_denses. This fits the "PS" communication pattern better.

Either way, I believe it can only be done in the C++ core logic for synchronous training. However, if you are doing asynchronous training, the second approach may be done in the Python layer.
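For concreteness, here is a minimal sketch of what the second approach could look like at the Python level, assuming a hypothetical push_pull_rows primitive (BytePS does not provide one today): only the embedding rows touched by the current batch are exchanged, rather than the full 30,000,000 × 8 table.

```python
import tensorflow as tf

def sparse_sync_lookup(embedding_var, ids, push_pull_rows):
    """Look up embedding rows and synchronize only the rows touched by `ids`.

    Assumes `ids` is a 1-D tensor of row indices. `push_pull_rows` is a
    placeholder for a row-wise push/pull primitive (indices + values); it is
    not part of BytePS today.
    """
    # Deduplicate the ids in the batch so each row is transferred only once.
    unique_ids, remap = tf.unique(tf.reshape(ids, [-1]))
    # Gather only the rows this batch actually needs.
    local_rows = tf.gather(embedding_var, unique_ids)
    # Hypothetical communication step: push the selected rows to the servers
    # and pull back the aggregated, up-to-date values.
    synced_rows = push_pull_rows(unique_ids, local_rows)
    # Map the deduplicated rows back to the original batch layout.
    return tf.gather(synced_rows, remap)
```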

xiongji commented 4 years ago

> Okay, so there are basically two ways:
>
>   1. Enable BytePS to do allgather (or something equivalent) on the gradients; this is the same as what Horovod does.
>   2. Push and pull the parameters of the huge embedding layer based on self.id_denses. This fits the "PS" communication pattern better.
>
> Either way, I believe it can only be done in the C++ core logic for synchronous training. However, if you are doing asynchronous training, the second approach may be done in the Python layer.

Thanks, I want to do synchronous training. Do I need to push/pull the indices and values of the grads separately in the Python layer?

bobzhuyb commented 4 years ago

> Thanks, I want to do synchronous training. Do I need to push/pull the indices and values of the grads separately in the Python layer?

If you go with the second approach, yes, you need to do that in the Python layer, and ultimately it requires a C++ layer implementation.

There is one more ugly way to do this... but it may be the most backward-compatible: in addition to BytePS, we also init Horovod. Whenever we see a sparse tensor that needs synchronization, we fall back to Horovod's allgather.

This would require that BytePS can be launched by MPI (because Horovod's allgather depends on it). We need a new MPI-based launcher. The good thing is that all of this can be done in pure Python...
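For illustration, a rough sketch of that fallback, assuming byteps.tensorflow's push_pull op is used for dense tensors and Horovod's allgather for sparse ones; the MPI-compatible launcher mentioned above is the missing piece and is not shown here.

```python
import tensorflow as tf
import byteps.tensorflow as bps
import horovod.tensorflow as hvd

# Both frameworks are initialized; this presumes a launcher that satisfies
# both BytePS and MPI/Horovod, which does not exist yet.
bps.init()
hvd.init()

def synchronize_gradient(grad):
    if isinstance(grad, tf.IndexedSlices):
        # Sparse gradient: gather every worker's values and indices and
        # average, the same way Horovod treats sparse tensors.
        values = hvd.allgather(grad.values) / hvd.size()
        indices = hvd.allgather(grad.indices)
        return tf.IndexedSlices(values, indices,
                                dense_shape=grad.dense_shape)
    # Dense gradient: use the regular BytePS push_pull path.
    return bps.push_pull(grad, average=True)
```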

xiongji commented 4 years ago

> If you go with the second approach, yes, you need to do that in the Python layer, and ultimately it requires a C++ layer implementation.
>
> There is one more ugly way to do this... but it may be the most backward-compatible: in addition to BytePS, we also init Horovod. Whenever we see a sparse tensor that needs synchronization, we fall back to Horovod's allgather.
>
> This would require that BytePS can be launched by MPI (because Horovod's allgather depends on it). We need a new MPI-based launcher. The good thing is that all of this can be done in pure Python...

Maybe I need the C++ layer implementation, because Horovod is already used.

xiongji commented 4 years ago

As we just discussed, it's a pity, but I will have to stop my evaluation of BytePS for now... We will use Horovod instead.

bobzhuyb commented 4 years ago

@xiongji It's totally okay. You should just choose what works best for you. We are well aware that BytePS does not work well for sparse models for now.

liiitleboy commented 2 years ago

> @xiongji It's totally okay. You should just choose what works best for you. We are well aware that BytePS does not work well for sparse models for now.

Two years have passed. Is this supported now?

ymjiang commented 2 years ago

We have implemented P2P communication using BytePS and it supports our internal sparse models well. It will take some time to release it as open source, as this is not a high priority.

liiitleboy commented 2 years ago

> We have implemented P2P communication using BytePS and it supports our internal sparse models well. It will take some time to release it as open source, as this is not a high priority.

OK, good, thank you. Keep up the good work!

QwertyJack commented 3 weeks ago

Following this thread.