Closed chhzh123 closed 2 years ago
@aksnzhy
@chhzh123 Thanks for reporting this issue. You are right, we assume that on each machine we have a local server process for now. We should support the case that they are on different machines.
So are there any workarounds that I could try to avoid this issue?
I think we have a standalone mode for KVServer and KVClient, but we haven't tested it yet. Why would you prefer to deploy them separately?
We have large graph embeddings which need to be stored on several machines. The client needs to fetch embeddings from these machines to do computation.
Can we use these machines for both storing embeddings and training? You can tune the server count and trainer count for the best performance.
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
This issue is closed due to lack of activity. Feel free to reopen it if you still have questions.
Seems like the distributed KVStore module has not been thoroughly tested. I am able to start KVServer and KVClient on the same machine to do data communication. However, if I start the KVServer and the KVClient on different machines, the data cannot be correctly initialized.
To Reproduce
Steps to reproduce the behavior:
Use the example function (test_kv_store) in test_new_kvstore.py, and configure the server and client on different machines.
According to the traceback, the client connects to the server successfully. However, the client cannot create the data_0 tensor, because it looks up shared memory on the current machine while this tensor should have been initialized on the other machine.
Expected behavior
If the server and the client are on the same machine, shared memory can be used for communication. Otherwise, the client only needs to create a placeholder for the corresponding tensor, and there is no need to check shared memory.
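To make the expected behavior concrete, here is a minimal sketch of the branching I have in mind. It is not the library's code: ClientTensor, init_client_tensor, the IP-comparison check, and the plain dict standing in for the machine's shared memory are all hypothetical names and simplifications made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ClientTensor:
    """Hypothetical client-side handle for a tensor owned by a KV server."""
    name: str
    shape: Tuple[int, ...]
    backing: str                  # "shared_memory" or "placeholder"
    data: Optional[list] = None   # only set when backed by local shared memory


def init_client_tensor(name: str, shape: Tuple[int, ...], server_ip: str,
                       client_ip: str, local_shared_mem: Dict[str, list]) -> ClientTensor:
    """Choose the backing for a tensor based on where its server lives.

    local_shared_mem is a plain dict standing in for this machine's shared
    memory (an assumption made to keep the sketch self-contained).
    """
    if server_ip == client_ip:
        # Same machine: the server should already have created the tensor here.
        if name not in local_shared_mem:
            raise KeyError(f"{name} not found in local shared memory")
        return ClientTensor(name, shape, "shared_memory", local_shared_mem[name])
    # Different machine: skip the shared-memory lookup and keep metadata only;
    # the data would be fetched from the server over the network.
    return ClientTensor(name, shape, "placeholder")


# Server and client on the same machine: attach to the existing tensor.
shared = {"data_0": [0.0] * 6}
local = init_client_tensor("data_0", (2, 3), "10.0.0.1", "10.0.0.1", shared)

# Server on another machine: no local lookup, so an empty store is not an error.
remote = init_client_tensor("data_0", (2, 3), "10.0.0.2", "10.0.0.1", {})
print(local.backing, remote.backing)
```

With this kind of check, the remote case never touches the local shared memory, which is the lookup that fails in the traceback.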
Environment
Installed via (conda, pip, source): pip