acbull / pyHGT

Code for "Heterogeneous Graph Transformer" (WWW'20), which is based on pytorch_geometric
MIT License

Error with larger batch sizes #13

Closed trinayan closed 4 years ago

trinayan commented 4 years ago

Hi,

I am receiving some odd errors when using larger batch sizes that seem to be tied to the way the code uses the multiprocessing library. I am using all the packages listed in the requirements, and everything works fine up to a batch size of 64, but above that it starts running into issues. For example, I run:

```
python3 train_paper_field.py --data_dir . --model_dir . --domain _NN --conv_name hgt --n_epoch 1 --n_batch 256
```

After some time, it fails with:

```
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 463, in _handle_results
    task = get()
  File "/usr/local/lib/python3.6/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/trinayan/gnn_bench/pyHGT/env/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/local/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/local/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/local/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata
```

I am not sure why this is happening. I would be glad if you could take a look. Thank you.

acbull commented 4 years ago

Hi:

When you use the multiprocessing package, an error raised inside the executed function (in our case, `node_classification_sample`) can surface as an error like this at the point where the results are collected.
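For context, here is a minimal sketch of how this happens with Python's `multiprocessing.Pool`. The `sample` function below is a hypothetical stand-in for `node_classification_sample`, not the repo's code:

```python
from multiprocessing import Pool

# Hypothetical stand-in for node_classification_sample: fails for some inputs.
def sample(seed):
    if seed % 2:
        raise ValueError("sampling failed for seed %d" % seed)
    return seed

if __name__ == "__main__":
    with Pool(2) as pool:
        jobs = [pool.apply_async(sample, (i,)) for i in range(4)]
        for job in jobs:
            try:
                # The worker's exception only surfaces here, when the result
                # is fetched -- not at the point where the task was submitted.
                job.get()
            except ValueError as exc:
                print("worker raised:", exc)
```

So a traceback that points into the pool's result-handling machinery often means the real failure happened earlier, inside the worker function.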

My guess is the following: at https://github.com/acbull/pyHGT/blob/b9c555a3b1f8afdd77139b779d2a0aaf14d62ade/OAG/train_paper_field.py#L95, we sample batch_size pairs from all the training data. Is it possible that for your processed graph, the number of available pairs (e.g., in the validation set) is smaller than your batch_size?
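This failure mode is easy to reproduce in isolation. A minimal sketch, using the standard-library `random` module in place of the repo's numpy sampling call, assuming a split with fewer pairs than the batch size:

```python
import random

pairs = list(range(100))   # e.g. a validation split with only 100 labeled pairs
batch_size = 256

# Sampling without replacement requires batch_size <= len(pairs);
# otherwise Python raises ValueError ("Sample larger than population ...").
try:
    random.sample(pairs, batch_size)
except ValueError as exc:
    print("sampling failed:", exc)

# Sampling with replacement succeeds regardless of the split size.
batch = random.choices(pairs, k=batch_size)
print(len(batch))   # 256
```

Inside a multiprocessing worker, that `ValueError` would not show up at the call site; it would bubble up through the pool's result handling instead.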

acbull commented 4 years ago

In that case, you can simply delete `replace = False`.

trinayan commented 4 years ago

Hi. Thanks for your reply. I tried your suggestion and it runs into the same issue. I even tried setting n_pool to 1, which also does not help. I am using the NN graph, by the way. I am not sure whether you are able to reproduce this problem on your end.

acbull commented 4 years ago

I'll take a look at that.

trinayan commented 4 years ago

Just wondering if you were able to reproduce the issue. Should I try other graphs?

acbull commented 4 years ago

Hi:

I've tried the NN graph with the command you sent, and it runs fine.

snip

I'm not sure whether you're using the latest code. Could you re-pull the repository and try again?

trinayan commented 4 years ago

Thank you.