gluon.utils.split_and_load(even_split=False) fails if the number of contexts exceeds the number of samples in the batch

Description

Sometimes it is hard to predict how much data is left in the DataLoader by the time the last batch comes. In multi-GPU training with last_batch='keep', the number of items left in the last batch can be smaller than the number of GPUs. In that case gluon.utils.split_and_load raises ValueError: Too many slices for data with shape ....

It would be great if this worked transparently. I would expect that when even_split=False is passed to split_and_load, no exception is raised: the data should be distributed so that some of the resulting arrays are empty, and the forward and backward passes on those empty arrays are then silently skipped.
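Until something like that exists, a user-side workaround might be to hand split_and_load only as many contexts as the batch has samples. The sketch below is not the transparent behavior requested above. `contexts_for_batch` is a hypothetical helper, not part of MXNet, and it assumes the batch axis is 0. Plain strings stand in for the context objects so the snippet runs without MXNet installed.

```python
def contexts_for_batch(num_samples, contexts):
    """Return the leading contexts that can each receive at least one sample.

    Hypothetical helper (not an MXNet API): trims the context list so that
    the number of slices never exceeds the number of samples in the batch.
    """
    if num_samples < 1:
        raise ValueError("batch must contain at least one sample")
    return list(contexts[:min(num_samples, len(contexts))])


# Strings stand in for mx.cpu(0) and mx.cpu(1) in this illustration.
contexts = ["cpu(0)", "cpu(1)"]

full_batch = contexts_for_batch(2, contexts)   # both contexts are used
last_batch = contexts_for_batch(1, contexts)   # only the first context is used
print(full_batch, last_batch)
```

In a training loop, the result of `contexts_for_batch(data.shape[0], context)` would be passed as the context list to `utils.split_and_load`. Devices left out of a batch receive no fresh gradient, so whether `trainer.step` accepts that in a given MXNet version would need to be checked separately.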

Environment info (Required)

----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build        : ('default', 'Jan 16 2018 12:04:33')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.0
Directory    : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/sssokolo/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Version      : 1.5.0
Directory    : /Users/sssokolo/anaconda3/lib/python3.6/site-packages/mxnet
Commit Hash   : fd34dc5f847192dfd522555afdf13be1eb67b72b
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 8c859074eea0
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0299 sec, LOAD: 0.6207 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0008 sec, LOAD: 0.1785 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0008 sec, LOAD: 0.1612 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0007 sec, LOAD: 0.1032 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0007 sec, LOAD: 0.4562 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0006 sec, LOAD: 0.0634 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:

Traceback (most recent call last):
  File "/Volumes/Unix/workspace/exception_small_batch_to_split/main.py", line 25, in <module>
    data = utils.split_and_load(data, context, even_split=False)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 116, in split_and_load
    slices = split_data(data, len(ctx_list), batch_axis, even_split)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet/gluon/utils.py", line 69, in split_data
    "num_slice=%d and batch_axis=%d."%(str(data.shape), num_slice, batch_axis))
ValueError: Too many slices for data with shape (1, 5). Arguments are num_slice=2 and batch_axis=0.

Minimum reproducible example

A regular minimal multi-context training loop is enough:

import mxnet as mx
from mxnet import nd, gluon, autograd
from mxnet.gluon import utils, Trainer
from mxnet.gluon.data import ArrayDataset, DataLoader
from mxnet.gluon.loss import SoftmaxCrossEntropyLoss

context = [mx.cpu(0), mx.cpu(1)]
datasize = 3
batch_size_per_context = 1

data = nd.random.uniform(-1, 1, shape=(datasize, 5))
label = nd.random.uniform(-1, 1, shape=(datasize, 1))

dataset = ArrayDataset(data, label)
dataloader = DataLoader(dataset,
                        batch_size=len(context) * batch_size_per_context,
                        last_batch='keep')

net = gluon.nn.Dense(units=2)
net.initialize(ctx=context)

loss_fn = SoftmaxCrossEntropyLoss()
trainer = Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

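for (data, label) in dataloader:
    # The last batch holds 1 sample but there are 2 contexts,
    # so this call raises the ValueError shown above.
    data = utils.split_and_load(data, context, even_split=False)
    label = utils.split_and_load(label, context, even_split=False)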

    losses = []

    for d, l in zip(data, label):
        with autograd.record():
            out = net(d)
            losses.append(loss_fn(out, l))

    for loss in losses:
        loss.backward()

    trainer.step(1)
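
The arithmetic behind the failure can be checked without MXNet. The sketch below only mirrors the effect of last_batch='keep' (a smaller final batch is kept rather than dropped). `batch_sizes` is a hypothetical helper, not DataLoader code, and the numbers are taken from the example above.

```python
def batch_sizes(datasize, batch_size):
    """Sizes of the batches produced when a smaller final batch is kept."""
    full, remainder = divmod(datasize, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes


num_contexts = 2
batch_size_per_context = 1

# datasize = 3 and batch_size = 2, as in the example above.
sizes = batch_sizes(3, num_contexts * batch_size_per_context)

# Batches with fewer samples than contexts are the ones that cannot be split.
too_small = [s for s in sizes if s < num_contexts]
print(sizes, too_small)
```

With three samples and a batch size of two, the final batch holds a single sample, which is fewer than the two contexts it is supposed to be split across.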
