apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

module.Module and CSVIter #8669

Open GSanchis opened 6 years ago

GSanchis commented 6 years ago

Description

I'm trying to use module.Module to build a recommender system. Using NDArrayIter makes memory requirements skyrocket. But I have been unable to use CSVIter for the same purpose. I'm using MXNet under Python. I followed this tutorial to build the first recommender, but I don't seem to be able to move forward.

Minimum reproducible example

import mxnet as mx
import numpy
import time

file="f1.csv"
l,c,v = numpy.loadtxt(file, delimiter=',',dtype='int').T
# Loading the whole file should not be needed, but I need l.max() and c.max(), which I don't seem to be able to get from CSVIter

user = mx.symbol.Variable("user")
user = mx.symbol.Embedding(data=user, input_dim=l.max(), output_dim=10)
movie = mx.symbol.Variable("movie")
movie = mx.symbol.Embedding(data=movie, input_dim=c.max(), output_dim=10)
y_true = mx.symbol.Variable("softmax_label")
nn = mx.symbol.concat(user, movie)
nn = mx.symbol.flatten(nn)
nn = mx.symbol.FullyConnected(data=nn, num_hidden=64)
nn = mx.symbol.Activation(data=nn, act_type='relu')
nn = mx.symbol.FullyConnected(data=nn, num_hidden=1)
y_pred = mx.symbol.LinearRegressionOutput(data=nn, label=y_true)

tritems=int(len(l)*80/100)
X_train=mx.io.NDArrayIter({'user': l[:tritems], 'movie': c[:tritems]}, label=v[:tritems], batch_size=10000)
X_eval=mx.io.NDArrayIter({'user': l[tritems:], 'movie': c[tritems:]}, label=v[tritems:], batch_size=10000)
X_all=mx.io.NDArrayIter({'user': l, 'movie': c}, label=v, batch_size=10000)
model = mx.module.Module(context=mx.cpu(0), data_names=['user', 'movie'], symbol=y_pred)
model.fit(X_train, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='mse', eval_data=X_eval)

Works awesome!

Then I try to replace the mx.io.NDArrayIter with a CSVIter:

f1="tmp1"    # contains first two columns of f1.csv
f2="tmp2"    # contains the third column of f1.csv
CSVIter = mx.io.CSVIter(data_csv=f1, data_shape=(2,), label_csv=f2, label_shape=(1,), batch_size=1000)
model = mx.module.Module(context=mx.cpu(0), data_names=['user', 'movie'], symbol=y_pred)
model.fit(CSVIter, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),))

And I get the following error:

  File "<stdin>", line 1, in <module>
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 460, in fit
    for_training=True, force_rebind=force_rebind)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/module.py", line 400, in bind
    self.data_names, self.label_names, data_shapes, label_shapes)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 71, in _parse_data_desc
    _check_names_match(data_names, data_shapes, 'data', True)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 63, in _check_names_match
    raise ValueError(msg)
ValueError: Data provided by data_shapes don't match names specified by data_names ([DataDesc[data,(1000, 2),<class 'numpy.float32'>,NCHW]] vs. ['user', 'movie'])

What have you tried to solve it?

I have tried reassigning the CSVIter.provide_label and CSVIter.provide_data fields, adding name='data' to every line that assigns nn=, and removing data_names from the mx.module.Module call, all with no luck. I have spent around 6 hours googling and reading through the mx.io API, but I'm currently out of ideas.

Any help is welcome!!

eric-haibin-lin commented 6 years ago

Does specifying data_names in CSVIter help? CSVIter(...., data_name=['user','movie'])

GSanchis commented 6 years ago

Well, I should have said I [think I] also tried that :)

In fact, I just tried two options: data_names and data_name (since in your message you actually mention both), but no luck either.

First data_name:

model = mx.module.Module(context=mx.cpu(0), data_names=['user', 'movie'], symbol=y_pred)
CSVIter = mx.io.CSVIter(data_csv=f1, data_shape=(2,), label_csv=f2, label_shape=(1,), batch_size=1000, data_name=['user','movie'])
model.fit(CSVIter, num_epoch=5, optimizer='adam', optimizer_params=(('learning_rate', 0.001),))

returns

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 460, in fit
    for_training=True, force_rebind=force_rebind)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/module.py", line 400, in bind
    self.data_names, self.label_names, data_shapes, label_shapes)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 71, in _parse_data_desc
    _check_names_match(data_names, data_shapes, 'data', True)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 63, in _check_names_match
    raise ValueError(msg)
ValueError: Data provided by data_shapes don't match names specified by data_names ([DataDesc[['user', 'movie'],(1000, 2),<class 'numpy.float32'>,NCHW]] vs. ['user', 'movie'])

Whereas data_names (which is what the error message seems to suggest, and which is also the valid parameter name for mx.module.Module) returns:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 460, in fit
    for_training=True, force_rebind=force_rebind)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/module.py", line 400, in bind
    self.data_names, self.label_names, data_shapes, label_shapes)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 71, in _parse_data_desc
    _check_names_match(data_names, data_shapes, 'data', True)
  File "/home/german/anaconda3/lib/python3.6/site-packages/mxnet/module/base_module.py", line 63, in _check_names_match
    raise ValueError(msg)
ValueError: Data provided by data_shapes don't match names specified by data_names ([DataDesc[data,(1000, 2),<class 'numpy.float32'>,NCHW]] vs. ['user', 'movie'])

Which is the same error as originally. I.e., the parameter that CSVIter actually recognizes seems to be data_name, but there's still something missing.

Thanks for your help, Eric!!

eric-haibin-lin commented 6 years ago

Oh, did you check out this example? It uses a sparse data format and saves even more memory: https://github.com/apache/incubator-mxnet/blob/master/example/sparse/matrix_factorization.py

GSanchis commented 6 years ago

Sorry, Eric. I had to move to another task, so I didn't have time to look into that example in as much detail as I wanted until now.

I just did.

The problem with this example is that the training data is loaded fully into memory, so if the amount of data is very large (in my case a couple of GBs) I run into high memory requirements just because of loading the data. That's why I was trying to use CSVIter, which I was expecting to iterate through the CSV file (and not load it into memory directly). But if there's any other solution that does not load all the data into memory, that would work too.

eric-haibin-lin commented 6 years ago

I see. MXNet should definitely update the API to allow custom data/label names. As a temporary workaround, you can change the CSVIter.provide_label and CSVIter.provide_data attributes yourself to be the same as the ones produced by NDArrayIter. Would that work for you?
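
Something along these lines (an untested sketch, assuming the f1/f2 files, batch_size=1000 and the 'user'/'movie'/'softmax_label' names from your snippets above, and that the attributes are plain writable fields on the iterator object):

from mxnet.io import DataDesc

data_iter = mx.io.CSVIter(data_csv=f1, data_shape=(2,), label_csv=f2,
                          label_shape=(1,), batch_size=1000)
# Overwrite the descriptors the module sees at bind time so the names line up
# with the symbol's inputs. Note this only renames what bind() sees; each batch
# still arrives as a single (1000, 2) array, so further reshaping may be needed.
data_iter.provide_data = [DataDesc('user', (1000,)), DataDesc('movie', (1000,))]
data_iter.provide_label = [DataDesc('softmax_label', (1000,))]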

GSanchis commented 6 years ago

OK, so with your message, and another test I did, I believe I have found the root cause of the problem... Now I just need to figure out how to solve it :)

l,c,v = numpy.loadtxt(filename, delimiter='\t',dtype='int').T
iter_train = mx.io.NDArrayIter({'user': l, 'item': c}, label=v, batch_size=args.bsize)
print(iter_train.provide_data)

returns

[DataDesc[item,(100,),<class 'numpy.float32'>,NCHW], DataDesc[user,(100,),<class 'numpy.float32'>,NCHW]]

However,

iter_train = mx.io.CSVIter(data_csv=f1, data_shape=(2,), label_csv=f2, label_shape=(1,), batch_size=1000, data_name=['user','item'])
print(iter_train.provide_data)

outputs

[DataDesc[['user', 'item'],(1000, 2),<class 'numpy.float32'>,NCHW]]

In addition, I tried commenting out lines 62-65 in mxnet/module/base_module.py, the ones that raise the error, i.e.:

def _check_names_match(data_names, data_shapes, name, throw):
    """Check that input names matches input data descriptors."""
    actual = [x[0] for x in data_shapes]
    if sorted(data_names) != sorted(actual):
        msg = "Data provided by %s_shapes don't match names specified by %s_names (%s vs. %s)"%(
            name, name, str(data_shapes), str(data_names))
#        if throw:
#            raise ValueError(msg)
#        else:
#            warnings.warn(msg)

and then I got this error, which I believe means that it is not a simple naming problem:

Traceback (most recent call last):
  File "/home/german/trabajo/Sciling/git/recsys-sbcorporation/train2.py", line 172, in <module>
    model = train_model(ctx=mx.cpu(0), net=nn, data_names=('user','item'), train_iter=train_iter, val_iter=train_iter)
  File "/home/german/trabajo/Sciling/git/recsys-sbcorporation/train2.py", line 133, in train_model
    model.fit(train_iter, num_epoch=args.nepochs, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='mse', eval_data=val_iter)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/module/base_module.py", line 460, in fit
    for_training=True, force_rebind=force_rebind)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/module/module.py", line 428, in bind
    state_names=self._state_names)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/module/executor_group.py", line 237, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/usr/local/lib/python3.5/dist-packages/mxnet/module/executor_group.py", line 333, in bind_exec
    shared_group))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/module/executor_group.py", line 598, in _bind_ith_exec
    input_shapes = dict(data_shapes)
TypeError: unhashable type: 'list'

Putting both tests together, I believe the problem is that CSVIter returns a (sort of) Nx2 matrix, whereas NDArrayIter produces a 2xN one (i.e., CSVIter produces the transposed version). However, I believe I need the 2xN version that NDArrayIter produces in order to feed the embeddings... but I might be wrong.
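
One untested idea: instead of renaming CSVIter's single input, keep its default 'data' name and split the two columns inside the symbol (using the same max_user, max_item and factor_size as in the snippet below):

data = mx.symbol.Variable("data")                          # CSVIter's default input, shape (batch, 2)
user = mx.symbol.slice_axis(data, axis=1, begin=0, end=1)  # first column: user ids, shape (batch, 1)
item = mx.symbol.slice_axis(data, axis=1, begin=1, end=2)  # second column: item ids, shape (batch, 1)
user = mx.symbol.Embedding(data=user, input_dim=max_user, output_dim=factor_size)
item = mx.symbol.Embedding(data=item, input_dim=max_item, output_dim=factor_size)
pred = mx.symbol.flatten(mx.symbol.concat(user, item))     # (batch, 2 * factor_size)
# ... FullyConnected / Activation / LinearRegressionOutput as before; the Module
# would then keep the default data_names=('data',)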

Again, and to keep the discussion contextualized, the core of my code currently looks like this:

f1 = "tmp1"; f2 = "tmp2"
iter_train = mx.io.CSVIter(data_csv=f1, data_shape=(2,), label_csv=f2, label_shape=(1,), batch_size=1000, data_name=['user','item'])
user = mx.symbol.Embedding(data=user, input_dim=max_user, output_dim=factor_size)
item = mx.symbol.Embedding(data=item, input_dim=max_item, output_dim=factor_size)
pred = mx.symbol.concat(user, item)
pred = mx.symbol.flatten(pred)
pred = mx.symbol.FullyConnected(data=pred, num_hidden=64)
pred = mx.symbol.Activation(data=pred, act_type='relu')
pred = mx.symbol.FullyConnected(data=pred, num_hidden=1)
pred = mx.symbol.LinearRegressionOutput(data=pred, label=score)
model = mx.module.Module(context=mx.cpu(0), data_names=('user','item'), symbol=pred)
model.fit(iter_train, num_epoch=10, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), eval_metric='mse', eval_data=val_iter)

GSanchis commented 6 years ago

So... diving deep into this, and into some of the python libraries of MXNet, I managed to craft a first version of an iterator which seems to do the job...

class myCSVIter(mx.io.DataIter):
    def __init__(self, data_names, data_shapes, label_names, label_shapes,
                 csvfile, delimiter=',', batch_size=100):
        from mxnet.io import DataDesc
        import csv
        self.delimiter = delimiter
        self.batch_size = batch_size
        self._provide_data = [
            DataDesc('user', (batch_size,), int),
            DataDesc('item', (batch_size,), int)
        ]
        self._provide_label = [DataDesc('softmax_label', (batch_size,), numpy.float32)]
        self.file = open(csvfile, 'r')
        self.csvreader = csv.reader(self.file, delimiter=delimiter)

    def __iter__(self):
        return self

    def reset(self):
        import csv
        self.file.seek(0)
        self.csvreader = csv.reader(self.file, delimiter=self.delimiter)

    def __next__(self):
        return self.next()

    @property
    def provide_data(self):
        return self._provide_data

    @property
    def provide_label(self):
        return self._provide_label

    def next(self):
        l = []
        c = []
        v = []
        try:
            # Build a batch by reading batch_size rows (user, item, score) from the CSV file
            for i in range(self.batch_size):
                row = [int(a) for a in next(self.csvreader)]
                l.append(row[0])
                c.append(row[1])
                v.append(row[2])
            data = [mx.nd.array(l),
                    mx.nd.array(c)]
            label = list([mx.nd.array(v)])
            return mx.io.DataBatch(data=list(data), label=label)
        except StopIteration:
            # TODO: pad and return a partial final batch instead of dropping it
            raise StopIteration

I'm still getting this straight, though... It seems to work properly on a CPU, but I got an error when trying the code on a GPU. I don't have access to that error right now, but I'll continue tomorrow. Also, I still have to handle the case where the amount of data is not a multiple of the batch_size.

GSanchis commented 6 years ago

So... progressing on this, I did get the iterator to pad the last batch, and the GPU problems were solved by specifying the context in the mx.nd.array calls for l, c and v. Code attached at the end.

However, I'm running into an issue with SparseEmbedding, and since I saw that you are the main contributor, I thought I would post this here. Let me know if I should open another issue somewhere else.

The problem seems to be quite random: sometimes (but only sometimes) I run into the following error. The fact that it only appears sometimes does not make it any easier to debug; from what I recall from my C programming days, this kind of behaviour was usually caused by wild pointers going out of bounds and writing into unallocated memory:

[13:23:46] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [13:23:46] src/operator/tensor/indexing_op.cc:61: Check failed: is_valid SparseEmbedding input contains data out of bound

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x17c44c) [0x7f147723444c]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1daed41) [0x7f1478e66d41]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x1ff9155) [0x7f14790b1155]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x23621e2) [0x7f147941a1e2]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x234490d) [0x7f14793fc90d]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2349241) [0x7f1479401241]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f14768bbc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f1489f686ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f1489c9e3dd]

[13:23:46] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [13:23:46] src/engine/./threaded_engine.h:359: [13:23:46] src/operator/tensor/indexing_op.cc:61: Check failed: is_valid SparseEmbedding input contains data out of bound

(same stack trace as above)

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

terminate called after throwing an instance of 'dmlc::Error'
  what():  [13:23:46] src/engine/./threaded_engine.h:359: [13:23:46] src/operator/tensor/indexing_op.cc:61: Check failed: is_valid SparseEmbedding input contains data out of bound

(the same stack trace and fatal-error notice are printed once more)

Let me know if there is something I can do to solve this problem.

I believe the core of the code that leads to this problem would be as follows:

iter_train = myCSVIter(data_names=('user','item'), data_shapes=[(2,)], csvfile='train',
                              label_names='softmax_label', delimiter='\t', batch_size=1000, context=mx.cpu(0)  )
iter_dev = myCSVIter(data_names=('user','item'), data_shapes=[(2,)], csvfile='dev',
                              label_names='softmax_label', delimiter='\t', batch_size=1000, context=mx.cpu(0)  )
user = mx.symbol.Variable("user")
item = mx.symbol.Variable("item")
score = mx.symbol.Variable("softmax_label")
user_weight = mx.symbol.Variable("user_weight", stype="row_sparse")
user = mx.symbol.contrib.SparseEmbedding(data=user, weight=user_weight, input_dim=2000, output_dim=20)
item_weight = mx.symbol.Variable("item_weight", stype="row_sparse")
item = mx.symbol.contrib.SparseEmbedding(data=item, weight=item_weight, input_dim=20000, output_dim=20)
pred = mx.symbol.concat(user, item)
pred = mx.symbol.flatten(pred)
pred = mx.symbol.FullyConnected(data=pred, num_hidden=64)
pred = mx.symbol.Activation(data=pred, act_type='relu')
pred = mx.symbol.FullyConnected(data=pred, num_hidden=1)
pred = mx.symbol.LinearRegressionOutput(data=pred, label=score)
model = mx.module.Module(context=mx.cpu(0), data_names=('user','item'), symbol=pred)
model.fit(iter_train, num_epoch=2, optimizer='adam', optimizer_params=(('learning_rate', 0.001),), 
                       eval_metric='mse', eval_data=iter_dev)
items = list(iter_train.indices[1])
users = [1] * len(items)
user_iter = mx.io.NDArrayIter({'user': users, 'item': items}, batch_size = 1000)
# The problem appears here, when model.predict
pred = model.predict(user_iter).asnumpy().flatten()

The code for myCSVIter is now as follows:

class myCSVIter(mx.io.DataIter):
    """ This class builds a custom CSV iterator that returns the data in the same format
    a NDArrayIter would, when provided two column vectors.

    The code has been built by looking into the NDArrayIter implementation, and builds 
    data batches by iterating with csv.reader over the file. When the iterator is reset,
    the pointer of the file is set to 0 and a new csv.reader iterator is created.

    :param data_names: The data names to be returned.
    :param data_shapes: The data shapes.
    :param label_names: The label names to be returned.
    :param csvfile: The CSV file to read from. Must have three columns (user,item,score).
    :param label_shapes: The label shapes. Defaults to (1,).
    :param delimiter: The delimiter in the CSV file. Defaults to ','.
    :param batch_size: The size of the batches to read. Defaults to 100.
    :param context: The context on which to create the batch NDArrays. Defaults to mx.cpu(0).
    """
    def __init__(self, data_names, data_shapes, label_names, csvfile,
                 label_shapes=(1,), delimiter=',', batch_size=100, context=mx.cpu(0)):
        from mxnet.io import DataDesc
        from mxnet.ndarray import array
        from collections import OrderedDict
        import csv
        self.delimiter=delimiter
        self.batch_size = batch_size
        self.context = context
        self._provide_data = [
            DataDesc('user', (batch_size,), int),
            DataDesc('item', (batch_size,), int)
        ]
        self._provide_label = [DataDesc('softmax_label',(batch_size,),numpy.float32)]
        self.file = open(csvfile,'r')
        self.csvreader = csv.reader(self.file, delimiter=delimiter)
        # Initiate indices to as many empty sets as data_shapes says
        self.indices = [set() for i in range(0,data_shapes[0][0]+1)] # +1 for the label
        for l in self.csvreader:
            for i,c in enumerate([int(x) for x in l]):
                self.indices[i].add(c)
        self.maximum_values = [max(x) for x in self.indices]
        # Ensure that the CSV file is reset to the start for future iteration
        self.file.seek(0)

    def __iter__(self):
        return self

    def reset(self):
        import csv
        self.file.seek(0)
        # TODO: Does this leak the previous csvreader?
        self.csvreader = csv.reader(self.file, delimiter=self.delimiter)

    def __next__(self):
        return self.next()

    @property
    def provide_data(self):
        return self._provide_data

    @property
    def provide_label(self):
        return self._provide_label

    def next(self):
        l=[]; c=[]; v=[]
        try:
            # Build a batch by reading batch_size lines from the CSV file
            for i in range(self.batch_size):
                row = [int(a) for a in next(self.csvreader)]
                l.append(row[0])
                c.append(row[1])
                v.append(row[2])
            # Now build the three different NDArrays required
            data = [mx.nd.array(l, self.context),
                    mx.nd.array(c, self.context)]
            label= list([mx.nd.array(v,self.context)])
            # Return the batch itself
            return mx.io.DataBatch(data=list(data), label=label)
        except StopIteration:
            # If there is some data in the lcv arrays, it's the first time that
            # we reach the end of file, and the data in the lcv arrays needs to
            # be returned. StopIteration will be raised next time the iterator
            # is called, when the lcv arrays are empty.
            data_read = len(l)
            if data_read > 0:
                # There is some data. Pad and return.
                pad = self.batch_size - data_read
                for i in range(pad):
                  l.append(0)
                  c.append(0)
                  v.append(0)
                data = [mx.nd.array(l, self.context),
                        mx.nd.array(c, self.context)]
                label = list([mx.nd.array(v, self.context)])
                return mx.io.DataBatch(data=list(data), label=label, pad=pad)
            else:
                # No data. StopIteration can be raised now.
                raise StopIteration

eric-haibin-lin commented 6 years ago

This error indicates that the input to your sparse_embedding operator contains data greater than or equal to input_dim. Any chance the data from your CSVIter feeds numbers not in the range [0, output_dim-1]? @ZiyueHuang maybe we should improve the error message and show which input data is invalid?
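
For example, a quick sanity check along these lines might help (adjust the file name, delimiter and the two input_dim values to your setup):

import numpy
# Every id fed into SparseEmbedding must be an integer in [0, input_dim - 1]
user_ids, item_ids, _ = numpy.loadtxt('train', delimiter='\t', dtype='int').T
assert user_ids.min() >= 0 and user_ids.max() < 2000    # input_dim of the user embedding
assert item_ids.min() >= 0 and item_ids.max() < 20000   # input_dim of the item embedding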

GSanchis commented 6 years ago

I have been diving quite deeply into the code (I even took a look at the C++ code pointed to by the error), but I haven't been able to find anything that indicates that any input is really out of bounds. Also, the fact that the error only pops up maybe 50% of the time confuses me. I also tried increasing input_dim artificially (e.g., using input_dim+100 instead of the true input_dim), but that didn't work either. Also, the error appears when predicting, so the CSVIter is not involved here:

user_iter = mx.io.NDArrayIter({'user': users, 'item': items}, batch_size = 1000)

The error could have happened earlier, though, and only been triggered here. In fact, I found that the error seems to be raised in the _getdata method of NDArrayIter, where self.data seems to be broken (after some iterations, even a simple print(self.data) triggers the exception).

I'm going to do some more tests, though.
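
One thing I still plan to try, as the error message itself suggests, is forcing the naive (synchronous) engine so the backtrace points at the failing operator:

import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'  # must be set before mxnet is imported
import mxnet as mx
# ... run the failing fit()/predict() here, and unset the variable afterwards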

Btw, you mean range [0, input_dim-1], right?

eric-haibin-lin commented 6 years ago

Oh, right, input_dim - 1. So if you keep feeding the same batch of data, does the error occur?

GSanchis commented 6 years ago

After quite some time stumbling into this problem, apparently at random, I believe I have figured out what the problem was after reading this example, which imports this module:

def max_id(fname):
    mu = 0
    mi = 0
    for line in open(fname):
        tks = line.strip().split('\t')
        if len(tks) != 4:
            continue
        mu = max(mu, int(tks[0]))
        mi = max(mi, int(tks[1]))
    return mu + 1, mi + 1

then:

max_user, max_item = max_id('./ml-100k/u.data')

Finally:

    user = mx.symbol.Embedding(data = user, input_dim = max_user, output_dim = k) 
    item = mx.symbol.Embedding(data = item, input_dim = max_item, output_dim = k)

So if the maximum user id in the data is, e.g., 10, then the input_dim for the Embedding should be 11. Is that correct? I would say this is not very intuitive, especially given that the documentation for the Embedding operator focuses on the word-embedding case; in my case, a "vocabulary" of {1..10} seems to require input_dim=11. Is that also the case for the sparse embeddings? If so, perhaps the documentation could be updated so that future users don't stumble into this. I am willing to help with some text, to the extent that I am able to.
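
To double-check my understanding, this is the minimal experiment I have in mind (just a sketch, with ids 1..10 and an 11-row embedding):

# Embedding looks up rows 0 .. input_dim-1 of the weight matrix, so ids 1..10
# need input_dim=11 (row 0 simply goes unused), not input_dim=10.
ids = mx.nd.array([1, 5, 10])
weight = mx.nd.ones((11, 4))     # 11 rows, 4-dimensional factors
vecs = mx.nd.Embedding(data=ids, weight=weight, input_dim=11, output_dim=4)
print(vecs.shape)                # (3, 4)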