in training sm_cnn, ValueError: could not convert string to float: '<pad>'

liudonglei commented 6 years ago

    $ python train.py --mode static --gpu 1
    Note: You are using GPU for training
    Dataset TREC Mode static
    VOCAB num 13
    LABEL.target_class: 13
    LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
    Train instance 53417
    Dev instance 1148
    Test instance 1517
    Shift model to GPU
    Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
    Traceback (most recent call last):
      File "train.py", line 147, in <module>
        for batch_idx, batch in enumerate(train_iter):
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
        self.train)
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
        setattr(self, name, field.process(batch, device=device, train=train))
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
        tensor = self.numericalize(padded, device=device, train=train)
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
        arr = self.postprocessing(arr, None, train)
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
        x = pipe.call(x, *args)
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
        return [self.convert_token(tok, *args) for tok in x]
      File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
        return [self.convert_token(tok, *args) for tok in x]
      File "train.py", line 62, in <lambda>
        postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
      File "train.py", line 62, in <listcomp>
        postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
    ValueError: could not convert string to float: '<pad>'

liudonglei commented 5 years ago

    (castor) [ldl@402 sm_cnn 15:15:35] $ python train.py --mode static --no_cuda
    Dataset TREC Mode static
    VOCAB num 13
    LABEL.target_class: 13
    LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
    Train instance 53417
    Dev instance 1148
    Test instance 1517
    Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
    Traceback (most recent call last):
      File "train.py", line 147, in <module>
        for batch_idx, batch in enumerate(train_iter):
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
        self.train)
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
        setattr(self, name, field.process(batch, device=device, train=train))
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
        tensor = self.numericalize(padded, device=device, train=train)
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
        arr = self.postprocessing(arr, None, train)
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
        x = pipe.call(x, *args)
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
        return [self.convert_token(tok, *args) for tok in x]
      File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
        return [self.convert_token(tok, *args) for tok in x]
      File "train.py", line 62, in <lambda>
        postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
      File "train.py", line 62, in <listcomp>
        postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
    ValueError: could not convert string to float: '<pad>'
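The last frames of the traceback can be reproduced without torchtext. This is only a minimal sketch: it assumes the field was padded with the string '<pad>' before the postprocessing lambda from train.py line 62 ran. The helper name `to_floats` and the sample values are hypothetical.

```python
def to_floats(arr):
    # Same conversion as the postprocessing lambda on train.py line 62.
    return [float(y) for y in arr]


# Hypothetical padded example: two numeric strings followed by a pad token.
padded = ["0.25", "0.5", "<pad>"]

try:
    to_floats(padded)
    error_message = None
except ValueError as exc:
    # float() cannot parse the pad token, so the whole batch conversion fails.
    error_message = str(exc)

print(error_message)
```

If this reading is right, the conversion itself is fine for numeric strings; the failure only appears once a non-numeric token reaches it.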

Impavidity commented 5 years ago

Hey @liudonglei, as I understand it, you are using your own dataset, right? Can you post your dataset format in this thread? That would make it easier for me to understand the issue.

liudonglei commented 5 years ago

@Impavidity Not my own dataset. I just tried the sm_cnn model on the TrecQA dataset in your Castor-data repo. All my steps follow Castor/README.md and Castor/sm_cnn/README.md.

SawanKumar28 commented 5 years ago

Hi @liudonglei, were you able to resolve this issue? I am facing the same issue.

liudonglei commented 5 years ago

> Hi @liudonglei, were you able to resolve this issue? I am facing the same issue.

Sorry, I couldn't. I am unfamiliar with the torchtext package this repo uses.

liudonglei commented 5 years ago

@rosequ @SawanKumar28 Hi, today I tried this repo again and fixed the problem. It comes from the way trec_dataset.py uses torchtext.data.TabularDataset. I don't know why; it may be some bug in Python's class inheritance. After half a day of debugging, I traced it to trec_dataset.py and borrowed similar code from the blog post http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext to make the repo work.

You can just replace trec_dataset.py with the code below:

----the right trec_dataset.py file ----

    from torchtext import data


    class TrecDataset:
        dirname = 'data'

        @classmethod
        def splits(cls, question_id, question_field, answer_field, external_field, label_field):
            # (column name, Field) pairs, listed in the order the columns appear in the TSV files
            tv_datafields = [('qid', question_id), ('label', label_field), ('question', question_field),
                             ('answer', answer_field), ('ext_feat', external_field)]

            train, dev, test = data.TabularDataset.splits(
                path="data",  # the root directory where the data lies
                #train='train.csv', validation="valid.csv",
                train='trecqa.train.tsv', validation='trecqa.dev.tsv', test='trecqa.test.tsv',
                #train='ttt.csv', validation='ttt.csv', test='ttt.csv',
                format='tsv',
                #skip_header=True,  # if your file has a header row, pass this so it doesn't get processed as data
                fields=tv_datafields)
            return train, dev, test
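When `fields` is given as a list of (name, Field) pairs, TabularDataset assigns columns to fields by position, so the order of `tv_datafields` has to match the column order in the TSV files. The sketch below illustrates that positional mapping using only the standard library. The sample row, the `parse_rows` helper, and the assumption that `ext_feat` holds space-separated numbers are all hypothetical and not taken from the Castor data.

```python
import csv
import io

# Field names in the same order as tv_datafields above.
FIELD_NAMES = ["qid", "label", "question", "answer", "ext_feat"]

# Hypothetical TSV row; the real trecqa.*.tsv contents may differ.
sample_tsv = "1\t0\twho wrote hamlet\tshakespeare wrote hamlet\t0.5 0.25 0.1 0.3\n"


def parse_rows(text, field_names):
    """Map each tab-separated column to a field name by position."""
    rows = []
    for columns in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(columns) != len(field_names):
            raise ValueError(
                "expected %d columns, got %d" % (len(field_names), len(columns))
            )
        rows.append(dict(zip(field_names, columns)))
    return rows


rows = parse_rows(sample_tsv, FIELD_NAMES)

# With the columns lined up, the float conversion sees only numeric strings.
ext_feat = [float(x) for x in rows[0]["ext_feat"].split()]
print(rows[0]["label"], ext_feat)
```

If the field list were in a different order than the file's columns, a text column would land in a numeric field, which is one plausible way for non-numeric values to reach a float conversion.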

castorini / castor

in training sm_cnn, ValueError: could not convert string to float: '<pad>' #142