NVIDIA / sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification

Input file format #40

Closed dwinkler1 closed 5 years ago

dwinkler1 commented 6 years ago

Thank you for the awesome model. I would like to train it on a number of longer text documents and was wondering in what format I should pass the texts to the script. Can I just put them all in a single text file and pass that to main.py? Or would it be better to put them in a JSON or CSV file with one entry per document even though I do not have labels? Sorry, I am a bit confused, since the model is unsupervised but the datasets still have labels.

raulpuric commented 5 years ago

"Or would it be better to put them in a Json or csv file with one entry per file even though I do not have labels?" Yeah we currently require a line-by-line data format. You don't need to have a labels column (if you don't have one it will be automatically interpretted that the label is -1). You will however need a header for the text column (the default text column name is sentence).

You should just be able to plug in that csv and watch it train.

You may have to edit the --num_shards and --split arguments to reflect your dataset size and whether you want the data split into train/val/test sets.
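
Putting that together, an invocation along these lines should work (my_data.csv is a placeholder file name, and the shard/split values are just examples):

python main.py --data my_data.csv --text_key sentence --num_shards 12 --split 10,1,1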

dwinkler1 commented 5 years ago

Thank you very much for your quick response! The model is now working with a test dataset. However, when I run it with lazy loading I get the following error:

python main.py --data all.csv --text_key obs --lazy
configuring data
Traceback (most recent call last):
  File "main.py", line 145, in <module>
    train_data, val_data, test_data = data_config.apply(args)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/configure_data.py", line 16, in apply
    return make_loaders(opt)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/configure_data.py", line 61, in make_loaders
    train = data_utils.make_dataset(**data_set_args)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/data_utils/__init__.py", line 82, in make_dataset
    num_shards=num_shards)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/data_utils/__init__.py", line 62, in post_process_ds
    ds = unsupervised_dataset(ds, seq_length, persist_state=persist_state, num_shards=shards)
  File "/home/imsm/Documents/daniel_tmp/sentimentNvidia/sentiment-discovery-master/data_utils/datasets.py", line 343, in __init__
    self.str_ends = list(accumulate(ds.lens))
TypeError: 'int' object is not iterable

Any idea what I am doing wrong?

raulpuric commented 5 years ago

A few questions to help me better understand the situation:

1. How large is your dataset?
2. Can you try with --num_shards 12 --split 10,1,1?
3. Can you print out ds.lens right before that line gets called and share the printout with me? (A sketch of what I mean follows below.)
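
Something like this, added in data_utils/datasets.py just above the line from your traceback:

    # just before the failing call in unsupervised_dataset.__init__
    print(ds.lens)  # expected: an iterable of per-shard lengths
    self.str_ends = list(accumulate(ds.lens))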

dwinkler1 commented 5 years ago
  1. I am still trying this with a small set of 53 texts.
  2. Yes, this works perfectly! This just splits the data into 12 parts and assigns 10 to training, 1 to validation, and 1 to testing, correct?
  3. This is the output now, which seems fine: (195712, 173717, 184668, 178144, 204400, 168369, 178426, 194301, 197089, 203105, 188377, 181090, 175146, 170020, 207781, 184943, 172659, 184966, 182678, 159603, 175119, 159343, 183476, 197196, 195795, 220643, 201818, 172217, 171050, 147559, 172960, 222118, 215824, 164828, 164853, 222051, 193248, 196606, 182188, 217619, 199771, 192167, 183582, 161843) (179303, 188947, 167750, 169158) (214121, 187194, 169937, 172823, 215781)

Thank you for your help!!

raulpuric commented 5 years ago

Yeah, so the problem was that the dataset was being split into 102 shards by default, so with only 53 texts some shards ended up with no data.
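
For anyone who lands here with the same traceback: it shows ds.lens arriving as a bare int instead of a list of per-shard lengths, which is exactly what makes itertools.accumulate fail. A minimal sketch of the failing and working cases:

from itertools import accumulate

lens = 195712            # a single int where a list of shard lengths was expected
list(accumulate(lens))   # TypeError: 'int' object is not iterable

lens = [195712, 173717, 184668]   # a proper list of per-shard lengths
list(accumulate(lens))            # [195712, 369429, 554097]

Setting --num_shards and --split to match the dataset size (here, 12 shards for 53 texts) avoids the empty shards.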