PlaytikaOSS / tft-torch

A Python library that implements "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting"
MIT License
109 stars · 17 forks

Empty pickle data after preparing train datasets #9

Closed landkwon94 closed 1 year ago

landkwon94 commented 1 year ago

Hello Sir, first of all, many thanks for your contribution of the TFT GitHub code!

I followed your tutorials in this link : https://playtikaoss.github.io/tft-torch/build/html/index.html

And I found that preparing the training datasets took a very long time (https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html)

So, I reduced train.csv (by randomly picking rows) to see the train/test results more quickly.

But after training TFT again, I got empty pickle datasets:

The terminal output is as follows:


(venv_tft) [ryoungseob@bess23 00_pilot_study]# python train_tft.py
['data_sets', 'feature_map', 'scalers', 'categorical_cardinalities']
{'train': {}, 'validation': {}, 'test': {}}
=======
train
=======
=======
validation
=======
=======
test
=======
/usr/local/anaconda3/2021.05/envs/venv_tft/lib/python3.7/site-packages/torch/cuda/__init__.py:497: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Traceback (most recent call last):
  File "train_tft.py", line 136, in <module>
    train_set,train_loader,train_serial_loader                = get_set_and_loaders(data['data_sets']['train'], shuffled_loader_config, serial_loader_config, ignore_keys=meta_keys)
  File "train_tft.py", line 126, in get_set_and_loaders
    loader = torch.utils.data.DataLoader(dataset,**shuffled_loader_config)
  File "/usr/local/anaconda3/2021.05/envs/venv_tft/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 344, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/usr/local/anaconda3/2021.05/envs/venv_tft/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 106, in __init__
    if not isinstance(self.num_samples, int) or self.num_samples <= 0:
  File "/usr/local/anaconda3/2021.05/envs/venv_tft/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 114, in num_samples
    return len(self.data_source)
  File "train_tft.py", line 112, in __len__
    return getattr(self, self.keys_list[0]).shape[0]
IndexError: list index out of range
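
For what it's worth, the same failure pattern reproduces minimally (a hypothetical class of my own that mirrors the traceback, not the library's actual Dataset class):

```python
# Hypothetical minimal reproduction of the failure pattern in the traceback
# (not tft-torch's actual Dataset class).
class DictDataset:
    def __init__(self, data: dict):
        self.keys_list = list(data.keys())
        for key, value in data.items():
            setattr(self, key, value)

    def __len__(self):
        # With an empty dict, keys_list is [] and indexing [0] raises IndexError
        return getattr(self, self.keys_list[0]).shape[0]

try:
    len(DictDataset({}))  # mirrors data['data_sets']['train'] == {}
except IndexError as err:
    print('IndexError:', err)  # → IndexError: list index out of range
```

So the IndexError seems to be a downstream symptom of `data['data_sets']` containing empty dictionaries.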

May I get some advice about this error?

Thank you so much Sir!

Dvirbeno commented 1 year ago

Hi, and thanks for trying this out!

First, note that this is just an example of how data can be generated, and how the input data (for the next stage) should be structured. Here, we specifically chose to work on one of the datasets mentioned in the paper.

Running this specific example shouldn't take too long and should only be done once.

Until now, I hadn't run into the problem you're raising (probably because I never tried to modify the input data file :)).

You can walk through the example again, and check what you're left with when you get to the Splitting Data stage (https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html#Splitting-Data) - what does data_df contain?

Update if there are any findings.
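
If it helps, here is a quick inspection sketch for that stage (`summarize` is my own hypothetical helper, not part of tft-torch; adjust the demo columns to your data_df):

```python
import pandas as pd

# Hypothetical inspection helper for data_df at the Splitting Data stage
# (my own sketch, not part of tft-torch).
def summarize(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),       # column types
        'null_frac': df.isna().mean(),        # fraction of missing values
        'n_unique': df.nunique(),             # distinct non-null values
    })

# Tiny demo with made-up columns
demo = pd.DataFrame({'unit_sales': [3.0, None, None], 'store_nbr': [1, 1, 2]})
print(summarize(demo))
```

A high `null_frac` on key columns, or an unexpected `Unnamed` index column, would be the first thing to look for.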

landkwon94 commented 1 year ago

Hello Sir,

Thank you so much for the fast reply!!

First, the code below takes a very long time on my PC (around a week, even though my memory and CPU are not small). Does it take similarly long on your PC?

for col in tqdm(feature_cols):
    if col in categorical_attrs:
        le = scalers['categorical'][col]
        le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
        data_df[col] = data_df[col].apply(lambda x: le_dict.get(x, max(le.transform(le.classes_)) + 1))
        data_df[col] = data_df[col].astype(np.int32)
    else:
        data_df[col] = scalers['numeric'][col].transform(data_df[col].values.reshape(-1, 1)).squeeze()
        data_df[col] = data_df[col].astype(np.float32)
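
Incidentally, part of the slowness may come from re-evaluating `le.transform(le.classes_)` inside the per-row lambda; hoisting the mapping out of the loop and using `Series.map` should be much faster. A hypothetical self-contained sketch of the per-column encoding (`encode_categorical` is my own helper name):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical faster variant of the per-column categorical encoding:
# precompute the label->code mapping once instead of calling
# le.transform(...) inside a per-row lambda.
def encode_categorical(series: pd.Series, le: LabelEncoder) -> pd.Series:
    codes = le.transform(le.classes_)
    le_dict = dict(zip(le.classes_, codes))
    unseen = codes.max() + 1          # bucket for labels unseen at fit time
    return series.map(le_dict).fillna(unseen).astype(np.int32)

# Tiny demo
le = LabelEncoder().fit(['a', 'b', 'c'])
s = pd.Series(['a', 'c', 'z'])        # 'z' was never seen by the encoder
print(encode_categorical(s, le).tolist())  # → [0, 2, 3]
```

The behavior matches the original loop (unseen labels get one code past the largest fitted code), but the mapping cost is paid once per column rather than once per row.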

Second, I am trying to follow your comments! I ran again the code with the original data (not modified one) :)

Third, when I reduced train.csv and ran the code, I saved data_df to a CSV file. Below are sample rows of data_df (the modified version).

[screenshot: sample rows of the modified data_df]

I will wait for your reply while re-running the code with the original dataset (not the modified version).

Super-thanks for your time and contributions!! 👍

Dvirbeno commented 1 year ago

Hi again,

> First, the code below takes a very long time on my PC (around a week, even though my memory and CPU are not small). Does it take similarly long on your PC?

No, it takes no more than a few minutes. Definitely not a week.

> Third, when I reduced train.csv and ran the code, I saved data_df to a CSV file. Below are sample rows of data_df (the modified version).

It seems like something got messed up in the resulting dataframe (which explains why you eventually end up with empty dictionaries). From the CSV sheet, it's hard to tell what the Unnamed column signifies. Also, note that most of the unit_sales column is empty/null, so the log_sales column turns out to be uninformative.
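
One possible explanation (my guess, not verified against the preprocessing code): if the preprocessing builds fixed-length contiguous time windows per (store, item) series, randomly dropping rows can leave no valid windows at all, which would yield exactly those empty dictionaries. A hypothetical illustration with made-up dates:

```python
import pandas as pd

# Hypothetical illustration: count contiguous daily windows of length `win`
# in a date index, before and after randomly thinning the rows.
def n_contiguous_windows(dates: pd.DatetimeIndex, win: int) -> int:
    dates = dates.sort_values()
    count = 0
    for i in range(len(dates) - win + 1):
        # a valid window spans exactly win consecutive days
        if (dates[i + win - 1] - dates[i]).days == win - 1:
            count += 1
    return count

full = pd.date_range('2017-01-01', periods=30, freq='D')
thinned = full[::3]                      # keep every third day
print(n_contiguous_windows(full, 5))     # → 26
print(n_contiguous_windows(thinned, 5))  # → 0
```

If that is what's happening, subsampling whole (store, item) series rather than individual rows should keep the windows intact.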

landkwon94 commented 1 year ago

Hello Sir! Thank you so much for your reply :)

I have tried lots of experiments recently, but I still get similar errors.

I am now a PhD student in South Korea, doing ecological sensing research.

My purpose in using TFT is to forecast vegetation health by merging multiple variables such as weather, temperature, soil moisture, etc.

So I wanted to format my own datasets the same way as your input training datasets. That's why I wanted to preprocess them with your methods :)

May I ask what your output data looks like after running this code? https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html

I really just want to check how data_path = '.../data/favorita/data.pickle' in the training process (https://playtikaoss.github.io/tft-torch/build/html/tutorials/TrainingExample.html#) is formatted!

If I can see the format of the training data pickle file, I think I can proceed with your code!

This is my email address!

twinsben94@snu.ac.kr

I really appreciate your contributions to the TFT code :) I will wait for your reply! Have a nice weekend 👍

Sincerely, Ryoungseob

Dvirbeno commented 1 year ago

Sure.

In [1]: import pickle

In [2]: with open('data.pickle','rb') as fp:
   ...:     data = pickle.load(fp)

In [3]: data.keys()
Out[3]: dict_keys(['data_sets', 'feature_map', 'scalers', 'categorical_cardinalities'])

In [4]: type(data['data_sets'])
Out[4]: dict

In [5]: data['data_sets'].keys()
Out[5]: dict_keys(['train', 'validation', 'test'])

In [6]: data['data_sets']['train'].keys()
Out[6]: dict_keys(['time_index', 'combination_id', 'static_feats_numeric', 'static_feats_categorical', 'historical_ts_numeric', 'historical_ts_categorical', 'future_ts_numeric', 'future_ts_categorical', 'target'])

In [7]: for subset in data['data_sets']:
   ...:     print(f"Subset: {subset}")
   ...:     print('='*20)
   ...:
   ...:     for key in data['data_sets'][subset]:
   ...:         print(key)
   ...:         arr = data['data_sets'][subset][key]
   ...:         print(f"type: {type(arr)} Shape: {arr.shape} Dtype {arr.dtype}")
   ...:
Subset: train
====================
time_index
type: <class 'numpy.ndarray'> Shape: (11532481,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (11532481,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (11532481, 30) Dtype float32
Subset: validation
====================
time_index
type: <class 'numpy.ndarray'> Shape: (120833,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (120833,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (120833, 30) Dtype float32
Subset: test
====================
time_index
type: <class 'numpy.ndarray'> Shape: (3454260,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (3454260,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (3454260, 30) Dtype float32

In [8]: data['feature_map']  #Dict of lists
Out[8]:
{'static_feats_numeric': [],
 'static_feats_categorical': ['store_nbr',
  'item_nbr',
  'city',
  'state',
  'store_type',
  'store_cluster',
  'item_family',
  'item_class',
  'perishable'],
 'historical_ts_numeric': ['log_sales',
  'day_of_month',
  'transactions',
  'oil_price'],
 'historical_ts_categorical': ['onpromotion',
  'open',
  'day_of_week',
  'month',
  'national_holiday',
  'regional_holiday',
  'local_holiday'],
 'future_ts_numeric': ['day_of_month'],
 'future_ts_categorical': ['onpromotion',
  'open',
  'day_of_week',
  'month',
  'national_holiday',
  'regional_holiday',
  'local_holiday']}

In [9]: data['scalers']
Out[9]:
{'numeric': {'log_sales': StandardScaler(),
  'day_of_month': MinMaxScaler(),
  'transactions': QuantileTransformer(n_quantiles=256),
  'oil_price': QuantileTransformer(n_quantiles=256)},
 'categorical': {'store_nbr': LabelEncoder(),
  'item_nbr': LabelEncoder(),
  'onpromotion': LabelEncoder(),
  'open': LabelEncoder(),
  'day_of_week': LabelEncoder(),
  'month': LabelEncoder(),
  'city': LabelEncoder(),
  'state': LabelEncoder(),
  'store_type': LabelEncoder(),
  'store_cluster': LabelEncoder(),
  'item_family': LabelEncoder(),
  'item_class': LabelEncoder(),
  'perishable': LabelEncoder(),
  'national_holiday': LabelEncoder(),
  'regional_holiday': LabelEncoder(),
  'local_holiday': LabelEncoder()}}

In [10]: data['categorical_cardinalities']
Out[10]:
{'store_nbr': 53,
 'item_nbr': 3626,
 'onpromotion': 2,
 'open': 2,
 'day_of_week': 7,
 'month': 12,
 'city': 22,
 'state': 16,
 'store_type': 5,
 'store_cluster': 17,
 'item_family': 32,
 'item_class': 319,
 'perishable': 2,
 'national_holiday': 71,
 'regional_holiday': 5,
 'local_holiday': 27}

Does that help?
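
If it helps further, here is a hypothetical minimal sketch of packing your own arrays into that same structure (the key names follow the dump above; the shapes, 100 samples with 90 historical and 30 future steps, and the 'site_...' IDs are purely illustrative, and the zero arrays stand in for your real features):

```python
import pickle
import numpy as np

# Hypothetical sketch: pack your own arrays into the same structure as
# the tutorial's data.pickle. Shapes are illustrative placeholders.
n, hist, fut = 100, 90, 30
train = {
    'time_index': np.arange(n).astype(object),
    'combination_id': np.array(['site_%05d' % i for i in range(n)], dtype='<U10'),
    'static_feats_numeric': np.zeros((n, 0), dtype=np.float32),
    'static_feats_categorical': np.zeros((n, 9), dtype=np.int32),
    'historical_ts_numeric': np.zeros((n, hist, 4), dtype=np.float32),
    'historical_ts_categorical': np.zeros((n, hist, 7), dtype=np.int32),
    'future_ts_numeric': np.zeros((n, fut, 1), dtype=np.float32),
    'future_ts_categorical': np.zeros((n, fut, 7), dtype=np.int32),
    'target': np.zeros((n, fut), dtype=np.float32),
}
data = {
    'data_sets': {'train': train, 'validation': train, 'test': train},
    'feature_map': {},               # fill in per the dump above
    'scalers': {},                   # fitted sklearn scalers/encoders
    'categorical_cardinalities': {}, # per-feature category counts
}
with open('my_data.pickle', 'wb') as fp:
    pickle.dump(data, fp)
```

In a real dataset you would of course use separate (and differently sized) arrays for the three subsets, and fill in feature_map, scalers, and categorical_cardinalities to match your features.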

Dvirbeno commented 1 year ago

@landkwon94 does this answer your needs? Can we close the issue?

landkwon94 commented 1 year ago

Yes Sir, it is okay to close it :) Thanks again!!