Closed landkwon94 closed 1 year ago
Hi, and thanks for trying this out!
First, note that this is just an example of how data can be generated, and how the input data (for the next stage) should be structured. Here, we specifically chose to work on one of the datasets mentioned in the paper.
Running this specific example shouldn't take too long and should only be done once.
So far, I haven't bumped into the problem you're raising (probably because I didn't try to modify the input data file :)).
You can walk through the example again and check what you're left with when you get to the Splitting Data stage (https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html#Splitting-Data): what does data_df contain?
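For example, a quick sanity check at that point might look like this (a generic sketch; the toy frame and column names here just stand in for the real data_df):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_df at the Splitting Data stage
data_df = pd.DataFrame({
    'unit_sales': [1.0, np.nan, 3.0, np.nan],
    'store_nbr': [1, 2, 1, 2],
})

# Basic checks: shape, dtypes, and the share of nulls per column
print(data_df.shape)
print(data_df.dtypes)
print(data_df.isnull().mean())  # a mostly-null column will stand out here
```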
Update if there are any findings.
Hello Sir,
Thank you so much for fast reply!!
First, the code below takes a very long time on my PC (around a week, even though my memory and CPU are not small). Does it take similarly long on your PC?
for col in tqdm(feature_cols):
    if col in categorical_attrs:
        le = scalers['categorical'][col]
        le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
        data_df[col] = data_df[col].apply(lambda x: le_dict.get(x, max(le.transform(le.classes_)) + 1))
        data_df[col] = data_df[col].astype(np.int32)
    else:
        data_df[col] = scalers['numeric'][col].transform(data_df[col].values.reshape(-1, 1)).squeeze()
        data_df[col] = data_df[col].astype(np.float32)
Second, I am trying to follow your comments! I ran the code again with the original data (not the modified one) :)
Third, when I reduced train.csv and ran the code, I saved data_df to a csv file. Below are some sample rows of data_df (the modified version).
I will wait for your reply while running the code with the original datasets (not the modified version).
Super-thanks for your time and contributions!! 👍
Hi again,
First, the code below takes a very long time on my PC (around a week, even though my memory and CPU are not small). Does it take similarly long on your PC?
No, it takes no more than a few minutes. Definitely not a week.
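As a side note on the loop you quoted: calling le.transform(le.classes_) inside the per-row lambda recomputes the fallback value for every single row, which can blow up the runtime on a large frame. A possible speedup (my sketch, not the tutorial's code) is to precompute the mapping and the unseen-category fallback once, and use Series.map instead of apply:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for data_df and one fitted encoder
data_df = pd.DataFrame({'city': ['Quito', 'Guayaquil', 'Quito', 'Cuenca']})
le = LabelEncoder().fit(['Quito', 'Guayaquil'])

# Precompute the mapping and the "unseen category" fallback once,
# instead of re-encoding le.classes_ inside a per-row lambda
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
unseen = max(le_dict.values()) + 1

# Series.map is vectorized; unmapped values become NaN and get the fallback
data_df['city'] = data_df['city'].map(le_dict).fillna(unseen).astype(np.int32)
print(data_df['city'].tolist())
```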
Third, when I reduced train.csv and ran the code, I saved the data_df into csv file. The below contents are the sample rows of data_df (modified version).
It seems like something got messed up in the resulting dataframe (which explains why you eventually end up with empty dictionaries). From the csv sheet, it's hard to tell what the Unnamed
column signifies. Plus, note that most of the unit_sales
column is empty/null, and therefore the log_sales
column turns out to be uninformative.
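To illustrate that last point: NaNs propagate through the log transform, so a mostly-null unit_sales column yields a mostly-null log_sales column (a toy sketch, assuming the log feature is derived roughly like this):

```python
import numpy as np
import pandas as pd

# Toy unit_sales column that is mostly null, as in the pasted csv sample
unit_sales = pd.Series([np.nan, np.nan, np.nan, 12.0])

# NaN in -> NaN out: the derived feature inherits the missingness
log_sales = np.log1p(unit_sales)
print(log_sales.isnull().mean())  # 0.75 of the feature is null
```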
Hello Sir! Thank you so much for your reply :)
I have tried lots of experiments recently, but I still get similar errors.
I am a PhD student in South Korea, doing ecological sensing research.
My purpose in using TFT is to forecast vegetation health by merging multiple variables such as weather, temperature, soil moisture, etc.
So I wanted to format my own datasets the same way as your input train datasets. That's why I wanted to preprocess with your methods :)
May I ask you for your output data after running this code? https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html
I really just want to check how data_path = '.../data/favorita/data.pickle' in the training process (https://playtikaoss.github.io/tft-torch/build/html/tutorials/TrainingExample.html#) is formatted!
If I can see the format of the train data pickle file, I think I can proceed with your code!
This is my email address!
twinsben94@snu.ac.kr
I really want to thank you for your contributions to the TFT code :) I will wait for your reply! Have a nice weekend 👍
Sincerely, Ryoungseob
Sure.
In [1]: import pickle
In [2]: with open('data.pickle','rb') as fp:
...: data = pickle.load(fp)
In [3]: data.keys()
Out[3]: dict_keys(['data_sets', 'feature_map', 'scalers', 'categorical_cardinalities'])
In [4]: type(data['data_sets'])
Out[4]: dict
In [5]: data['data_sets'].keys()
Out[5]: dict_keys(['train', 'validation', 'test'])
In [6]: data['data_sets']['train'].keys()
Out[6]: dict_keys(['time_index', 'combination_id', 'static_feats_numeric', 'static_feats_categorical', 'historical_ts_numeric', 'historical_ts_categorical', 'future_ts_numeric', 'future_ts_categorical', 'target'])
In [7]: for subset in data['data_sets']:
...: print(f"Subset: {subset}")
...: print('='*20)
...:
...: for key in data['data_sets'][subset]:
...: print(key)
...: arr = data['data_sets'][subset][key]
...: print(f"type: {type(arr)} Shape: {arr.shape} Dtype {arr.dtype}")
...:
Subset: train
====================
time_index
type: <class 'numpy.ndarray'> Shape: (11532481,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (11532481,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (11532481, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (11532481, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (11532481, 30) Dtype float32
Subset: validation
====================
time_index
type: <class 'numpy.ndarray'> Shape: (120833,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (120833,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (120833, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (120833, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (120833, 30) Dtype float32
Subset: test
====================
time_index
type: <class 'numpy.ndarray'> Shape: (3454260,) Dtype object
combination_id
type: <class 'numpy.ndarray'> Shape: (3454260,) Dtype <U10
static_feats_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 0) Dtype float32
static_feats_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 9) Dtype int32
historical_ts_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 90, 4) Dtype float32
historical_ts_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 90, 7) Dtype int32
future_ts_numeric
type: <class 'numpy.ndarray'> Shape: (3454260, 30, 1) Dtype float32
future_ts_categorical
type: <class 'numpy.ndarray'> Shape: (3454260, 30, 7) Dtype int32
target
type: <class 'numpy.ndarray'> Shape: (3454260, 30) Dtype float32
In [8]: data['feature_map'] #Dict of lists
Out[8]:
{'static_feats_numeric': [],
'static_feats_categorical': ['store_nbr',
'item_nbr',
'city',
'state',
'store_type',
'store_cluster',
'item_family',
'item_class',
'perishable'],
'historical_ts_numeric': ['log_sales',
'day_of_month',
'transactions',
'oil_price'],
'historical_ts_categorical': ['onpromotion',
'open',
'day_of_week',
'month',
'national_holiday',
'regional_holiday',
'local_holiday'],
'future_ts_numeric': ['day_of_month'],
'future_ts_categorical': ['onpromotion',
'open',
'day_of_week',
'month',
'national_holiday',
'regional_holiday',
'local_holiday']}
In [9]: data['scalers']
Out[9]:
{'numeric': {'log_sales': StandardScaler(),
'day_of_month': MinMaxScaler(),
'transactions': QuantileTransformer(n_quantiles=256),
'oil_price': QuantileTransformer(n_quantiles=256)},
'categorical': {'store_nbr': LabelEncoder(),
'item_nbr': LabelEncoder(),
'onpromotion': LabelEncoder(),
'open': LabelEncoder(),
'day_of_week': LabelEncoder(),
'month': LabelEncoder(),
'city': LabelEncoder(),
'state': LabelEncoder(),
'store_type': LabelEncoder(),
'store_cluster': LabelEncoder(),
'item_family': LabelEncoder(),
'item_class': LabelEncoder(),
'perishable': LabelEncoder(),
'national_holiday': LabelEncoder(),
'regional_holiday': LabelEncoder(),
'local_holiday': LabelEncoder()}}
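Note that these are fitted scikit-learn transformers, so if you rebuild this structure for your own data, new values go through transform() and model outputs can come back through inverse_transform(). A hedged round-trip sketch (the log1p step is illustrative, not necessarily the tutorial's exact transform):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in for something like scalers['numeric']['log_sales']
scaler = StandardScaler().fit(np.log1p([[10.0], [20.0], [40.0]]))

# Forward: raw sales -> log -> scaled (what the feature arrays hold)
scaled = scaler.transform(np.log1p([[25.0]]))

# Backward: scaled model output -> log -> sales
sales = np.expm1(scaler.inverse_transform(scaled))
print(sales)  # recovers ~25.0
```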
In [10]: data['categorical_cardinalities']
Out[10]:
{'store_nbr': 53,
'item_nbr': 3626,
'onpromotion': 2,
'open': 2,
'day_of_week': 7,
'month': 12,
'city': 22,
'state': 16,
'store_type': 5,
'store_cluster': 17,
'item_family': 32,
'item_class': 319,
'perishable': 2,
'national_holiday': 71,
'regional_holiday': 5,
'local_holiday': 27}
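Putting the dump together: data.pickle is just a plain dict of numpy arrays plus three metadata dicts, so you could assemble a compatible skeleton for your own data along these lines (dummy shapes and sizes, purely illustrative):

```python
import pickle
import numpy as np

n, hist, fut = 8, 90, 30  # samples (dummy), history window, horizon

data = {
    'data_sets': {
        subset: {
            'time_index': np.empty(n, dtype=object),
            'combination_id': np.empty(n, dtype='<U10'),
            'static_feats_numeric': np.zeros((n, 0), dtype=np.float32),
            'static_feats_categorical': np.zeros((n, 9), dtype=np.int32),
            'historical_ts_numeric': np.zeros((n, hist, 4), dtype=np.float32),
            'historical_ts_categorical': np.zeros((n, hist, 7), dtype=np.int32),
            'future_ts_numeric': np.zeros((n, fut, 1), dtype=np.float32),
            'future_ts_categorical': np.zeros((n, fut, 7), dtype=np.int32),
            'target': np.zeros((n, fut), dtype=np.float32),
        }
        for subset in ('train', 'validation', 'test')
    },
    'feature_map': {},                 # dict of lists: feature names per input channel
    'scalers': {},                     # fitted sklearn transformers per feature
    'categorical_cardinalities': {},   # feature name -> number of classes
}

with open('data.pickle', 'wb') as fp:
    pickle.dump(data, fp)
```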
Does that help?
@landkwon94 does it answer your needs? can we close the issue?
Yes Sir, it is okay to close it :) Thanks again!!
Hello Sir, first of all, many thanks for your contributions to the TFT GitHub code!
I followed your tutorials at this link: https://playtikaoss.github.io/tft-torch/build/html/index.html
And I found it took a very long time to prepare the train datasets (https://playtikaoss.github.io/tft-torch/build/html/tutorials/DataGenerationExample.html)
So, I reduced train.csv (randomly picking rows) to quickly see the train/test results.
But after I trained TFT again, I got empty pickle datasets:
The terminal output is as follows --->
May I get some advice about these errors?
Thank you so much, Sir!