armheb closed this issue 4 years ago
Hi armheb,
Yes, concatenation is one possible solution. However, if all files combined are too large to fit in memory, you can use a DataGenerator to supply the data to the model incrementally. Here is an example:
```python
class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths
        self.seq_len = seq_len
        self.batch_size = batch_size

    def __len__(self):
        # number of file batches per epoch
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values
            # slide a window of length seq_len over the rows of this file
            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]
                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[3])
        train_input_ohlcv = np.asarray(train_input_ohlcv, dtype=np.float32)
        y = np.asarray(y, dtype=np.float32)
        return train_input_ohlcv, y

train_gen = DataGenerator(train_seq_paths, seq_len, batch_size)
val_gen = DataGenerator(val_seq_paths, seq_len, batch_size)
```
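The inner loop of `__getitem__` builds overlapping windows from each file. In isolation, that windowing step can be sketched with NumPy only (`make_windows` and the toy data here are illustrative, not part of the repo):

```python
import numpy as np

def make_windows(seq, seq_len):
    """Split a (T, features) array into T - seq_len overlapping input
    windows of length seq_len, each paired with the row that follows it."""
    inputs, targets = [], []
    for i in range(seq_len, len(seq)):
        inputs.append(seq[i - seq_len:i])
        targets.append(seq[i])
    return np.asarray(inputs, np.float32), np.asarray(targets, np.float32)

# toy OHLCV-like data: 10 timesteps, 5 features
seq = np.arange(50, dtype=np.float32).reshape(10, 5)
X, y = make_windows(seq, seq_len=3)
print(X.shape, y.shape)  # (7, 3, 5) (7, 5)
```

Note that because every file contributes all of its windows, the effective model batch size is the number of windows in a file batch, not `batch_size` itself.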
Then, when training the model, the fit call looks as follows (note that `batch_size` must not be passed to `fit` when the input is a generator; the generator controls batching itself):

```python
history = model.fit(
    train_gen,
    steps_per_epoch=len(train_seq_paths) // batch_size,
    verbose=1,
    callbacks=[callbacks],
    epochs=35,
    shuffle=True,
    validation_data=val_gen,
    validation_steps=len(val_seq_paths) // batch_size,
    max_queue_size=2,
)
```
I hope this helps.
Thank you so much for the great explanation, I will try that and share the results here.
I'm looking forward to the results. If you have any additional questions, just let me know.
Hi, thanks, you've been very helpful. I modified your DataGenerator code a bit and got it to start training as you said, but now at the end of the first epoch I get out-of-memory errors on the GPU. Here is the error:
```
OOM when allocating tensor with shape[3736448,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/transformer_encoder/multi_attention/single_attention_6/dense_20/Tensordot/MatMul (defined at
Function call stack: train_function
```
I'm training on a Titan XP with 12GB of memory, and I also decreased the batch size and seq_len, but I still get the same error.
This is my DataGenerator class:
```python
class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths
        self.seq_len = seq_len
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values
            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i - self.seq_len:i], seq[i]
                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[1])
        train_input_ohlcv = np.array(train_input_ohlcv)
        y = np.array(y)
        return train_input_ohlcv, y
```
If you are not shuffling your files during training, it looks like the last files that go into the generator have a lot of entries. What I can derive from shape[3736448,256] is that you are passing 3,736,448 sequences of length 256 into the model.
The 3736448 is the aggregated batch size of that file batch.
Just check whether you have a very large file in your dataset and, if so, exclude it for now.
Thanks for your answer. I don't have a large file in the dataset; in the preprocess_data function I return the preprocessed dataframe with the same shape. Do you think there is anything wrong in the second loop of the generator class? That was the part I changed.
By adding all sequences together, the model can train, although it takes about 4.5 hours per epoch!
Hi, I trained the model for about a week, but unfortunately the final result was a straight line in the middle. Do you have plans to update the repo? Can you please share your weights so I can fine-tune the model based on them? I really appreciate your work. Thanks.
Hi, thanks for sharing your awesome work. I wanted to train the transformer model on multiple CSVs which cover the same time span; should I just concatenate them into one big dataframe and train the model on that? Thanks