JanSchm / CapMarket


Training on multiple csvs #1

Closed armheb closed 4 years ago

armheb commented 4 years ago

Hi, thanks for sharing your awesome work. I want to train the transformer model on multiple CSVs that cover the same time span. Should I just concatenate them into one big dataframe and train the model on that? Thanks

JanSchm commented 4 years ago

Hi armheb,

Yes, concatenation is one possible solution. However, if all files combined are too large to keep in memory, you can use a DataGenerator to supply the data to the model incrementally. Here's an example:

```python
import numpy as np
import tensorflow as tf


class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths            # list of CSV file paths
        self.seq_len = seq_len        # length of each input sequence
        self.batch_size = batch_size  # number of files per generator batch

    def __len__(self):
        # number of file batches per epoch
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx*self.batch_size : (idx+1)*self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values

            # slide a window of seq_len rows over the file
            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i-self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[3])  # close price as the target

        train_input_ohlcv = np.asarray(train_input_ohlcv, dtype=np.float32)
        y = np.asarray(y, dtype=np.float32)
        return train_input_ohlcv, y


train_gen = DataGenerator(train_seq_paths, seq_len, batch_size)
val_gen = DataGenerator(val_seq_paths, seq_len, batch_size)
```

Then, when training the model, the fit call looks as follows:

```python
history = model.fit(train_gen,
                    steps_per_epoch=len(train_seq_paths)//batch_size,
                    batch_size=batch_size,
                    verbose=1,
                    callbacks=[callbacks],
                    epochs=35,
                    shuffle=True,
                    validation_data=val_gen,
                    validation_steps=len(val_seq_paths)//batch_size,
                    max_queue_size=2)
```

I hope this helps.

armheb commented 4 years ago

Thank you so much for the great explanation, I will try that and share the results here.

JanSchm commented 4 years ago

I'm looking forward to the results. If you have any additional questions, just let me know.

armheb commented 4 years ago

Hi, thanks, you've been very helpful. I modified your DataGenerator code a bit and got it to start training as you said, but now at the end of the first epoch I get out-of-memory errors on the GPU. Here is the error:

```
OOM when allocating tensor with shape[3736448,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model/transformer_encoder/multi_attention/single_attention_6/dense_20/Tensordot/MatMul (defined at :25) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_38226]

Function call stack: train_function
```

I'm training on a Titan XP with 12GB of memory, and I also decreased the batch size and seq_len, but I'm still getting the same error.

armheb commented 4 years ago

This is my DataGenerator class:

```python
class DataGenerator(tf.keras.utils.Sequence):

    def __init__(self, paths, seq_len, batch_size):
        self.paths = paths
        self.seq_len = seq_len
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        path_batch = self.paths[idx*self.batch_size : (idx+1)*self.batch_size]

        train_input_ohlcv, y = list(), list()
        for path in path_batch:
            seq = preprocess_data(path).values

            for i in range(self.seq_len, len(seq)):
                in_seq, out_seq = seq[i-self.seq_len:i], seq[i]

                train_input_ohlcv.append(in_seq[-self.seq_len:])
                y.append(out_seq[1])

        train_input_ohlcv = np.array(train_input_ohlcv)
        y = np.array(y)
        return train_input_ohlcv, y
```
JanSchm commented 4 years ago

If you are not shuffling your files during training, it looks like the last files that go into the generator have a lot of entries. What I can derive from shape[3736448,256] is that you are passing 3736448 sequences of length 256 into the model at once.

The 3736448 is the aggregated batch size of that file batch.

Just check whether you have a very large file in your dataset and potentially exclude it for now.
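
For a quick check, something along these lines should work; this is just a sketch that reuses the `preprocess_data`, `seq_len`, and `train_seq_paths` names from the snippets above:

```python
# Rough per-file sequence count (sketch). One __getitem__ call returns
# roughly batch_size * (rows - seq_len) sequences, so a single oversized
# file inflates the effective batch that reaches the GPU.
for path in train_seq_paths:
    n_rows = len(preprocess_data(path))
    print(path, max(0, n_rows - seq_len))
```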

armheb commented 4 years ago

Thanks for your answer. I don't have a large file in the dataset; in the preprocess_data function I return the preprocessed dataframe with the same shape. Do you think there is anything wrong with the second loop of the generator class? That was the part I changed.

armheb commented 4 years ago

By adding all sequences together, the model can train, although it takes about 4.5 hours per epoch!

armheb commented 4 years ago

Hi, I trained the model for about a week, but unfortunately the final result was a straight line in the middle. Do you have plans to update the repo? Could you please share your weights so I can fine-tune the model based on them? I really appreciate your work. Thanks.