Hi, thank you for the idea. Indeed, it would be nice to add this to the environment! I see two options to implement it:
Separate the data from the very beginning, as you presented (thanks for your work!). But this might cause performance issues, complicate the preprocessing system, and it won't work with a basic TradingEnv. Example: if batch_size is relatively low, say 500, episodes will be short, so the MultiDatasetTradingEnv will have to reload a dataset and preprocess it before being operational very frequently. Plus, if the user's preprocessing function drops some rows, it might greatly reduce the length of an episode, or even reduce it to 0 (I personally use a preprocessing function that drops the first 200 rows to compute moving-average indicators).
My solution to those issues is to truncate episodes inside the environment by adding an optional "max_episode_length" parameter (default: None) to TradingEnv and MultiDatasetTradingEnv. If "max_episode_length" is set by the user, each episode is truncated once it reaches the indicated length. Plus, every episode starts at a random point in the DataFrame and with a random portfolio position (chosen among the "positions" parameter).
This solution seems better, as the dataset is preprocessed only once: performance is better, and the episode-length reduction caused by preprocessing is relatively much smaller.
Plus, for the MultiDatasetTradingEnv, it would be nice to add some kind of "number_of_episode_per_dataset" parameter (with a default value of something like 10). This way, we keep the same benefits mentioned above.
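To make the truncation idea concrete, here is a rough sketch of the logic (the function name and internals are only illustrative, not a final API):

```python
import numpy as np

# Illustrative sketch of the proposed truncation, not the actual TradingEnv code.
# max_episode_length=None keeps the current behaviour: one episode over the
# whole preprocessed DataFrame.
def choose_episode_start(n_rows, max_episode_length=None, positions=(-1, 0, 1)):
    if max_episode_length is None or max_episode_length >= n_rows:
        start, end = 0, n_rows
    else:
        # Random starting row, so each episode covers a different time segment.
        start = np.random.randint(0, n_rows - max_episode_length + 1)
        end = start + max_episode_length
    # Random initial portfolio position, chosen among the "positions" parameter.
    initial_position = np.random.choice(positions)
    return start, end, initial_position

# Example: a 10,000-row dataset with 500-step episodes.
print(choose_episode_start(10_000, max_episode_length=500))
```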
I would like to have your opinion about this! What do you think?
Your solution is the cleaner one. Mine was more of a quick-and-dirty hack.
A suggestion on the option name: I would call it "shuffled_episode_length" to indicate that it not only defines the episode length but also randomizes the start position. This should make the option's main purpose more obvious.
As for "number_of_episode_per_dataset", its purpose is not yet clear to me.
If the "shuffled_episode_length" option (to use that name for now) is set, would it work the same way in both TradingEnv and MultiDatasetTradingEnv?
"number_of_episode_per_dataset" would be for the MultiDatasetTradingEnv class. As this env automatically switch from one dataset to another at the end of each episode, it would be a performance issue to shorten to episode lenght. So It might be useful to reuse several times the same dataset (by performing several episodes in a row) before changing to another dataset. Is "episodes_before_dataset_switch" clearer ?
Now it becomes clear, makes sense. Thanks for the explanation!
It is done: I added the parameters discussed above (episode truncation and dataset-switch frequency).
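For reference, usage would look something like this (a sketch using the parameter names from this discussion; the exact names in the release may differ, so check the documentation):

```python
import gymnasium as gym

# Hypothetical usage sketch; parameter names follow this thread's discussion
# and the dataset glob is illustrative.
env = gym.make(
    "MultiDatasetTradingEnv",
    dataset_dir="data/*.pkl",           # preprocessed datasets saved as pickles
    positions=[-1, 0, 1],
    max_episode_length=500,             # truncate episodes, random start on reset
    episodes_before_dataset_switch=10,  # reuse each dataset for 10 episodes
)
```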
Perfect. Can't wait to try it today.
To further reduce overfitting, it might also be useful to split the individual records of each pair from an exchange into batches, so that the agent sees random time segments from a dataset in each episode when using MultiDatasetTradingEnv.
The goal would be to reduce the probability that the agent simply learns the long-term price trend by heart; this way, it only ever sees random sections of the dataset.
Does the thought process make sense? If so, I would finish a PR to extend the download function with an optional "batch_size" argument.
Something like this:
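A minimal sketch of the idea (the helper name and its integration with the download function are my assumptions):

```python
import glob
import pandas as pd

def split_into_batches(dataset_dir="data", batch_size=10_000):
    """Illustrative helper: split each downloaded dataset into chunks of
    batch_size rows, each saved as its own pickle, so that
    MultiDatasetTradingEnv samples random time segments rather than
    whole price histories."""
    for path in glob.glob(f"{dataset_dir}/*.pkl"):
        df = pd.read_pickle(path)
        stem = path.rsplit(".pkl", 1)[0]
        for i, start in enumerate(range(0, len(df), batch_size)):
            df.iloc[start:start + batch_size].to_pickle(f"{stem}-batch{i}.pkl")
```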