ClementPerroud / Gym-Trading-Env

A simple, easy, customizable Gymnasium environment for trading.
https://gym-trading-env.readthedocs.io/
MIT License

Feature proposal: batch_size arg for the download function #2

Closed kvrban closed 1 year ago

kvrban commented 1 year ago

To further reduce overfitting, it might also be useful to split the downloaded records of a pair from an exchange into batches, so that the agent sees random time segments of a dataset in each episode when using MultiDatasetTradingEnv.

The goal would be to reduce the probability that the agent simply memorizes the long-term price trend: this way it only ever sees random sections of the dataset.

Does the thought process make sense? If yes, I would finish a PR to extend the download function with an optional argument "batch_size".

Something like this:

# Note: pandas, asyncio, datetime and the downloader module's existing
# _ohlcv coroutine are assumed to be available here.
import asyncio
import datetime
import pandas as pd


async def _download_symbol(
    exchange, symbol, timeframe='5m',
    since=int(datetime.datetime(year=2020, month=1, day=1).timestamp() * 1E3),
    until=int(datetime.datetime.now().timestamp() * 1E3),
    limit=1000, pause_every=10, pause=1,
    dir='data',        # output directory (assumed; not defined in the original snippet)
    batch_size=None,   # groups of `pause_every` requests to accumulate per saved batch file
):
    timedelta = int(pd.Timedelta(timeframe).to_timedelta64() / 1E6)

    def _save(results, batch_num):
        # Clean the fetched candles and persist them as one batch file.
        final_df = pd.concat(results, ignore_index=True)
        final_df = final_df.loc[(since < final_df["timestamp_open"]) & (final_df["timestamp_open"] < until), :]
        del final_df["timestamp_open"]
        final_df.set_index('date_open', drop=True, inplace=True)
        final_df.sort_index(inplace=True)
        final_df.dropna(inplace=True)
        final_df.drop_duplicates(inplace=True)
        save_file = f"{dir}/{exchange.id}-{symbol.replace('/', '')}-{timeframe}-batch{batch_num}.pkl"
        final_df.to_pickle(save_file)
        print(f"{symbol} downloaded from {exchange.id} and stored at {save_file}")

    tasks = []
    results = []
    batch_num = 1        # index of the next batch file to write
    groups_fetched = 0   # completed groups of `pause_every` requests
    for step_since in range(since, until, limit * timedelta):
        tasks.append(
            asyncio.create_task(_ohlcv(exchange, symbol, timeframe, limit, step_since, timedelta))
        )
        if len(tasks) >= pause_every:
            results.extend(await asyncio.gather(*tasks))
            await asyncio.sleep(pause)
            tasks = []
            groups_fetched += 1
            # Write out a batch file every `batch_size` groups of requests.
            if batch_size is not None and groups_fetched % batch_size == 0:
                _save(results, batch_num)
                results = []
                batch_num += 1
    if len(tasks) > 0:
        results.extend(await asyncio.gather(*tasks))
    if len(results) > 0:
        _save(results, batch_num)
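And a hypothetical usage sketch (not part of the proposal itself), assuming ccxt's async exchange client and that the output directory already exists:

# Hypothetical usage of the modified _download_symbol with batch_size.
import asyncio
import ccxt.async_support as ccxt

async def main():
    exchange = ccxt.binance()
    try:
        await _download_symbol(
            exchange, "BTC/USDT", timeframe="5m",
            dir="data",
            batch_size=5,   # save a separate .pkl every 5 groups of requests
        )
    finally:
        await exchange.close()

asyncio.run(main())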
ClementPerroud commented 1 year ago

Hi, thank you for the idea. Indeed, it would be nice to add this to the environment! To me, there are 2 options to implement this: split the data into separate batch files at download time (as in your snippet), or handle it in the environment itself by limiting the episode length and starting each episode at a random position in the dataset.

The second option seems better: the dataset will be preprocessed only once, so the performance will be better and the overhead it introduces will be relatively much smaller.

Plus, for the MultiDatasetTradingEnv, it would be nice to add some kind of "number_of_episode_per_dataset" parameter (with a default value of something like 10). This way, we keep the same benefits mentioned above.
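To illustrate the second option, a minimal sketch (not the final implementation; the helper name and window length are just for illustration) of sampling a random fixed-length window from an already preprocessed dataset at each episode reset:

import numpy as np
import pandas as pd

def sample_episode_window(df: pd.DataFrame, episode_length: int) -> pd.DataFrame:
    # Pick a random start index so the agent only ever sees a random slice
    # of the full (already preprocessed) dataset during this episode.
    if len(df) <= episode_length:
        return df
    start = np.random.randint(0, len(df) - episode_length)
    return df.iloc[start:start + episode_length]

# Example: a 500-step episode sampled from a full OHLCV dataset.
# df = pd.read_pickle("data/binance-BTCUSDT-5m.pkl")
# episode_df = sample_episode_window(df, episode_length=500)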

I would like to have your opinion about this! What do you think?

kvrban commented 1 year ago

Your solution is the clean one. Mine was more of a quick and dirty hack.

A suggestion on the name of the option: I would call it "shuffled_episode_length" to give an indication that it not only defines the length of the episode, but also randomly sets the start position. This should make the main purpose of this option more obvious.

About "number_of_episode_per_dataset": its purpose is not yet clear to me.

If the option (to use the proposed name) "shuffled_episode_length" is set, would it work the same way in both TradingEnv and MultiDatasetTradingEnv?

ClementPerroud commented 1 year ago

"number_of_episode_per_dataset" would be for the MultiDatasetTradingEnv class. As this env automatically switch from one dataset to another at the end of each episode, it would be a performance issue to shorten to episode lenght. So It might be useful to reuse several times the same dataset (by performing several episodes in a row) before changing to another dataset. Is "episodes_before_dataset_switch" clearer ?

kvrban commented 1 year ago

Now it becomes clear. Makes sense, thanks for the explanation!

ClementPerroud commented 1 year ago

It is done: I added the parameters.
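As a minimal usage sketch, assuming the parameter names max_episode_duration and episodes_between_dataset_switch as they appear in the documentation (please check the docs for the exact API):

import gymnasium as gym
import gym_trading_env  # registers the trading environments

env = gym.make(
    "MultiDatasetTradingEnv",
    dataset_dir="data/*.pkl",
    max_episode_duration=500,            # each episode sees a random 500-step window
    episodes_between_dataset_switch=10,  # reuse a dataset for 10 episodes before switching
)

obs, info = env.reset()
done, truncated = False, False
while not (done or truncated):
    action = env.action_space.sample()
    obs, reward, done, truncated, info = env.step(action)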

kvrban commented 1 year ago

Perfect. Can't wait to try it today.