all the input array dimensions except for the concatenation axis must match exactly when running FinRL_PortfolioOptimizationEnv_Demo with CAC40 Data #1257

Open saurL opened 1 month ago

saurL commented 1 month ago

I wanted to run FinRL_PortfolioOptimizationEnv_Demo with the data source changed to the CAC40, but unfortunately I got an error when calling the train_model method of DRLAgent:


ValueError                                Traceback (most recent call last)
<ipython-input-50-63f854a52e04> in <cell line: 1>()
----> 1 DRLAgent.train_model(model, episodes=40)

4 frames

/usr/local/lib/python3.10/dist-packages/finrl/agents/portfolio_optimization/models.py in train_model(model, episodes)
     78             An instance of the trained model.
     79         """
---> 80         model.train(episodes)
     81         return model
     82 

/usr/local/lib/python3.10/dist-packages/finrl/agents/portfolio_optimization/algorithms.py in train(self, episodes)
    118 
    119                 # run simulation step
--> 120                 next_obs, reward, done, info = self.train_env.step(action)
    121 
    122                 # add experience to replay buffer

/usr/local/lib/python3.10/dist-packages/finrl/meta/env_portfolio_optimization/env_portfolio_optimization.py in step(self, actions)
    301             # load next state
    302             self._time_index += 1
--> 303             self._state, self._info = self._get_state_and_info_from_time_index(
    304                 self._time_index
    305             )

/usr/local/lib/python3.10/dist-packages/finrl/meta/env_portfolio_optimization/env_portfolio_optimization.py in _get_state_and_info_from_time_index(self, time_index)
    454             tic_data = tic_data[self._features].to_numpy().T
    455             tic_data = tic_data[..., np.newaxis]
    456 
--> 457             state = tic_data if state is None else np.append(state, tic_data, axis=2)
    458         state = state.transpose((0, 2, 1))

/usr/local/lib/python3.10/dist-packages/numpy/lib/function_base.py in append(arr, values, axis)
   5615             arr = arr.ravel()
   5616         values = ravel(values)
   5617         axis = arr.ndim-1
-> 5618     return concatenate((arr, values), axis=axis)
   5619 

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 49 and the array at index 1 has size 50

saurL commented 1 month ago

I found my problem, and it surfaced another underlying issue. I will start by explaining my problem and its resolution, then raise a possible problem with how FeatureEngineer processes data (or perhaps it is not a problem at all).

My problem was that when YahooDownloader downloaded the data, some dates had no data for certain tickers. The missing data did not show up as empty values (NaN); the row for that date was simply absent. When grouping by date, this produced 49 data points instead of 50.
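
For context, the traceback shows the environment builds the state one ticker at a time: each ticker contributes an array of shape (features, time_window, 1) that is appended along axis 2. A minimal sketch reproducing the mismatch (the 3 features and 50-day window are assumptions for illustration):

import numpy as np

# hypothetical dimensions: 3 features, a 50-day time window
tic_a = np.zeros((3, 49, 1))  # ticker with a missing date: only 49 rows in its window
tic_b = np.zeros((3, 50, 1))  # ticker with data for all 50 dates

# raises: "... along dimension 1, the array at index 0 has size 49
# and the array at index 1 has size 50"
state = np.append(tic_a, tic_b, axis=2)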

Solution:

def clean_data(data):
    df = data.copy()
    df = df.sort_values(["date", "tic"], ignore_index=True)
    df.index = df.date.factorize()[0]
    # pivot so each column holds one ticker's close prices, indexed by date
    merged_closes = df.pivot_table(index="date", columns="tic", values="close")
    # dates where at least one ticker has no row
    empty_lines = merged_closes[merged_closes.isnull().any(axis=1)]
    date_list = list(empty_lines.index)
    # drop those dates for every ticker instead of dropping whole tickers
    df_clean = df[~df["date"].isin(date_list)]
    return df_clean
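
A usage sketch under my assumptions (the downloader call mirrors the notebook; CAC40_TICKERS and the date range are hypothetical placeholders):

from finrl.meta.preprocessor.yahoodownloader import YahooDownloader

# CAC40_TICKERS is a hypothetical placeholder, e.g. ["AIR.PA", "BNP.PA", ...]
raw_df = YahooDownloader(
    start_date="2011-01-01",
    end_date="2021-12-31",
    ticker_list=CAC40_TICKERS,
).fetch_data()

df = clean_data(raw_df)

# every remaining date should now have one row per ticker
assert df.groupby("date")["tic"].count().nunique() == 1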

During my investigation, I also tried to run the Stock_NeurIPS2018_2_Train.ipynb notebook with CAC40 data and encountered another type of error, once again related to the data.

When FeatureEngineer processes the data, it executes this code to clean the data:

def clean_data(self, data):
    """
    clean the raw data
    deal with missing values
    reasons: stocks could be delisted, not incorporated at the time step
    :param data: (df) pandas dataframe
    :return: (df) pandas dataframe
    """
    df = data.copy()
    df = df.sort_values(["date", "tic"], ignore_index=True)
    df.index = df.date.factorize()[0]
    merged_closes = df.pivot_table(index="date", columns="tic", values="close")
    merged_closes = merged_closes.dropna(axis=1)
    tics = merged_closes.columns
    df = df[df.tic.isin(tics)]
    # df = data.copy()
    # list_ticker = df["tic"].unique().tolist()
    # only apply to daily level data, need to fix for minute level
    # list_date = list(pd.date_range(df['date'].min(),df['date'].max()).astype(str))
    # combination = list(itertools.product(list_date,list_ticker))

    # df_full = pd.DataFrame(combination,columns=["date","tic"]).merge(df,on=["date","tic"],how="left")
    # df_full = df_full[df_full['date'].isin(df['date'])]
    # df_full = df_full.sort_values(['date','tic'])
    # df_full = df_full.fillna(0)
    return df

This deletes every ticker that is missing data for any date. That is, if even a single data point is absent on a single date, all data for that ticker is removed.
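
To make the difference concrete, here is a toy sketch (hypothetical tickers "AAA" and "BBB") comparing the two strategies:

import pandas as pd

# toy data: ticker "BBB" has no row for 2024-01-02
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03"],
    "tic":  ["AAA", "BBB", "AAA", "AAA", "BBB"],
    "close": [1.0, 2.0, 1.1, 1.2, 2.2],
})

closes = df.pivot_table(index="date", columns="tic", values="close")

# current FeatureEngineer behaviour: drop the whole ticker "BBB"
kept_tics = closes.dropna(axis=1).columns
by_ticker = df[df.tic.isin(kept_tics)]     # only "AAA" survives

# row-based alternative: drop only the incomplete date 2024-01-02
bad_dates = closes[closes.isnull().any(axis=1)].index
by_date = df[~df["date"].isin(bad_dates)]  # both tickers survive

print(by_ticker.tic.unique())  # ['AAA']
print(by_date.tic.unique())    # ['AAA' 'BBB']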

I am not familiar with the whole project, which is why I ask: why not modify it to delete only the dates that are missing data (as in my code above) rather than all the data for the ticker? I have time and, having already investigated the subject, I am willing to make the necessary changes, but I am aware this may have impacts I am unaware of on the viability of the model or other aspects.