saurL opened 4 months ago
I found the cause of my problem, and it uncovered another underlying issue. I will start by explaining my problem and its resolution, then raise a question about how FeatureEngineer processes data.
My problem was that when YahooDownloader downloaded the data, there were certain dates for which no data existed. This absence did not show up as empty values (NaN) but as a missing row for the given date, so grouping by date yielded 49 tickers instead of 50.
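The symptom can be reproduced on a toy frame (the ticker names are invented): a date with a missing row simply has a smaller group, and no NaN appears anywhere.

```python
import pandas as pd

# "BBB" has no row at all on 2024-01-02 -- the row is absent, not NaN.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "tic":  ["AAA", "BBB", "AAA"],
    "close": [10.0, 20.0, 11.0],
})

print(df.groupby("date").size().to_dict())
# {'2024-01-01': 2, '2024-01-02': 1}  -- one date has fewer tickers
print(df.isna().any().any())
# False -- there is no NaN to detect with isna()
```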
Solution:
```python
def clean_data(data):
    df = data.copy()
    df = df.sort_values(["date", "tic"], ignore_index=True)
    df.index = df.date.factorize()[0]
    # Pivot to one close-price column per ticker; incomplete dates show NaN.
    merged_closes = df.pivot_table(index="date", columns="tic", values="close")
    empty_lines = merged_closes[merged_closes.isnull().any(axis=1)]
    date_list = list(empty_lines.index)
    # Drop only the incomplete dates, keeping every ticker.
    df_clean = df[~df["date"].isin(date_list)]
    return df_clean
```
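As a quick sanity check of this row-filtering approach on a toy frame (ticker names are invented; the function body is the one above):

```python
import pandas as pd

def clean_data(data):
    df = data.copy()
    df = df.sort_values(["date", "tic"], ignore_index=True)
    df.index = df.date.factorize()[0]
    merged_closes = df.pivot_table(index="date", columns="tic", values="close")
    empty_lines = merged_closes[merged_closes.isnull().any(axis=1)]
    date_list = list(empty_lines.index)
    return df[~df["date"].isin(date_list)]

# "BBB" has no row on 2024-01-02, so that whole date is dropped,
# but BBB's rows on the other dates survive.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03"],
    "tic":  ["AAA", "BBB", "AAA", "AAA", "BBB"],
    "close": [10.0, 20.0, 11.0, 12.0, 21.0],
})
df_clean = clean_data(df)
print(sorted(df_clean["date"].unique()))  # ['2024-01-01', '2024-01-03']
print(sorted(df_clean["tic"].unique()))   # ['AAA', 'BBB'] -- both tickers kept
```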
During my investigation, I also tried to run the Stock_NeurIPS2018_2_Train.ipynb notebook with CAC40 data and encountered another type of error, once again related to the data.
When FeatureEngineer processes the data, it runs this code to clean it:
```python
def clean_data(self, data):
    """
    clean the raw data
    deal with missing values
    reasons: stocks could be delisted, not incorporated at the time step
    :param data: (df) pandas dataframe
    :return: (df) pandas dataframe
    """
    df = data.copy()
    df = df.sort_values(["date", "tic"], ignore_index=True)
    df.index = df.date.factorize()[0]
    merged_closes = df.pivot_table(index="date", columns="tic", values="close")
    merged_closes = merged_closes.dropna(axis=1)
    tics = merged_closes.columns
    df = df[df.tic.isin(tics)]
    # df = data.copy()
    # list_ticker = df["tic"].unique().tolist()
    # only apply to daily level data, need to fix for minute level
    # list_date = list(pd.date_range(df['date'].min(),df['date'].max()).astype(str))
    # combination = list(itertools.product(list_date,list_ticker))
    # df_full = pd.DataFrame(combination,columns=["date","tic"]).merge(df,on=["date","tic"],how="left")
    # df_full = df_full[df_full['date'].isin(df['date'])]
    # df_full = df_full.sort_values(['date','tic'])
    # df_full = df_full.fillna(0)
    return df
```
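On the same kind of toy frame (ticker names invented), the library's column-dropping approach removes every ticker that has any gap, because `dropna(axis=1)` discards a whole pivot column as soon as it contains one NaN:

```python
import pandas as pd

# One missing row for "BBB" on 2024-01-02.
df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-03"],
    "tic":  ["AAA", "BBB", "AAA", "AAA", "BBB"],
    "close": [10.0, 20.0, 11.0, 12.0, 21.0],
})

merged_closes = df.pivot_table(index="date", columns="tic", values="close")
merged_closes = merged_closes.dropna(axis=1)  # drops the entire "BBB" column
tics = merged_closes.columns
df = df[df.tic.isin(tics)]

print(sorted(df["tic"].unique()))  # ['AAA'] -- every BBB row is gone
```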
This deletes every ticker that is missing data for any date: if a single data point is absent on a single date, all data related to that ticker is discarded.
I am not familiar with the whole project, hence my question: why not change it to drop only the incomplete dates (as in my code above) rather than all data related to the ticker? I have time and, having already investigated the subject, I am willing to make the necessary changes, but I am aware this may have impacts I am unaware of on the viability of the model or other aspects.
I also wanted to run FinRL_PortfolioOptimizationEnv_Demo with the data source changed to the CAC40, but unfortunately I got an error when using the train_model method of DRLAgent: