UoB-DSMP-2023-24 / dsmp-2024-group-ol2

dsmp-2024-group-ol2 created by GitHub Classroom

Run EDA on full data set and write up findings #49

Open JackDI1 opened 5 months ago

JackDI1 commented 5 months ago

The first part of the EDA code requires us to convert some of the columns to the correct data types. This is very computationally expensive at this stage, given the size of the dataset. Can we do these conversions when the dataset is created? That way we won't need to repeat them for other downstream tasks.
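A minimal sketch of what casting once at dataset-creation time could look like. The helper `cast_lob_columns`, the mapping `LOB_DTYPES`, and the `Exchange` column are hypothetical names used for illustration; only `Mid_Price` is taken from this thread, and the real mapping would need to match the actual columns.

```python
import pandas as pd

# Hypothetical column-to-dtype mapping; only "Mid_Price" appears in this thread.
LOB_DTYPES = {
    "Mid_Price": "float64",
    "Exchange": "category",
}


def cast_lob_columns(df: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Return a copy of df with the listed columns cast to the given dtypes."""
    # Skip mapping entries for columns that are not in this frame.
    present = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
    return df.astype(present)


# Tiny stand-in for the raw data, with numbers stored as strings.
raw = pd.DataFrame(
    {"Mid_Price": ["101.5", "102.0"], "Exchange": ["Exch0", "Exch0"]}
)
typed = cast_lob_columns(raw, LOB_DTYPES)
print(typed.dtypes)
```

If the typed frame is then written with `typed.to_parquet(...)`, the column types are stored in the Parquet file, so later reads should not need to repeat the casts.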

JackDI1 commented 5 months ago

The following line of code has been running for 30 minutes and has not yet finished:

# How many NaN/null values are in Mid_Price?
missing_mid_price_count = lob['Mid_Price'].isnull().sum()

# The .4% format multiplies the fraction by 100 and appends the percent sign
print(f'Missing "Mid_Price" values: {missing_mid_price_count} ({missing_mid_price_count / len(lob.index):.4%} of the sample)')

Edit: I eventually got an error: ConnectTimeoutError: Connect timeout on endpoint URL: "https://dsmp-ol2.s3.amazonaws.com/processed-data/lob_full_data.parquet". This is going to be an issue when moving on to complex analysis. What is the best option for moving forward?