algotradingsoc / data_infrastructure

Research team for data infrastructure team.
Apache License 2.0

Minutes 30/10 #8

Closed: kevinxuht closed this issue 3 years ago

kevinxuht commented 3 years ago

Kaggle Price Data: we don't have other data to cross-check its quality.

TW: No worries, cross-checking is not possible at the moment. A few checks would be to flag price data with large fluctuations (>50% in a single day) and missing values. Any periods with unusually high volatility can also be noted.
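The two checks TW suggests could be sketched roughly as below. This is only an illustration: the function name `flag_price_issues`, the `(date, close)` input layout, and the sample prices are assumptions, not part of the Kaggle data or any existing script. Each close is compared against the last valid close, so a gap does not hide a large move.

```python
from datetime import date

def flag_price_issues(rows, max_move=0.5):
    """Flag missing closes and single-day moves larger than max_move.

    rows: list of (date, close) tuples for one security, sorted by date.
          close is None when the value is missing (assumed convention).
    Returns a list of (date, reason) tuples.
    """
    flags = []
    prev_close = None
    for day, close in rows:
        if close is None:
            flags.append((day, "missing"))
            continue
        # Compare against the most recent valid close, if there is one.
        if prev_close and abs(close / prev_close - 1) > max_move:
            flags.append((day, "large_move"))
        prev_close = close
    return flags

# Made-up prices, purely to exercise the function.
sample = [
    (date(2020, 10, 26), 100.0),
    (date(2020, 10, 27), None),
    (date(2020, 10, 28), 160.0),
    (date(2020, 10, 29), 158.0),
]
flags = flag_price_issues(sample)
```

A volatility check over a rolling window could be added in the same style once a threshold for "unusually high" is agreed.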

Simple tasks 1, 2, 3: delisting dates (Han), sort by volume (Archie), calendar (Kevin)

Listing and delisting dates for each security: search through the availability dates in the data.
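One possible way to derive these dates is to take the first and last date on which each ticker appears. The sketch below assumes records arrive as `(ticker, ISO date string)` pairs; the function name and the tickers `AAA` and `BBB` are hypothetical.

```python
def availability_dates(records):
    """Map each ticker to (first_date, last_date) seen in the data.

    records: iterable of (ticker, date) pairs, where date is an ISO
             "YYYY-MM-DD" string, so string comparison orders dates correctly.
    """
    spans = {}
    for ticker, day in records:
        first, last = spans.get(ticker, (day, day))
        spans[ticker] = (min(first, day), max(last, day))
    return spans

# Hypothetical tickers and dates, in no particular order.
records = [
    ("AAA", "2020-10-28"),
    ("AAA", "2020-10-26"),
    ("BBB", "2020-10-27"),
    ("AAA", "2020-10-30"),
]
spans = availability_dates(records)
```

A last date equal to the final date of the whole dataset probably means the security is still listed rather than delisted, so those cases would need to be treated separately.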

The FinnHub ID is unique, so match each ticker to its FinnHub ID. Later on, if more data becomes available, join it to the existing data through the ticker-to-FinnHub-ID mapping.

  1. Load all of this data into MongoDB (Guang)

It is important to have the database set up first.

Later on, use MongoDB commands to create functions.

Each feature is a function that grabs results from the database. For now, work with a few CSVs and calculate the desired results, then scale these functions into MongoDB commands.

MongoDB Tutorial

Hierarchy: Database - Collection - Document

A Collection is a table in SQL terms.

A Document is a row of data in SQL terms.

A MongoDB Document is like a dictionary in Python.
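To make the hierarchy concrete, here is a plain-Python mock-up that uses nested dictionaries and lists in place of a real MongoDB server. The collection name `prices`, the field names, and the values are all hypothetical, and `find` is a toy stand-in for an equality query, not the MongoDB API.

```python
# Hypothetical names and values, for illustration only.
database = {                      # database: holds named collections
    "prices": [                   # collection: roughly a SQL table
        # document: roughly a SQL row, shaped like a Python dict
        {"ticker": "AAA", "date": "2020-10-30", "close": 10.0},
        # documents in one collection do not need identical fields
        {"ticker": "BBB", "date": "2020-10-30", "close": 20.0, "note": "extra field"},
    ],
}

def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

matches = find(database["prices"], ticker="BBB")
```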

Commands and queries on a MongoDB Collection, including aggregate commands.

TW: An example of data organisation.

Plan A: each Collection represents one ticker

Assuming the database is called Kaggle_US_Equity, the Collection is TSLA. A document is then the OHLCV of Tesla on a particular day.

Advantage: Easy to get the data for a single security

Plan B: each Collection represents the data for a single date

Assuming the database is called Kaggle_US_Equity, the Collection is 2020-10-30. A document is then the OHLCV of all the stocks listed on that date.

Advantage: Easy to add new data

Disadvantage: Need to merge data across collections and filter it to generate the time-series dataframe
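The trade-off between the two plans can be illustrated with plain Python structures standing in for the collections. Everything here is an assumption made for the sketch: the prices and volumes are made up, only `close` and `volume` are shown instead of full OHLCV, and Plan B is assumed to store one document per ticker inside each date collection.

```python
# Plan A: one collection per ticker, one document per day.
plan_a = {
    "TSLA": [
        {"date": "2020-10-29", "close": 1.0, "volume": 100},
        {"date": "2020-10-30", "close": 2.0, "volume": 150},
    ],
}

# Plan B: one collection per date, one document per ticker (assumed layout).
plan_b = {
    "2020-10-29": [{"ticker": "TSLA", "close": 1.0, "volume": 100}],
    "2020-10-30": [{"ticker": "TSLA", "close": 2.0, "volume": 150}],
}

def series_plan_a(db, ticker):
    """Plan A: the time series is a single collection, just sort it."""
    return sorted(db[ticker], key=lambda doc: doc["date"])

def series_plan_b(db, ticker):
    """Plan B: visit every date collection, filter by ticker, then merge."""
    series = []
    for day in sorted(db):
        for doc in db[day]:
            if doc["ticker"] == ticker:
                row = {k: v for k, v in doc.items() if k != "ticker"}
                series.append({"date": day, **row})
    return series
```

Both functions return the same rows for TSLA, but Plan B has to touch every date collection to build them, which is the merge cost noted in the minutes.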

Features come later - more supportive role, built upon research request.

TW: The functions used to generate different metadata such as calendar, listing and delisting dates and volume should be put into a single script.
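A single metadata script along these lines might start out as below, working on rows read from the CSVs before anything is moved into MongoDB. The function names, the row layout, and the choice to rank by total volume are assumptions; the sample rows are made up.

```python
from collections import defaultdict

def trading_calendar(rows):
    """Sorted list of the distinct dates that appear in the data."""
    return sorted({row["date"] for row in rows})

def rank_by_volume(rows):
    """Tickers ordered by total traded volume, largest first (assumed metric)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["ticker"]] += row["volume"]
    return sorted(totals, key=lambda ticker: totals[ticker], reverse=True)

# Hypothetical rows, as they might look after reading a few CSVs.
rows = [
    {"ticker": "AAA", "date": "2020-10-29", "volume": 100},
    {"ticker": "BBB", "date": "2020-10-29", "volume": 300},
    {"ticker": "AAA", "date": "2020-10-30", "volume": 50},
]
calendar = trading_calendar(rows)
ranking = rank_by_volume(rows)
```

A function for listing and delisting dates would sit in the same script, so that all metadata is produced from one place.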

ThomasWong2022 commented 3 years ago

Updated the minutes with comments.