Yoontae6719 / SimStock-Representation-Model-for-Stock-Similarities

Official Implementation of SimStock : Representation Model for Stock Similarities
GNU General Public License v3.0
61 stars, 12 forks

1_get_dataset.ipynb file is corrupted and cannot be opened #1

Closed. aEgoist closed this issue 7 months ago

aEgoist commented 7 months ago

Hello. According to the documentation, we need the raw data to replicate the whole process, but the 1_get_dataset.ipynb file is corrupted and cannot be opened. Could you please upload it again? Many thanks!

Yoontae6719 commented 7 months ago

Thank you very much for your interest in my research. I plan to add several datasets to this study (three-statement financial data, firm descriptions, etc.) and to release the results of some additional experiments.

Below is an example of collecting data for stocks listed on the NASDAQ and NYSE exchanges; the code has not been cleaned up yet. If you refer to preprocess_stock in utils, you should be able to get a representation vector using only the code below, although I am not sure. Data from financial prep may require payment, so we will provide as much clean code as possible that can be replicated without payment, i.e., by collecting data through other libraries.

As mentioned in the README, additional code cleanup will begin in September, once my doctoral defense is over. If you have any further questions, please contact me at yoontae@unist.ac.kr for a faster response.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from concurrent.futures import ThreadPoolExecutor, as_completed

import FinanceDataReader as fdr
import numpy as np
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import LabelEncoder, StandardScaler
from yahoo_fin import stock_info as si
from yahoo_fin.stock_info import get_data

from utils.prepro import preprocess_stock

nasdaq_symbol = fdr.StockListing('NASDAQ')
nyse_symbol = fdr.StockListing('NYSE')

nyse_symbol["exchange"] = "NYSE"
nasdaq_symbol["exchange"] = "NASDAQ"
# Exclude funds: "펀드" is Korean for "fund".
nyse_symbol = nyse_symbol[~nyse_symbol["Industry"].str.contains("펀드", na=False)]
nasdaq_symbol = nasdaq_symbol[~nasdaq_symbol["Industry"].str.contains("펀드", na=False)]

stock_list = pd.concat([nasdaq_symbol, nyse_symbol]).dropna().reset_index(drop = True)
stock_list = stock_list.drop("Name", axis = 1)
stock_list = stock_list.drop_duplicates().reset_index(drop = True)

def fetch_data(symbol, exchange, industry_code):
    try:
        if exchange == "TSE":
            symbol = symbol + ".T"
        elif exchange == "SZSE":
            symbol = symbol + ".SZ"
        elif exchange == "SSE":
            symbol = symbol + ".SS"
        elif exchange == "KOSPI":
            symbol = symbol + ".KS"    

        ticker = yf.Ticker(symbol)
        df = ticker.history(start="2018-01-01", end="2023-11-21", interval="1d").reset_index()
        df = df[["Date", "Open", "High", "Low", "Close", "Volume"]]
        df = preprocess_stock(df)
        df["Stock_"] = symbol
        df["IndustryCode_"] = industry_code
    except Exception as e:
        print(f"Error fetching data for {symbol}: {e}")
        return None

    return df

def get_all_data(stock_list):
    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(fetch_data, symbol, exchange, industry_code)
            for symbol, exchange, industry_code in zip(
                stock_list["Symbol"], stock_list["exchange"], stock_list["IndustryCode"]
            )
        ]
        results = [future.result() for future in as_completed(futures)]

    data = pd.concat([result for result in results if result is not None])

    return data

# Despite the name, this holds both NASDAQ and NYSE stocks, since stock_list combines the two.
nasdaq_data = get_all_data(stock_list)
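For anyone adapting the snippet above, here is a minimal sketch of the same fetch-in-parallel pattern. It replaces the exchange branching with a lookup table and records which symbols failed. The names `YAHOO_SUFFIX`, `to_yahoo_symbol`, `collect`, and `fake_fetch` are hypothetical and not part of the repository, and `fake_fetch` is a stand-in for the real network call, so the example runs offline. This is only an illustration of the pattern, not the author's cleaned-up code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Yahoo Finance ticker suffixes for the exchanges handled in fetch_data above.
YAHOO_SUFFIX = {"TSE": ".T", "SZSE": ".SZ", "SSE": ".SS", "KOSPI": ".KS"}


def to_yahoo_symbol(symbol, exchange):
    """Append the Yahoo Finance suffix for the exchange, if there is one."""
    return symbol + YAHOO_SUFFIX.get(exchange, "")


def collect(fetch, jobs, max_workers=8):
    """Run fetch(yahoo_symbol, exchange) concurrently for each (symbol, exchange) job.

    Returns (results, failed): results maps Yahoo symbol -> fetched value,
    failed is a sorted list of symbols whose fetch raised or returned None.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its symbol so failures can be reported by name.
        future_to_symbol = {}
        for symbol, exchange in jobs:
            yahoo_symbol = to_yahoo_symbol(symbol, exchange)
            future = executor.submit(fetch, yahoo_symbol, exchange)
            future_to_symbol[future] = yahoo_symbol
        for future in as_completed(future_to_symbol):
            symbol = future_to_symbol[future]
            try:
                value = future.result()
            except Exception as exc:
                print(f"Error fetching data for {symbol}: {exc}")
                value = None
            if value is None:
                failed.append(symbol)
            else:
                results[symbol] = value
    return results, sorted(failed)


def fake_fetch(symbol, exchange):
    """Hypothetical stand-in for a network call; fails for the symbol 'BAD'."""
    if symbol == "BAD":
        raise ValueError("no data")
    return {"symbol": symbol, "exchange": exchange}


if __name__ == "__main__":
    jobs = [("AAPL", "NASDAQ"), ("005930", "KOSPI"), ("BAD", "NYSE")]
    results, failed = collect(fake_fetch, jobs)
    print(sorted(results), failed)
```

In real use, `fetch` would wrap the `yf.Ticker(...).history(...)` call and `preprocess_stock`, and the values in `results` would be DataFrames. Note that `pd.concat` raises a `ValueError` when given an empty list, so it may be worth checking that `results` is non-empty before concatenating.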