TeresaYang00 commented 4 days ago

MyGroup Review _Python - Crawler that downloads, as PDF files, the "Financial Report (e-book)" of every company on MOPS

Version: 2023

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

download_dir = "/Users/teresayang/Desktop/上市上櫃公司財報"

chrome_options = Options()
chrome_options.add_experimental_option(
    'prefs',
    {'download.prompt_for_download': False,
     'plugins.always_open_pdf_externally': True,
     'download.default_directory': download_dir})

driver = webdriver.Chrome(options=chrome_options)

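# /html/body/center/form/table[2]/tbody/tr[2]/td[8]/a   XPath of the first financial report

# /html/body/center/form/table[2]/tbody/tr[11]/td[8]/a  XPath of the last financial report

# /html/body/center/form/table[2]/tbody/tr/td[8]/a      with no row index, this matches every financial report link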

# Read the company list from the Desktop

file_path = '/Users/teresayang/Desktop/TW_company_code.csv'
df = pd.read_csv(file_path, encoding='unicode_escape')

# Only the first column is needed

company_code= df.iloc[:, 0]

year = 112  # ROC year 112 = 2023

for code in company_code:
    driver.get(f'https://doc.twse.com.tw/server-java/t57sb01?step=1&co_id={code}&year=112&mtype=A')

    xpath_list = driver.find_elements(By.XPATH, '/html/body/center/form/table[2]/tbody/tr/td[8]/a')

    # Store the handle of the original window
    original_window = driver.current_window_handle

    # Loop through the list and click each element
    for element in xpath_list:
        element.click()
        time.sleep(3)

        # Switch to the new window
        driver.switch_to.window(driver.window_handles[-1])

        try:
            # Perform actions in the new window
            true_link = driver.find_element(By.XPATH, '/html/body/center/a')
            true_link.click()
            time.sleep(25)
            # Print a progress message
            print(f'Downloaded {code}')

        except Exception as e:
            print(f"An error occurred: {e}")

        # Close the new window
        driver.close()

        # Switch back to the original window
        driver.switch_to.window(original_window)

# Clean up

driver.quit()
print("done")
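The script above waits a fixed 25 seconds after each click. As a hedged alternative, the sketch below polls the download folder instead. It relies on Chrome giving in-progress downloads a `.crdownload` suffix; the function name `wait_for_downloads` and its parameters are hypothetical and not part of the original script. It cannot tell "finished" from "not started yet", so a short pause after the click is still assumed.

```python
import time
from pathlib import Path


def wait_for_downloads(directory, timeout=60.0, poll_interval=0.5):
    """Return True once no '*.crdownload' files remain in `directory`.

    Chrome keeps a '.crdownload' suffix on a file while it is still being
    downloaded. Returns False if `timeout` seconds elapse first.
    """
    deadline = time.monotonic() + timeout
    while True:
        # No partial files left: treat all downloads as complete.
        if not any(Path(directory).glob("*.crdownload")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In the loop, `time.sleep(25)` could then become something like `time.sleep(3)` followed by `wait_for_downloads(download_dir)`, with a `False` result logged as a possible failed download.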

TeresaYang00 commented 4 days ago

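Took 1 hr. Today I checked whether the crawler is running normally. The current version targets the 2023 path for listed companies; I am also adjusting the crawler path for OTC-listed companies.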

TeresaYang00 commented 3 days ago

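Current progress on the 2023 financial reports:
6/25: the crawler ran until 20:16 and then stopped.
6/26 morning: resolved the problem. The company with stock code 2534 was skipped outright and will be downloaded manually later.
6/26 18:41: crawled up to stock code 8438.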
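Since the run stopped partway and one code had to be skipped, a resume helper may save re-crawling. The following is a minimal sketch, not the code used in this issue: `codes_after`, `run_with_skip_log`, and the `download_one` callback are hypothetical names, and the stock codes other than 2534 and 8438 are made-up sample values.

```python
def codes_after(codes, last_done):
    """Return the codes that follow `last_done`, preserving order.

    If `last_done` is not in the list, every code is returned.
    """
    codes = [str(c) for c in codes]
    last_done = str(last_done)
    if last_done not in codes:
        return codes
    return codes[codes.index(last_done) + 1:]


def run_with_skip_log(codes, download_one):
    """Call download_one(code) for each code and collect the failures.

    Returns a list of (code, error message) pairs for manual follow-up.
    """
    skipped = []
    for code in codes:
        try:
            download_one(code)
        except Exception as exc:
            skipped.append((code, str(exc)))
    return skipped


if __name__ == "__main__":
    sample = ["1101", "2534", "8438", "9958"]  # sample list, not the real CSV
    print(codes_after(sample, 8438))
```

With the real list, `codes_after(company_code, 8438)` would give the codes still to crawl, and the pairs returned by `run_with_skip_log` would record companies such as 2534 that need a manual download.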

TeresaYang00 commented 2 days ago

Code adjusted for OTC-listed companies, 2023

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

download_dir = "/Users/teresayang/Desktop/上市上櫃公司財報"

chrome_options = Options()
chrome_options.add_experimental_option(
    'prefs',
    {'download.prompt_for_download': False,
     'plugins.always_open_pdf_externally': True,
     'download.default_directory': download_dir})

driver = webdriver.Chrome(options=chrome_options)

# Read the list of company codes

file_path = '/Users/teresayang/Desktop/TW_company_code.csv'
df = pd.read_csv(file_path, encoding='unicode_escape')

# Take the company codes from the first column

company_code = df.iloc[:, 0]

# Changed to the URL format for OTC-listed company financial reports

base_url = 'https://mops.twse.com.tw/mops/web/t51sb01'

for code in company_code:
    driver.get(f'{base_url}?co_id={code}&year=112&mtype=A')

    xpath_list = driver.find_elements(By.XPATH, '/html/body/center/form/table[2]/tbody/tr/td[8]/a')

    # Store the handle of the original window
    original_window = driver.current_window_handle

    # Click each financial report link in turn
    for element in xpath_list:
        element.click()
        time.sleep(3)

        # Switch to the newly opened window
        driver.switch_to.window(driver.window_handles[-1])

        try:
            # Perform actions in the new window
            true_link = driver.find_element(By.XPATH, '/html/body/center/a')
            true_link.click()
            time.sleep(25)
            print(f'Downloaded {code}')

        except Exception as e:
            print(f"An error occurred: {e}")

        # Close the new window
        driver.close()

        # Switch back to the original window
        driver.switch_to.window(original_window)

# Clean up

driver.quit()
print("done")
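Both scripts hard-code `year=112` in the query string. Purely as an illustration, the sketch below builds the URL from a Gregorian year instead, using the ROC calendar offset (Gregorian year minus 1911). `to_roc_year` and `report_url` are hypothetical helpers; the base URL and the `step`, `co_id`, `year`, and `mtype` parameters are copied from the first script, and whether the OTC endpoint accepts the same parameters is not verified here.

```python
from urllib.parse import urlencode

# Base URL taken from the first script in this thread.
BASE_URL = "https://doc.twse.com.tw/server-java/t57sb01"


def to_roc_year(gregorian_year):
    """Convert a Gregorian year to the ROC calendar year (e.g. 2023 -> 112)."""
    return gregorian_year - 1911


def report_url(code, gregorian_year, base_url=BASE_URL):
    """Build the report-list URL for one company code and one year."""
    params = {
        "step": 1,
        "co_id": code,
        "year": to_roc_year(gregorian_year),
        "mtype": "A",
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    for year in (2021, 2022, 2023):
        print(report_url(2534, year))
```

Looping over several years this way avoids editing the f-string by hand for each run.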

CAFECA-IO / iSunFA

Confirm crawler runs for the 2021-2023 financial reports #1264
