AI4Finance-Foundation / FinGPT

FinGPT: Open-Source Financial Large Language Models! Revolutionize 🔥 We release the trained model on HuggingFace.
https://ai4finance.org
MIT License
13.48k stars 1.88k forks

Dataset sharing #70

Closed · YiFraternity closed this 1 year ago

YiFraternity commented 1 year ago

Could anyone share the data downloaded from Eastmoney (东方财富)? I have been running download_contents for a whole day; the program is still running but produces no output at all.

YiFraternity commented 1 year ago

I found the problem: in the previous code, if a request never managed to fetch the data, it would loop forever. The retry loop below (an excerpt from the fetch function) now gives up after 5 failures:

    # Return None placeholders after 5 failed requests instead of looping forever
    ok = False   # set True once a request succeeds
    idx = 0      # consecutive-failure counter
    while not ok:
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies, timeout=15)
            if response.status_code != 200:
                idx = idx + 1
                if idx > 5:
                    idx = 0
                    return [None] * len(new_columns)
                continue
            res = etree.HTML(response.text)
            res = res.xpath("//script[2]//text()")[0]
            res = json.loads(res[17:])  # strip the 17-character JS prefix before the JSON payload
            ok = True
            idx = 0
            return [res[_] for _ in new_columns]
        except Exception:
            idx = idx + 1
            if idx > 5:
                idx = 0
                return [None] * len(new_columns)
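
For what it's worth, the hand-rolled counter could also be replaced with a session-level retry policy; a minimal sketch using urllib3's Retry, assuming the same url, headers, and proxies as above:

    # sketch: retries with backoff handled by the transport layer instead of a counter
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    retry = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    response = session.get(url, headers=headers, proxies=proxies, timeout=15)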
gubo303112564 commented 1 year ago

All sorted now? If you still have problems I can help you sort it out.

gubo303112564 commented 1 year ago

Eastmoney's data should be quite easy to download.

YiFraternity commented 1 year ago

To prevent the infinite loop, I rewrote it to make requests in multiple iterative rounds:

import os
import pandas as pd
import numpy as np
import requests
from lxml import etree
import multiprocessing as mp
import json
import time

# result_path: the title-only results (the input path)
# result_path = r"/root/finance/FinanceGPT/data/titles"

# CONTENT_PATH: the results with both titles and contents (the output path)
VERSION = 1
CONTENT_PATH = r"/root/finance/FinanceGPT/data/titles_with_content_latest"
link_base = "https://guba.eastmoney.com"

idx = 0
MAX_REREQUEST = 15

new_columns = ['post_user', 'post_guba', 'post_publish_time', 'post_last_time',
    'post_display_time', 'post_ip', 'post_checkState', 'post_click_count',
    'post_forward_count', 'post_comment_count', 'post_comment_authority',
    'post_like_count', 'post_is_like', 'post_is_collected', 'post_type',
    'post_source_id', 'post_top_status', 'post_status', 'post_from',
    'post_from_num', 'post_pdf_url', 'post_has_pic',
    'has_pic_not_include_content', 'post_pic_url', 'source_post_id',
    'source_post_state', 'source_post_user_id', 'source_post_user_nickname',
    'source_post_user_type', 'source_post_user_is_majia',
    'source_post_pic_url', 'source_post_title', 'source_post_content',
    'source_post_abstract', 'source_post_ip', 'source_post_type',
    'source_post_guba', 'post_video_url', 'source_post_video_url',
    'source_post_source_id', 'code_name', 'product_type', 'v_user_code',
    'source_click_count', 'source_comment_count', 'source_forward_count',
    'source_publish_time', 'source_user_is_majia', 'ask_chairman_state',
    'selected_post_code', 'selected_post_name', 'selected_relate_guba',
    'ask_question', 'ask_answer', 'qa', 'fp_code', 'codepost_count',
    'extend', 'post_pic_url2', 'source_post_pic_url2', 'relate_topic',
    'source_extend', 'digest_type', 'source_post_atuser',
    'post_inshare_count', 'repost_state', 'post_atuser', 'reptile_state',
    'post_add_list', 'extend_version', 'post_add_time', 'post_modules',
    'post_speccolumn', 'post_ip_address', 'source_post_ip_address',
    'post_mod_time', 'post_mod_count', 'allow_likes_state',
    'system_comment_authority', 'limit_reply_user_auth', 'post_id',
    'post_title', 'post_content', 'post_abstract', 'post_state']

def update_new_columns(x):
    global idx, new_columns, MAX_REREQUEST

    if not pd.isna(x['post_user']):
        # this row already has content; keep its existing values
        values = x[new_columns].values.tolist()
        return values

    print(f"当前content link: {x['content link']}")
    url = link_base + x["content link"]

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0",
        "Referer": "https://guba.eastmoney.com/",
    }
    # Fill in your own KuaiDaiLi tunnel proxy credentials
    tunnel = "YOUR_KUAIDAILI_TUNNEL"
    username = "YOUR_KUAIDAILI_USERNAME"
    password = "YOUR_KUAIDAILI_PASSWORD"
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel}
    }
    # Note: requests reads retry settings from an HTTPAdapter, not from a
    # requests.DEFAULT_RETRIES attribute, and Session has no keep_alive flag;
    # sending a "Connection: close" header is the reliable way to drop connections.
    s = requests.session()

    ok = False
    while not ok:
        try:
            response = s.get(url=url, headers=headers, proxies=proxies, timeout=15)
            if response.status_code != 200:
                idx = idx + 1
                if idx > MAX_REREQUEST:
                    idx = 0
                    return [None] * len(new_columns)
                continue
            res = etree.HTML(response.text)
            res = res.xpath("//script[2]//text()")[0]
            res = json.loads(res[17:])  # strip the 17-character JS prefix before the JSON payload
            ok = True
            idx = 0
            return [res[_] for _ in new_columns]
        except Exception:
            idx = idx + 1
            if idx > MAX_REREQUEST:
                idx = 0
                return [None] * len(new_columns)

def get_content(file_name):
    global new_columns
    content_path = CONTENT_PATH

    df = pd.read_csv(os.path.join(content_path, file_name))
    # df = df[0:50]  # debug: limit to the first 50 rows
    need_retry = df['post_last_time'].isnull()
    if not any(need_retry):
        return
    df[new_columns] = df.apply(update_new_columns, axis=1, result_type='expand')

    to_path = os.path.join(content_path, file_name)
    df.to_csv(to_path, index=False)

def is_continue(file_list):
    continue_tag = False
    content_path = CONTENT_PATH
    for fle in file_list:
        df = pd.read_csv(os.path.join(content_path, fle))
        need_retry = df['post_last_time'].isnull()
        if any(need_retry):
            print(fle)
            continue_tag = True
            break
    return continue_tag

if __name__ == "__main__":
    pool_list = []
    res_list = []
    file_list = os.listdir(CONTENT_PATH)
    # get_content(file_list[0])
    times = 1
    continue_tag = True
    while continue_tag:
        pool = mp.Pool(processes=12)
        print(f'-------------------------- Update round {times} --------------------------')
        for i in file_list:
            print(i)
            res = pool.apply_async(get_content, args=(i,), error_callback=lambda x: print(x))
            pool_list.append(res)  # collect every task handle, not just the last one

        # # Collect worker results if needed
        # for i in pool_list:
        #     res_list.append(i.get())
        pool.close()
        pool.join()

        print(f"--------------------第{times}更新 Done!--------------------")
        continue_tag = is_continue(file_list)
        times = times + 1

    # file_list = os.listdir(CONTENT_PATH)
    # continue_tag = True
    # times = 1
    # while continue_tag:
    #     print(f'-------------------------- Update round {times} --------------------------')
    #     for i in file_list:
    #         print(i)
    #         get_content(i)
    #     continue_tag = is_continue(file_list)
    #     times = times + 1
YiFraternity commented 1 year ago

> All sorted now? If you still have problems I can help you sort it out.

Sorted, thanks!

tryssss commented 11 months ago

@YiFraternity Thanks for sharing. Also, remember to remove your account, password, and tunnel from the code....

JBYhjz commented 10 months ago

> All sorted now? If you still have problems I can help you sort it out.
>
> Sorted, thanks!

Hello, I'm running into a problem now too: with this code, there is only one round of updates and then it exits. How do I solve this?

YiFraternity commented 10 months ago

> Hello, I'm running into a problem now too: with this code, there is only one round of updates and then it exits. How do I solve this?

This code is mainly there to prevent an infinite loop, i.e. to stop re-requesting data that can never be fetched. I'm not sure what your problem is; can you try debugging it?
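
One thing worth checking as a quick diagnostic: is_continue only keeps the outer loop going while some file still has empty post_last_time rows, so the script stops after one round if every file already looks complete (or crashes if the column is missing entirely). A minimal sketch, reusing CONTENT_PATH, os, and pandas from the script above:

    # diagnostic sketch: report which files would keep the retry loop alive
    for fle in os.listdir(CONTENT_PATH):
        df = pd.read_csv(os.path.join(CONTENT_PATH, fle))
        if 'post_last_time' not in df.columns:
            print(fle, "has no post_last_time column at all")
        elif df['post_last_time'].isnull().any():
            print(fle, "still has rows waiting for a retry")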

JBYhjz commented 10 months ago

[Screenshot 2023-11-13 13.11.03] It just shows this and exits.

YiFraternity commented 10 months ago
Is your file_list empty?

JBYhjz commented 10 months ago

It is empty. Isn't that path supposed to hold the downloaded files? Did you copy the downloaded title CSV files into the final content output folder?

YiFraternity commented 10 months ago

You need to download the titles first before you can download the content. file_list is the list of title files that have already been downloaded.
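
That is, CONTENT_PATH has to be seeded with the title CSVs before this script runs. A minimal sketch; TITLE_PATH here is an assumption taken from the commented-out result_path in the script above:

    # hedged sketch: copy the title-only CSVs into CONTENT_PATH so file_list is non-empty
    # TITLE_PATH is an assumption (the commented-out result_path above)
    import os
    import shutil

    TITLE_PATH = r"/root/finance/FinanceGPT/data/titles"
    for f in os.listdir(TITLE_PATH):
        shutil.copy(os.path.join(TITLE_PATH, f), os.path.join(CONTENT_PATH, f))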

JBYhjz commented 10 months ago

I've got the titles downloaded now too and put them into that folder, but then another error comes up (screenshot below).

JBYhjz commented 10 months ago

[Screenshot 2023-11-13 13.50.52]

YiFraternity commented 10 months ago

Most likely the scraped data is missing the ['post_last_time'] field. If all else fails, wrap it in a try/except?

JBYhjz commented 10 months ago

> Most likely the scraped data is missing the ['post_last_time'] field. If all else fails, wrap it in a try/except?

I added it, but something still seems wrong. Is it looking for these columns in the CSV files? These CSV files simply don't have those columns. [Screenshot 2023-11-13 14.15.30]
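
If the CSVs really lack those columns (i.e. they are still title-only files), one option is to create the missing columns before get_content runs, so need_retry marks every row as pending. A hedged sketch reusing new_columns from the script above:

    # hedged sketch: create any missing content columns so the retry check works
    def ensure_content_columns(df):
        for col in new_columns:
            if col not in df.columns:
                df[col] = None  # NaN here makes update_new_columns treat the row as unfetched
        return df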

JBYhjz commented 9 months ago

> Most likely the scraped data is missing the ['post_last_time'] field. If all else fails, wrap it in a try/except?

Then could you share your downloaded data with me? 0 v 0

YiFraternity commented 9 months ago

Send me an email address and I'll send you a Baidu Netdisk link.

JBYhjz commented 9 months ago

> Send me an email address and I'll send you a Baidu Netdisk link.

Thank you so much! 357578054@qq.com

JBYhjz commented 9 months ago

> Send me an email address and I'll send you a Baidu Netdisk link.
>
> Thank you so much! 357578054@qq.com

When you get a chance, please send me a copy. Much appreciated (o^^o)

YiFraternity commented 9 months ago

> When you get a chance, please send me a copy. Much appreciated (o^^o)

Sent.

JBYhjz commented 9 months ago

> Sent.

Received 🫡, thank you so much!!!

JBYhjz commented 9 months ago

When you were using add_label.py to generate labels, did you ever run into TypeError: '<=' not supported between instances of 'datetime.date' and 'float'? I can only generate labels for some of the files; a small number of files fail to generate labels.