TeamWiseFlow / wiseflow

Wiseflow is an agile information mining tool that extracts concise messages from various sources such as websites, WeChat official accounts, social platforms, etc. It automatically categorizes and uploads them to the database.

How do I restrict crawling to the second-level pages of a URL? #120

Open linopluss opened 2 days ago

linopluss commented 2 days ago

As the title says: for example, I only want to crawl content under https://au.news.yahoo.com or https://news.yahoo.com/au/, not the content of the entire site. How can I do this?

bigbrother666sh commented 1 day ago

Just enter the subdomain directly in site.

linopluss commented 22 hours ago

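> Just enter the subdomain directly in site.

Thank you for the reply. I have already tried using the subdomain, but the crawler still fetched content from the whole site. For example, it fetched content under https://news.yahoo.com/fr/. I only want the Australian content, not the French content. Is there any way to achieve this?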

linopluss commented 21 hours ago

The log is below. This is for another subdomain I configured. The target is the content under https://www.midea.com/AU/,

but what was actually crawled was https://www.midea.com/us/ and https://www.midea.com/de//

2024-11-15 12:12:07 core-1 | 2024-11-15 01:12:07.490 | DEBUG | insights:pipeline:34 - start processing https://www.midea.com/us/ranges/freestanding-ranges-electric
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.240 | DEBUG | insights:pipeline:59 - article: Free Standing Ranges Electric
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.244 | DEBUG | llms.openai_wrapper:openai_llm:22 - messages:
2024-11-15 12:12:08 core-1 | [{'role': 'system', 'content': "Please carefully read the news content provided by the user and analyze it according to the list of type labels given below:\n['fridge', 'refrigerator', 'freezer', 'cooling appliance']\n\nThe meanings of each label are as follows:\nfridgefridge\nrefrigeratorrefrigerators\nfreezer freezers\ncooling appliancecooling appliance\n\nIf the news contains any information of the aforementioned types, please mark the type label of the information using the following format and provide a one-sentence summary containing only the time, location, people involved, and event:\nTypeLabelA one-sentence summary containing only the time, location, people involved, and event\n\nPlease be sure to: 1. Strictly adhere to the original text and do not provide information not contained in the original; 2. For the same event, choose only one most appropriate label and do not repeat the output; 3. If the news contains multiple pieces of information, analyze them one by one and output them in a one-line-per-item format. If the news does not involve any of the types of information, simply output: None."}, {'role': 'user', 'content': 'title: Free Standing Ranges Electric\n\ncontent: [from midea] Our freestanding electric ranges feature are sleek and functional. Each one features a powerful yet flexible cooktop, an edge-to-edge glass oven window, and convenient Easy-Clean technology.'}]
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.244 | DEBUG | llms.openai_wrapper:openai_llm:23 - model: THUDM/glm-4-9b-chat
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.244 | DEBUG | llms.openai_wrapper:openai_llm:24 - kwargs:
2024-11-15 12:12:08 core-1 | {'temperature': 0.1}
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.715 | DEBUG | llms.openai_wrapper:openai_llm:43 - result:
2024-11-15 12:12:08 core-1 | Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\nNone', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.716 | DEBUG | llms.openai_wrapper:openai_llm:44 - usage:
2024-11-15 12:12:08 core-1 | CompletionUsage(completion_tokens=2, prompt_tokens=312, total_tokens=314, completion_tokens_details=None, prompt_tokens_details=None)
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.717 | DEBUG | insights.get_info:get_info:94 - can not find info, llm result:
2024-11-15 12:12:08 core-1 |
2024-11-15 12:12:08 core-1 | None
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.718 | INFO | insights:pipeline:32 - https://www.midea.com/content/dam/midea-aem/de/presse/20240529-Pressemitteilung-Midea-Golden-Hygiene-Label-B2C.pdf is a file, skip
2024-11-15 12:12:08 core-1 | 2024-11-15 01:12:08.718 | DEBUG | insights:pipeline:34 - start processing https://www.midea.com/de/support/FAQ
2024-11-15 12:12:09 core-1 | 2024-11-15 01:12:09.120 | INFO | scrapers.general_crawler:general_crawler:98 - can not reach

bigbrother666sh commented 12 hours ago

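Then the pages under that subdomain must contain hyperlinks to other subdomains, and the crawler follows them. You can use a tag in tags to restrict extraction to Australia-related content only.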

bigbrother666sh commented 12 hours ago

Or write a custom information extractor dedicated to that site.
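
One way such a custom extractor could keep the crawl inside a single regional section is to filter discovered links by host and path prefix before they are queued. The following is only a minimal sketch of that idea using the Python standard library. It is not part of wiseflow's API: the helper name `in_scope`, the `allowed_prefixes` parameter, and the case-insensitive path comparison are all assumptions made for illustration, and the `/au/refrigerators` link is a made-up example URL.

```python
from urllib.parse import urlsplit


def in_scope(url: str, allowed_prefixes: list[str]) -> bool:
    """Return True if `url` is on the same host as, and under the path of,
    at least one allowed prefix. (Hypothetical helper, not a wiseflow function.)"""
    target = urlsplit(url)
    path = target.path.lower()
    for prefix in allowed_prefixes:
        base = urlsplit(prefix)
        # Different host (e.g. another subdomain): out of scope.
        if target.netloc.lower() != base.netloc.lower():
            continue
        # Assumption: path case is ignored, since the issue mixes /AU/ and /us/.
        base_path = base.path.rstrip("/").lower()
        # Match the section root itself or anything below it, so that
        # "/au" covers "/au/..." but not "/australia".
        if path == base_path or path.startswith(base_path + "/"):
            return True
    return False


allowed = ["https://www.midea.com/AU/"]
links = [
    "https://www.midea.com/au/refrigerators",  # illustrative URL, not from the log
    "https://www.midea.com/us/ranges/freestanding-ranges-electric",
    "https://www.midea.com/de/support/FAQ",
]
kept = [u for u in links if in_scope(u, allowed)]
```

With this filter, the `/us/` and `/de/` links from the log above would be dropped and only links under `/au/` would remain in `kept`. Where the filter would be called inside a scraper depends on how that scraper collects links, which this sketch does not cover.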