cofacts / rumors-api

GraphQL API server for clients like rumors-site and rumors-line-bot
https://api.cofacts.tw
MIT License

Index the title and content of URLs in page #41

Closed MrOrz closed 5 years ago

MrOrz commented 7 years ago

Problems to solve:

  1. During retrieval we basically cannot do anything with a link; even if the article behind the link is highly relevant, it cannot be found.
  2. When the LINE bot shows the matched articles for the user to choose from, the display is quite unfriendly for links.
  3. Editors have to click through to see what a link is about, which is a hassle, and related-article matching works poorly for links.

If we could record, for each link:

  1. Title
  2. Canonical URL (after redirect)
  3. Content

and include them in the full-text search, things would be much more convenient.
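The first step of such a pipeline is pulling the links out of a message. A minimal sketch in plain JS (the function name and regex are assumptions for illustration, not the actual implementation):

```javascript
// Extract http(s) URLs from free-form message text so each one can later be
// scraped for its title, canonical URL, and content.
// The regex stops at whitespace and common CJK punctuation.
function extractUrls(text) {
  const URL_REGEX = /https?:\/\/[^\s,，。]+/g;
  return text.match(URL_REGEX) || [];
}

const message = '這是新聞 https://example.com/news/123 請查證';
console.log(extractUrls(message)); // [ 'https://example.com/news/123' ]
```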

MrOrz commented 7 years ago

https://github.com/ageitgey/node-unfluff/blob/master/README.md This definitely won't work yet for languages like Chinese / Arabic / Korean / etc that need smarter word tokenization.

https://github.com/craftzdog/extract-main-text-node Some people say it is more stable than unfluff for CJK content.

MrOrz commented 7 years ago

https://github.com/inspiredjw/oembed-auto Some media sites may support oEmbed.

MrOrz commented 7 years ago

For webpage loading & rendering, we can use headless Chrome directly; it handles SPAs and can take a screenshot as a bonus: https://github.com/GoogleChrome/puppeteer#readme

Alternatively, use Rendertron: no API integration needed, just spin up a Docker container that does the prerendering: https://github.com/GoogleChrome/rendertron
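With Rendertron, a page is prerendered by requesting its `/render` endpoint. A minimal sketch, assuming a hypothetical self-hosted instance on localhost:

```javascript
// Build the Rendertron /render URL for a page we want prerendered.
// RENDERTRON_BASE is a placeholder for wherever the Docker container runs.
const RENDERTRON_BASE = 'http://localhost:3000';

function rendertronUrl(target) {
  return `${RENDERTRON_BASE}/render/${encodeURIComponent(target)}`;
}

console.log(rendertronUrl('https://example.com/spa#/page'));
// → http://localhost:3000/render/https%3A%2F%2Fexample.com%2Fspa%23%2Fpage
```

Fetching that URL returns the server-side-rendered HTML of the SPA, which can then be fed into any of the content extractors discussed here.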

MrOrz commented 6 years ago

From slack https://g0v-tw.slackarchive.io/cofacts/page-17/ts-1506900396000054

This one is quite impressive and can handle Chinese: http://fivefilters.org/content-only/

How it works: http://www.keyvan.net/2011/03/content-extraction/

Python's goose3 can also handle Chinese: https://github.com/goose3/goose3#goose-in-chinese

MrOrz commented 6 years ago

Goose3 seems really promising! We just need to provide the URL. How neat!

(screenshot of goose3 extraction output, 2017-11-11)

MrOrz commented 6 years ago

As for URL normalization, we can rely on https://github.com/g0v/url-normalizer.js and the canonical URL field.

But this should not be very important, since we mostly use the page content to do the matching.
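A rough sketch of what such normalization could look like with Node's built-in `URL` class (this is not url-normalizer.js's actual logic, just an illustration of the idea):

```javascript
// Normalize a URL: drop utm_* tracking parameters, drop the fragment,
// and sort the remaining query parameters into a canonical order.
function normalizeUrl(input) {
  const u = new URL(input);
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith('utm_')) u.searchParams.delete(key);
  }
  u.hash = '';           // fragments never reach the server anyway
  u.searchParams.sort(); // canonical parameter order
  return u.toString();
}

console.log(normalizeUrl('https://example.com/a?utm_source=fb&id=1#top'));
// → https://example.com/a?id=1
```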

MrOrz commented 6 years ago

Pure goose3 test:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0

  1. Failed to resolve some of the URL shorteners. Maybe they involve multiple redirects?
  2. YouTube pages contain no server-rendered content (other than the meta description), so there is no cleaned_text.
  3. Canonical URLs work for removing utm_source parameters, but some sites may not implement canonical URLs correctly.
  4. Weibo content cannot be resolved.
  5. The layout of some Weixin articles causes some text to be recognized as figure captions and excluded from cleaned_text.
  6. When Chinese stopwords are used, we cannot extract the cleaned_text of English documents.

MrOrz commented 6 years ago

Another possible alternative is Boilerpipe, in Java. Although it's old, it does have an API, and it supports Chinese.

Here is a comparison between goose and boilerpipe: https://gist.github.com/eldilibra/5637215

From the Hacker News thread https://news.ycombinator.com/item?id=2526127 , we also have the following candidates:

MrOrz commented 6 years ago

Mozilla/Readability + Puppeteer test results:

https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=1885459841

test script: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce (Pure JS!)

Quote my feedback from slack:

Although it seems to extract a bit more junk than goose3 (extra dates and such at the beginning and end; the main text of the MyGoPen article gets duplicated twice for some reason @@), I think it is acceptable for our indexing needs.

This JS solution is good enough for me; I have no motivation to test other Python-based solutions now XD

YouTube extraction is still very poor, yet YouTube links happen to be very common. Looks like we need special handling for YouTube @@
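Special-casing YouTube would presumably start by recognizing YouTube links and extracting the video ID, so metadata can be fetched via YouTube's oEmbed or Data API instead of scraping the page. A hypothetical sketch (not actual project code):

```javascript
// Return the YouTube video ID for youtube.com/watch?v=… and youtu.be/… links,
// or null for anything else. The hostname check is deliberately loose.
function getYoutubeVideoId(input) {
  const u = new URL(input);
  if (u.hostname === 'youtu.be') return u.pathname.slice(1) || null;
  if (u.hostname.endsWith('youtube.com')) return u.searchParams.get('v');
  return null;
}

console.log(getYoutubeVideoId('https://www.youtube.com/watch?v=dQw4w9WgXcQ'));
// → dQw4w9WgXcQ
console.log(getYoutubeVideoId('https://youtu.be/dQw4w9WgXcQ'));
// → dQw4w9WgXcQ
```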

However, there is one web page that triggers a bug in Readability. Maybe we should report it.

MrOrz commented 6 years ago

Requirement

https://hackmd.io/s/SyqhWqLKz#Proposal

To be implemented

MrOrz commented 6 years ago

CreateArticle, CreateReply:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert the newly scraped URLs into the urls index.
  3. Write to hyperlinks when scrapUrls returns.

ListArticles + moreLikeThis filter:

  1. Extract URLs from string fields.
  2. Invoke scrapUrls with cache turned on. After fetching, insert the newly scraped URLs into the urls index.
  3. Perform the search with the summary returned by scrapUrls.

Steps 1 and 2 are always used together; consider folding them into scrapUrl().
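The combined extract-then-scrape flow could be sketched like this (all names are hypothetical; a `Map` stands in for the real urls index, and the scraper is injected so the sketch stays self-contained):

```javascript
// Toy sketch of scrapUrls with caching: extract URLs from text, scrape each
// one unless it is already cached, and return the scraped entries.
const cache = new Map(); // url -> { url, title, summary }

async function scrapUrls(text, fetchPage) {
  const urls = text.match(/https?:\/\/[^\s]+/g) || []; // step 1: extract
  return Promise.all(urls.map(async (url) => {
    if (!cache.has(url)) {
      cache.set(url, await fetchPage(url)); // step 2: "insert into urls index"
    }
    return cache.get(url);
  }));
}

// Demo with a stubbed scraper:
const stub = async (url) => ({ url, title: 'stub title', summary: 'stub text' });
scrapUrls('Please check https://example.com/article', stub)
  .then((entries) => console.log(entries[0].title)); // prints "stub title"
```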

Migration script: for each article & reply, perform steps 1–3 of CreateArticle.

MrOrz commented 5 years ago

With #97, #98 and #104 deployed to production, we can now close this 🎉