MrOrz closed this issue 5 years ago
https://github.com/ageitgey/node-unfluff/blob/master/README.md This definitely won't work yet for languages like Chinese / Arabic / Korean / etc that need smarter word tokenization.
https://github.com/craftzdog/extract-main-text-node — some users report it is more stable than unfluff on CJK text
https://github.com/inspiredjw/oembed-auto — some media sites may support oEmbed
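As a sketch of how the oEmbed route could work: per the oEmbed spec, sites advertise their endpoint via a `<link type="application/json+oembed">` tag. The helper name below is made up, and the regex assumes the `type` attribute comes before `href`; a real implementation should use an HTML parser since attribute order and quoting vary in the wild.

```javascript
// Sketch: find a page's oEmbed endpoint from its <link> discovery tag,
// as described by the oEmbed spec. Regex-based for brevity only.
function discoverOembedEndpoint(html) {
  const match = html.match(
    /<link[^>]*type="application\/json\+oembed"[^>]*href="([^"]*)"/i
  );
  return match ? match[1] : null; // null when the site has no discovery tag
}

module.exports = { discoverOembedEndpoint };
```

The returned endpoint URL can then be fetched to get the title and thumbnail without scraping the whole page.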
Webpage loading & rendering: we can use headless Chrome directly to handle SPAs, and grab a screenshot while we are at it: https://github.com/GoogleChrome/puppeteer#readme
Alternatively, use Rendertron — no API to integrate; just spin up a Docker container that does the prerendering: https://github.com/GoogleChrome/rendertron
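The headless-Chrome idea could look roughly like this. `renderAndSnapshot` is a hypothetical helper name; the browser instance is injected, so in real use it would come from `puppeteer.launch()`.

```javascript
// Sketch: render an SPA in headless Chrome and capture a screenshot.
// `browser` is a Puppeteer Browser instance (injected by the caller).
async function renderAndSnapshot(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // let the SPA settle
  const html = await page.content();                   // fully rendered DOM
  const screenshot = await page.screenshot({ type: 'png' });
  await page.close();
  return { html, screenshot };
}

// Typical entry point (assumes puppeteer is installed):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const { html } = await renderAndSnapshot(browser, 'https://example.com');
//   await browser.close();

module.exports = { renderAndSnapshot };
```

The rendered `html` can then be fed into any of the content extractors discussed in this thread.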
From slack https://g0v-tw.slackarchive.io/cofacts/page-17/ts-1506900396000054
Quite impressive — it can handle Chinese: http://fivefilters.org/content-only/
How it works: http://www.keyvan.net/2011/03/content-extraction/
Python's goose3 can also handle Chinese: https://github.com/goose3/goose3#goose-in-chinese
Goose3 seems really promising! We just need to provide the URL. How neat!
As for URL normalization, we can rely on https://github.com/g0v/url-normalizer.js and the canonical URL field.
But this should not be very important, since we mostly use the content to do matching.
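For illustration, here is a minimal normalization pass using only the WHATWG `URL` class — just dropping `utm_*` tracking parameters and the fragment. `normalizeUrl` is a made-up helper name; url-normalizer.js and the canonical URL field would cover the trickier cases.

```javascript
// Sketch: strip utm_* tracking params and the fragment from a URL.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl);
  // Collect keys first so we don't delete while iterating.
  [...u.searchParams.keys()]
    .filter(key => key.startsWith('utm_'))
    .forEach(key => u.searchParams.delete(key));
  u.hash = ''; // fragments never reach the server anyway
  return u.toString();
}

module.exports = { normalizeUrl };
```

Two URLs that differ only in tracking params then normalize to the same key, which is all the matching use case needs.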
Pure goose3 test:
https://docs.google.com/spreadsheets/d/1y1GGc04HBhpU76D6LvX5hqt877X_Rfatc1vSQNJmZ6Q/edit#gid=0
We can strip `utm_source`-style tracking parameters, but there may be some sites not implementing canonical urls correctly. The extracted body text is goose3's `cleaned_text`.
Another possible alternative is Boilerpipe in Java. Although it's old, it does have an API and supports Chinese.
Here is a comparison between goose and boilerpipe: https://gist.github.com/eldilibra/5637215
From the Hacker News thread https://news.ycombinator.com/item?id=2526127 , we also have the following candidates:
Mozilla/Readability + Puppeteer test results:
test script: https://gist.github.com/MrOrz/fb48f27f0f21846d0df521728fda19ce (Pure JS!)
Quoting my feedback from Slack:

Although it seems to extract a bit more junk than goose3 (extra bits such as dates at the head and tail; MyGoPen's main text somehow appears twice @@), I find it acceptable for our indexing needs.
This JS solution is good enough for me; I have no motivation to test other Python-based solutions right now XD
Extraction for YouTube is still terrible, yet YouTube links happen to be very common. Looks like we need special handling for YouTube @@
However, there is one web page that triggers a bug in Readability. Maybe we should report it.
https://hackmd.io/s/SyqhWqLKz#Proposal
- Scrap the URLs found in text and store the result into the `urls` index (or just return cached result; or check cache first before fetching).
- Update `articles` and `replies` using scripting updates.
- Collect `urls` from articles & replies (also fills in their own `hyperlinks` field), given a range of date to scan, and write them into the `urls` index.

`CreateArticle`, `CreateReply`:

1. Invoke `scrapUrls` with cache turned on.
2. After fetch, insert newly searched URLs into the `urls` index.
3. Fill in `hyperlinks` when `scrapUrls` has returned.

`ListArticles` + `moreLikeThis` filter:

1. Invoke `scrapUrls` with cache turned on.
2. After fetch, insert newly searched URLs into the `urls` index.

Steps 1 and 2 are always used together; consider implementing them into a single `scrapUrl()`.

Migration script:

For each article & reply, perform `CreateArticle`'s steps 1~3.
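The combined cache-then-fetch behavior of `scrapUrls()` could be sketched like this. `fetcher` and `urlsIndex` are injected placeholders, not the actual rumors-api implementation (which would sit on Elasticsearch and a real scraper).

```javascript
// Sketch of a combined scrapUrls(): check the urls index first,
// fetch only on a cache miss, and write new results back.
const URL_REGEX = /https?:\/\/[^\s]+/g;

async function scrapUrls(text, { fetcher, urlsIndex }) {
  const urls = text.match(URL_REGEX) || [];
  return Promise.all(
    urls.map(async url => {
      const cached = await urlsIndex.get(url); // check cache first
      if (cached) return cached;               // just return cached result
      const scrapped = await fetcher(url);     // fetch & extract content
      await urlsIndex.put(url, scrapped);      // insert into the urls index
      return scrapped;
    })
  );
}

module.exports = { scrapUrls };
```

Callers such as `CreateArticle` then only need to fill in `hyperlinks` from the returned array, matching the step split in the proposal above.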
With #97, #98 and #104 deployed to production, we can now close this 🎉
The problem we want to solve:
It would be much more convenient if we could record the content of each hyperlink and add it to the full-text search as well.