dhowe / ChinaEye

Experience the web as if you were living in China...
Artistic License 2.0

Fetch search terms from online list #1

Closed dhowe closed 7 years ago

dhowe commented 8 years ago
  1. load best list from some URL on first install
  2. periodically check for updates to list
  3. add additional search engines
cqx931 commented 8 years ago

List candidate: https://github.com/jasonqng/chinese-keywords (more about the list: https://citizenlab.org/2014/12/repository-censored-sensitive-chinese-keywords-13-lists-9054-terms/)

dhowe commented 8 years ago

Yes, when I checked a few weeks ago, I also came across this one. Problems: a) there is a lot of data besides the words themselves, and b) it hasn't been updated in over 2 years.

But if you can extract a set of English words that makes sense to use, we can figure out somewhere to host the list.

cqx931 commented 8 years ago

I would suggest that we start with the "no-dummy-vars-for-categories-and-themes_only-sensitive-words.csv" file from this source. The best thing about this list is that it is fully translated into English. GreatFire.org has an ongoing list of sensitive keywords on Weibo, but there is no English translation. Do we need only the English keywords, or both Chinese and English (which I think makes more sense...)?

dhowe commented 8 years ago

Chinese and English would be ideal, perhaps in pairs? Let's start with the CSV list, perhaps using a very simple format like the one below (note that we may not have English or Chinese for a given phrase):

Chinese phrase 1, English phrase 1
Chinese phrase 2, English phrase 2
Chinese phrase 3,
Chinese phrase 4, English phrase 4
, English phrase 5
Chinese phrase 6, English phrase 6
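
A minimal sketch of how the pair format above could be parsed, assuming one pair per line and a split on the first ASCII comma only. The `TermPair` type and `parseTermList` function are hypothetical names used for illustration; they are not taken from the ChinaEye source.

```typescript
// Hypothetical shape for one list entry; either side may be missing.
interface TermPair {
  chinese: string | null;
  english: string | null;
}

// Parse "Chinese phrase, English phrase" lines into pairs.
// Assumption: the Chinese phrase contains no ASCII comma, so splitting
// on the first comma is enough. Blank lines and lines with neither
// side present are skipped.
function parseTermList(text: string): TermPair[] {
  const pairs: TermPair[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    const comma = line.indexOf(",");
    const left = comma === -1 ? line : line.slice(0, comma);
    const right = comma === -1 ? "" : line.slice(comma + 1);
    const chinese = left.trim() || null;
    const english = right.trim() || null;
    if (chinese === null && english === null) continue;
    pairs.push({ chinese, english });
  }
  return pairs;
}
```

Keeping a missing side as `null` rather than an empty string makes it explicit downstream which language is available for a given phrase.
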
cqx931 commented 8 years ago

Please refer to the following commits for these tasks:

  1. load best list from some URL on first install: https://github.com/dhowe/ChinaEye/commit/f9d588bf3a9d5da3c2d27902d19e314682996fb8
  3. add additional search engines: https://github.com/dhowe/ChinaEye/commit/8f153d75914cd1550e13705d1b902963e1dc1f87

I just realized that I had the wrong remote origin when I pushed... Please let me know whether it is OK to leave it like this for now; otherwise I can revert and open a pull request instead.

cqx931 commented 8 years ago

As for the list, I currently host it on my own server for testing. Do you want to host it on rednoise?

On handling the triggers: I currently add both the Chinese and English terms as triggers. Do you prefer only the English ones as triggers, or both languages? If Chinese is included in the triggers, I'll add a URL decoder for Chinese characters.
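
A rough sketch of the decoding step, assuming the search term arrives percent-encoded in a query parameter (here `q`, which is an assumption and varies by search engine). `extractQuery` and `findTrigger` are hypothetical helper names, and the substring matching shown is only one possible rule; it is not necessarily how the extension matches triggers.

```typescript
// Return the decoded search query from a search URL, or null if the
// parameter is absent or the URL cannot be parsed.
// URLSearchParams percent-decodes UTF-8, so "%E4%B8%AD" comes back as "中".
function extractQuery(url: string, param: string = "q"): string | null {
  try {
    return new URL(url).searchParams.get(param);
  } catch {
    return null; // not a parseable absolute URL
  }
}

// Case-insensitive substring match of the query against trigger terms.
// Returns the first matching trigger, or null if none match.
function findTrigger(query: string, triggers: string[]): string | null {
  const q = query.toLowerCase();
  for (const t of triggers) {
    if (t !== "" && q.indexOf(t.toLowerCase()) !== -1) return t;
  }
  return null;
}
```

With the decoding done once up front, Chinese and English triggers can go through the same comparison.
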

dhowe commented 8 years ago

Why don't we host it on GitHub? And both languages are fine, shouldn't be much overhead...

cqx931 commented 8 years ago

The list is now hosted on GitHub; please check: https://github.com/dhowe/ChinaEye/pull/3

By the way, what do you have in mind for "periodically check for updates"? Something like checking for a list update every week or so? Currently the list is just reloaded whenever Chrome starts...

dhowe commented 8 years ago

uBlock checks every 4 days... but for now, let's store the time of the last check and do an update check whenever Chrome starts, unless less than some minimum interval, like 12 hours, has passed since the last check.
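
The throttled startup check could look roughly like the sketch below. All names (`shouldCheckForUpdate`, `TimestampStore`, `runStartupCheck`) are hypothetical. The storage interface is kept synchronous for brevity; in an extension it might be backed by something like `chrome.storage.local`, which is asynchronous, so real code would need callbacks or promises.

```typescript
const TWELVE_HOURS_MS = 12 * 60 * 60 * 1000;

// Decide whether an update check is due. Timestamps are milliseconds
// since the epoch, as returned by Date.now().
function shouldCheckForUpdate(
  lastCheck: number | null,
  now: number,
  minInterval: number = TWELVE_HOURS_MS
): boolean {
  if (lastCheck === null) return true; // never checked, e.g. first install
  if (now < lastCheck) return true;    // clock moved backwards; check again
  return now - lastCheck >= minInterval;
}

// Hypothetical storage abstraction for the last-check timestamp.
interface TimestampStore {
  get(): number | null;
  set(value: number): void;
}

// Run at browser startup: calls update() only when a check is due,
// then records the time. Returns true if update() was called.
function runStartupCheck(
  store: TimestampStore,
  now: number,
  update: () => void
): boolean {
  if (!shouldCheckForUpdate(store.get(), now)) return false;
  update();
  store.set(now);
  return true;
}
```

Passing `now` in as a parameter, rather than calling `Date.now()` inside, keeps the interval logic easy to test.
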

cqx931 commented 8 years ago

update function: https://github.com/dhowe/ChinaEye/pull/4