dhowe / ChinaEye

Experience the web as if you were living in China...
Artistic License 2.0

Include more current keywords in search lists #68

Closed dhowe closed 7 years ago

dhowe commented 7 years ago

For example: https://citizenlab.org/2017/04/we-cant-chat-709-crackdown-discussions-blocked-on-weibo-and-wechat/

@jasonqng's list has not been updated in quite some time

cqx931 commented 7 years ago

How about combining these lists from citizenlab? https://github.com/citizenlab/chat-censorship Apart from the content in 'TOM-Skype--Sina-UC' and 'LINE', which is already included in jasonqng's list, the other lists are mainly new research results from the past two years, covering different platforms and social media.

dhowe commented 7 years ago

Shall we combine these into our own list, inside ChinaEye?

cqx931 commented 7 years ago

yes, that's what I mean. I can combine the lists and update the wiki afterwards.

cqx931 commented 7 years ago

Lists that we will combine with the current lists:

The svp list and livestream list1 are the same list. This list is huge compared to the previous three (17,044 entries), and there are a few issues that I'm not sure about:

  1. According to the paper describing the list, 7,371 entries are URLs and URL fragments. Shall we ignore these entries when we combine the list?

  2. Of the remaining ~10k entries, some are very long and read more like sentences than simple keywords. I think this is because the list targets social video platforms, so the entries were likely designed for blocking comments. When an entry is long, the machine translation gets worse, and no one would search in English with that exact wording... What's your opinion on these long entries? Shall we include them, or set some rules to limit the length of a search entry?

Example: "六部口追杀学生撤离队伍的三辆坦克" Machine translation: "Mouth kill six student teams to evacuate three tanks"
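
For the first question, dropping the URL entries before merging could look roughly like the sketch below. It is only a heuristic under my own assumptions: `looksLikeUrl` and `dropUrls` are hypothetical names, the patterns are not taken from the Citizen Lab data or paper, and partial fragments (for example a bare path with no domain) would slip through.

```javascript
// Hypothetical helper: guess whether a blocklist entry is a URL or a bare
// domain rather than a search keyword. Heuristic only; not from the source lists.
function looksLikeUrl(entry) {
  const s = entry.trim().toLowerCase();
  return (
    /^[a-z][a-z0-9+.-]*:\/\//.test(s) ||            // explicit scheme, e.g. http://
    /^www\./.test(s) ||                              // leading "www."
    /^[a-z0-9-]+(\.[a-z0-9-]+)+(\/\S*)?$/.test(s)    // bare domain with optional path
  );
}

// Keep only non-empty entries that do not look like URLs.
function dropUrls(entries) {
  return entries.filter((e) => e.trim() !== "" && !looksLikeUrl(e));
}
```

Note that the bare-domain pattern also matches things like version numbers ("3.5"), so the output would still need a manual look.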

dhowe commented 7 years ago

Perhaps we should include only searches of 3 words or fewer?

cqx931 commented 7 years ago

I'm not sure about the criteria here... Three words or less makes sense to me for the English translation, but it is hard to define the same thing in Chinese. I would suggest that we:

  1. Delete the English translation of an entry if it consists of more than 3 words; the Chinese entry might still make sense in this case.
  2. Remove the entry completely if it contains more than 7 Chinese characters (find Chinese characters with this regex: /\p{Han}+/u).
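
A minimal sketch of the two rules above, assuming each entry is an object with a `zh` string and an optional `en` translation (that shape, and the names `countHan`, `countWords` and `applyLengthRules`, are hypothetical, not the actual ChinaEye code). One caveat: JavaScript's `u`-flag regexes need the script spelled `\p{Script=Han}`; the shorter `\p{Han}` form is the PCRE/Perl spelling.

```javascript
// Matches one Han character at a time so the matches can be counted.
const HAN = /\p{Script=Han}/gu;

// Number of Han characters in a string.
function countHan(text) {
  const matches = text.match(HAN);
  return matches ? matches.length : 0;
}

// Number of whitespace-separated words in a string.
function countWords(text) {
  const t = text.trim();
  return t === "" ? 0 : t.split(/\s+/).length;
}

// Hypothetical entry shape: { zh: string, en: string | null }.
// Returns null when the entry should be removed completely.
function applyLengthRules(entry, maxHan = 7, maxWords = 3) {
  // Rule 2: more than maxHan Chinese characters -> drop the whole entry.
  if (countHan(entry.zh) > maxHan) return null;
  // Rule 1: more than maxWords English words -> drop only the translation.
  const en = entry.en && countWords(entry.en) <= maxWords ? entry.en : null;
  return { zh: entry.zh, en };
}
```

With these defaults, the long example quoted earlier in the thread ("六部口追杀学生撤离队伍的三辆坦克") would be removed, while a short entry with a long translation would keep only its Chinese text.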

cqx931 commented 7 years ago

I also have all the lists in separate files (*_full.txt for the full lists after filtering out the URLs, and the other one is a shortened version using the criteria above). Shall I upload them somewhere as a backup?


One exception is the keywords from the 709 crackdown: there are many long keyword combinations in that folder, so I kept those entries even when they have more than 7 Chinese characters.

cqx931 commented 7 years ago

The list backup is here: https://github.com/dhowe/ChinaData

cqx931 commented 7 years ago

Updated the FAQ: https://github.com/dhowe/ChinaEye/wiki#how-does-chinaeyes-search-keyword-testing-work Please check.