dhowe closed this issue 7 years ago
How about combining these lists from citizenlab? https://github.com/citizenlab/chat-censorship Except for the content in 'TOM-Skype--Sina-UC' and 'LINE', which is already included in jasonqng's list, the other lists are mainly new research results from the past two years, covering different platforms and social media.
Shall we combine these into our own list, inside ChinaEye?
yes, that's what I mean. I can combine the lists and update the wiki afterwards.
Lists that we will combine with the current lists:
The svp list and livestream list1 are the same list. This list is huge (17,044 entries) compared to the previous three lists, and there are a few issues that I'm not sure about:
According to the paper describing the list, 7,371 entries are URLs and URL fragments. Shall we ignore these entries when we combine the list?
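If we do ignore the URL entries, the filter could look roughly like the sketch below. The heuristic is my own assumption (a scheme, a "www." prefix, or an ASCII-only "name.tld" shape), not the classification used in the paper, and the function names are hypothetical. It would need checking against the real list before we rely on it.

```javascript
// Heuristic only (an assumption, not the paper's classification): an entry
// counts as a URL or URL fragment if it starts with a scheme or "www.",
// or if it is an ASCII-only "name.tld" string with an optional path/query.
const URL_LIKE =
  /^(?:[a-z][a-z0-9+.-]*:\/\/|www\.)|^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}(?:[\/?#:].*)?$/i;

// Hypothetical helper: true when the entry looks like a URL or URL fragment.
function isUrlLike(entry) {
  return URL_LIKE.test(entry.trim());
}

// Hypothetical helper: keep only the entries that do not look like URLs.
function dropUrlEntries(entries) {
  return entries.filter((entry) => !isUrlLike(entry));
}
```

Plain keywords, including multi-word English ones, pass through unchanged because they contain no scheme and no dot-separated ASCII domain shape.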
For the remaining ~10k entries, some are very long and read more like sentences than simple keywords. I think this is because the list is for social video platforms, so the entries are likely designed for blocking comments. When an entry is long, the machine translation gets worse, and no one would search in English in that exact way... What's your opinion on these long entries? Shall we include them, or set some rules to limit the length of a search entry?
Example: "六部口追杀学生撤离队伍的三辆坦克". Machine translation: "Mouth kill six student teams to evacuate three tanks".
Perhaps we should include only searches of 3 words or fewer?
I'm not sure about the criteria here... Three words or fewer makes sense to me for the English translation, but it is hard to define the same thing in Chinese. I would suggest that we limit entries by the number of Chinese characters they contain (matched with /\p{Han}+/u).
I also have all the lists in separate files (*_full.txt for the full lists after filtering out the URLs, and the other one is a shortened version with the criteria above). Shall I upload them somewhere as a backup?
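A minimal sketch of the length criterion, assuming the limit is 7 Chinese characters (the number mentioned with the 709 crackdown exception in this thread). The bare `\p{Han}` form is PCRE/Ruby-style; in JavaScript the same script test is written `\p{Script=Han}` with the `u` flag. The constant and function names are hypothetical.

```javascript
// Assumed limit: entries with more than 7 Chinese characters are dropped
// from the shortened list. The constant name is hypothetical.
const MAX_HAN_CHARS = 7;

// Count the Han-script characters in an entry. JavaScript needs the
// explicit Script=Han property name and the "u" flag.
function countHan(entry) {
  const matches = entry.match(/\p{Script=Han}/gu);
  return matches ? matches.length : 0;
}

// Build the shortened list: keep entries at or under the limit.
function shorten(entries, maxHan = MAX_HAN_CHARS) {
  return entries.filter((entry) => countHan(entry) <= maxHan);
}
```

With this rule the long example from earlier in the thread ("六部口追杀学生撤离队伍的三辆坦克", 16 Chinese characters) would be left out of the shortened list, while entries with no Chinese characters at all are always kept.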
One exception is the keywords from the 709 crackdown: there are many long keyword combinations in that folder, so I kept those entries even when they have more than 7 Chinese characters.
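The exception could be expressed as a per-source exemption, roughly as below. This is only a sketch: the `source` label, the `'709-crackdown'` value, and the entry shape are hypothetical, since the actual files are organized by folder rather than tagged like this.

```javascript
// Hypothetical source labels that bypass the length limit.
const EXEMPT_SOURCES = new Set(['709-crackdown']);

// Count the Han-script characters in a keyword.
function countHan(text) {
  const matches = text.match(/\p{Script=Han}/gu);
  return matches ? matches.length : 0;
}

// Keep an entry if its source is exempt, or if it has at most maxHan
// Chinese characters (7 is the limit mentioned in this thread).
function keepEntry({ text, source }, maxHan = 7) {
  return EXEMPT_SOURCES.has(source) || countHan(text) <= maxHan;
}
```

For instance, a placeholder nine-character keyword such as '一二三四五六七八九' would be kept when tagged with the exempt source and dropped otherwise.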
the list backup is here: https://github.com/dhowe/ChinaData
Updated FAQ: https://github.com/dhowe/ChinaEye/wiki#how-does-chinaeyes-search-keyword-testing-work, please check
For example: https://citizenlab.org/2017/04/we-cant-chat-709-crackdown-discussions-blocked-on-weibo-and-wechat/
@jasonqng's list has not been updated in quite some time