ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Add witness collapsing transformation #39

Closed by ajenhl 9 years ago

ajenhl commented 9 years ago

Witness data tends to add a lot of duplication to results, because most or all witnesses to a text often have the same count for a given n-gram. Add a command to tacl-helper (not tacl report, because this changes the format of the results) that collapses such 'duplicate' rows, i.e. rows that are identical in every field except siglum, into a single row. The collapsed results drop the siglum column and gain a sigla column that contains a space-separated list of the sigla sharing that count. For example:

ngram,size,text name,siglum,count,label
一切常,3,T0006,base,1,P
一切常,3,T0006,大,1,P
一切常,3,T0006,元,2,P
一切常,3,T0006,宋,2,P

is changed to:

ngram,size,text name,sigla,count,label
一切常,3,T0006,base 大,1,P
一切常,3,T0006,元 宋,2,P
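
The transformation above could be sketched roughly as follows. This is only an illustration using the standard library `csv` module, not tacl's actual implementation; the function name `collapse_witnesses` and the constants `SAMPLE` and `OUTPUT_FIELDS` are hypothetical, and the sketch assumes the input has exactly the six columns shown in the example.

```python
import csv
import io
import sys

# Sample results taken from the example in this issue.
SAMPLE = """\
ngram,size,text name,siglum,count,label
一切常,3,T0006,base,1,P
一切常,3,T0006,大,1,P
一切常,3,T0006,元,2,P
一切常,3,T0006,宋,2,P
"""

# Column order of the collapsed results: "siglum" is replaced by "sigla".
OUTPUT_FIELDS = ["ngram", "size", "text name", "sigla", "count", "label"]


def collapse_witnesses(rows):
    """Merge rows that differ only in siglum (hypothetical helper).

    `rows` is an iterable of dicts keyed by the input column names.
    Returns a list of dicts keyed by OUTPUT_FIELDS, in order of first
    appearance, with sigla joined by single spaces in input order.
    """
    groups = {}
    for row in rows:
        # Everything except the siglum identifies a group.
        key = (row["ngram"], row["size"], row["text name"],
               row["count"], row["label"])
        groups.setdefault(key, []).append(row["siglum"])

    collapsed = []
    for (ngram, size, text_name, count, label), sigla in groups.items():
        collapsed.append({
            "ngram": ngram,
            "size": size,
            "text name": text_name,
            "sigla": " ".join(sigla),
            "count": count,
            "label": label,
        })
    return collapsed


# Usage: read the sample, collapse it, and write CSV to standard output.
reader = csv.DictReader(io.StringIO(SAMPLE))
writer = csv.DictWriter(sys.stdout, fieldnames=OUTPUT_FIELDS,
                        lineterminator="\n")
writer.writeheader()
writer.writerows(collapse_witnesses(reader))
```

Grouping on every field except the siglum means that witnesses with different counts for the same n-gram stay on separate rows, as in the example. A space works as the separator here on the assumption that no siglum itself contains a space.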