dotSlashLu / nodescws

scws (Simple Chinese Word Segmentation) node.js binding - a node.js module for Chinese word segmentation with scws

nodescws

scws

About

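scws stands for Simple Chinese Word Segmentation. It is a mechanical Chinese word segmentation engine written in C and based on a word-frequency dictionary. scws was written by hightman and is released under the BSD license. The author of nodescws added features on top of libscws (including stop words, punctuation ignoring, and JSON-format configuration) and added the node.js binding. Apart from that code, the nodescws author holds no copyright on libscws.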

scws homepage: http://www.xunsearch.com/scws, GitHub: https://github.com/hightman/scws


nodescws

Current release: v0.5.1

Install

npm install scws

Usage

var Scws = require("scws");
var scws = new Scws(settings); // NOTE: before v0.5.0, do new Scws.init(settings)
var results = scws.segment(text);
scws.destroy(); // DO NOT forget this or your memory may be corrupted

new Scws(settings)

Note: before v0.5.0, initialize with new Scws.init(settings) instead.

scws.segment(text)

Returns an Array:

[
    {
        word: '可读性',
        offset: 183, // position of the word in the document
        length: 9, // in bytes
        attr: 'n', // part-of-speech tag, following the standard 《现代汉语语料库加工规范——词语切分与词性标注》; for the meaning of each tag see http://blog.csdn.net/dbigbear/article/details/1488087
        idf: 7.800000190734863
    },
    ...
]
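
As a rough sketch of how the returned array might be consumed, the snippet below keeps only noun-like entries and ranks them by idf. The helper name topNouns and the sample data are hypothetical. Only the field names (word, attr, idf) come from the result format above, and the rule that noun tags start with "n" is an assumption about the tag set.

```javascript
// Hypothetical helper: not part of the scws module.
// Takes the array returned by scws.segment(text) and returns up to
// `limit` distinct noun-like words, highest idf first.
function topNouns(results, limit) {
  var seen = Object.create(null);
  return results
    .filter(function (item) {
      // Assumption: noun-like tags start with "n" (e.g. "n", "nr", "ns").
      return typeof item.attr === "string" && item.attr.charAt(0) === "n";
    })
    .filter(function (item) {
      // Drop repeated occurrences of the same word.
      if (seen[item.word]) return false;
      seen[item.word] = true;
      return true;
    })
    .sort(function (a, b) { return b.idf - a.idf; })
    .slice(0, limit)
    .map(function (item) { return item.word; });
}

// Made-up sample in the documented result shape.
var sample = [
  { word: "可读性", offset: 0, length: 9, attr: "n", idf: 7.8 },
  { word: "提高", offset: 9, length: 6, attr: "v", idf: 4.2 },
  { word: "德国", offset: 15, length: 6, attr: "ns", idf: 5.1 },
  { word: "可读性", offset: 21, length: 9, attr: "n", idf: 7.8 }
];

console.log(topNouns(sample, 2));
```

In real use, the sample array would be replaced by the value returned from scws.segment(text).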

Example

var fs   = require("fs"),
    Scws = require("scws");

fs.readFile("./test_doc.txt", {
  encoding: "utf8"
}, function(err, data){
  if (err)
    return console.error(err);

  // initialize scws with config entries
  var scws = new Scws({
    charset: "utf8",
    //dicts: "./dicts/dict.utf8.xdb:./dicts/dict_cht.utf8.xdb:./dicts/dict.test.txt",
    dicts: "./dicts/dict.utf8.xdb",
    rule: "./rules/rules.utf8.ini",
    ignorePunct: true,
    multi: "duality",
    debug: true
  });

  // segment text
  var res = scws.segment(data);
  var res1 = scws.segment("大家好我来自德国,我是德国人");

  console.log(res, res1);

  // destroy scws to release its memory
  scws.destroy();
})

For more examples, see the tests in test/.

Changelog

v0.5.1

v0.5.0

v0.2.4

v0.2.3

v0.2.2

v0.2.1

You can add your own stop words under the [nostats] entry in the rule file. Turn off the stop-word feature by setting applyStopWord to false.
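
For illustration, a settings object that disables stop-word filtering might look like the sketch below. The file paths are copied from the example earlier in this README and are placeholders. The other keys are the setting entries mentioned elsewhere in this document.

```javascript
// Sketch of a settings object with the stop-word feature turned off.
// File paths are placeholders; adjust them to your own layout.
var settings = {
  charset: "utf8",
  dicts: "./dicts/dict.utf8.xdb",
  rule: "./rules/rules.utf8.ini", // custom stop words go under [nostats] in this file
  ignorePunct: true,
  applyStopWord: false // keep stop words in the segmentation output
};

// Pass it to the constructor as usual:
// var scws = new Scws(settings);
```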

v0.2.0

New syntax to initialize scws: scws = new Scws(config); result = scws.segment(text); scws.destroy(). This makes it possible to reuse an scws instance, which improves performance considerably under repeated use (approximately 1/4 faster).
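
A minimal sketch of that reuse pattern follows: one instance is created, used for every text, and destroyed in a finally block so the cleanup is not skipped if segment() throws. The helper name segmentAll is hypothetical, and the constructor is passed in as a parameter (an illustrative choice, not something the module requires) so the helper can be tried without the native module installed.

```javascript
// Hypothetical helper: segments many texts with a single scws instance.
// `ScwsCtor` is expected to be the constructor returned by require("scws").
function segmentAll(ScwsCtor, settings, texts) {
  var scws = new ScwsCtor(settings); // initialize once
  try {
    return texts.map(function (text) {
      return scws.segment(text); // reuse the same instance for every text
    });
  } finally {
    scws.destroy(); // always release the instance, even on error
  }
}

// Usage with the real module (assumes it is installed and the paths exist):
// var Scws = require("scws");
// var all = segmentAll(Scws, { charset: "utf8", dicts: "./dicts/dict.utf8.xdb" },
//                      ["大家好我来自德国", "我是德国人"]);
```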

Added a new setting entry, debug. Setting config.debug = true makes scws write its log, error, and warning messages to stdout.

v0.1.3

Published to the npm registry. Usage: scws(text, settings). Available setting entries: charset, dicts, rule, ignorePunct, multi.
