
purewords


PureWords is a package for cleaning raw text in any language.

Install

pip install purewords

Usage

Module usage:

  import purewords

  # raw sentence
  inputs = "ha hi!!hello I\'m at http:www.google.com.tw\n\n" 
           + "you know yahoo? my_computer is great. My phone number"
           + "is 02-3366-5678. <br>的啦<br> my password: 123-abc$99&^%Y)\'_\'(Y "

Treat inputs as a sentence and clean it.

Word tokens in the result are separated by whitespace.

  # result: string
  purewords.clean_sentence(inputs)
  'ha hi hello i am at _url_ you know yahoo my computer is great my phone number is _phone_ 的 my password _num_ abc _num_ y y'

Treat inputs as a document and clean it.

The document is split on reliable sentence delimiters such as '.' or '?'.

  # result: list of cleaned string
  purewords.clean_document(inputs)
  ['ha hi', 'hello i am at _url_', 'you know yahoo', 'my computer is great', 'my phone number is _phone_', '的 my password _num_ abc _num_ y y']

Customize your purewords

You can use different settings in purewords.

  import purewords
  from purewords.tokenizer import YoctolTokenizer
  from purewords.filter_collection import document_filters
  from purewords.filter_collection import token_filters

  tokenizer = YoctolTokenizer()
  pw = purewords.PureWords(
      tokenizer=tokenizer, # select your tokenizer
      document_filters=document_filters, # select your document filters
      token_filters=token_filters, # select your token filters
      max_len=200, # cut long sentences whose length exceeds max_len
      min_len=1 # ignore sentences shorter than min_len
  )

  inputs = 'This is a sentence.'

  pw.clean_sentence(inputs)
  pw.clean_document(inputs)
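
For instance, a stricter min_len should make clean_document drop very short fragments, and a smaller max_len cuts unusually long ones. A minimal sketch reusing the objects above (whether length is counted in characters or tokens depends on the implementation):

  strict_pw = purewords.PureWords(
      tokenizer=tokenizer,
      document_filters=document_filters,
      token_filters=token_filters,
      max_len=50,  # cut sentences longer than 50
      min_len=5    # drop fragments shorter than 5
  )

  # very short fragments should now be missing from the result
  strict_pw.clean_document(inputs)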

Tokenizer

Select your tokenizer in purewords

You can select the WhitespaceTokenizer if you prefer to tokenize sentences by whitespace, or the JiebaTokenizer to use the default jieba settings.

Otherwise, the Yoctol jieba tokenizer (YoctolTokenizer) is used as the default.

  from purewords.tokenizer import WhitespaceTokenizer

  tokenizer = WhitespaceTokenizer()
  pw = purewords.PureWords(
      tokenizer=tokenizer
  )
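
A quick usage sketch (the exact output depends on the configured filters; note that whitespace tokenization does not segment Chinese text into words the way the jieba-based tokenizers do):

  # tokens are taken as-is between spaces; the Chinese part stays unsegmented
  pw.clean_sentence('hello 你好嗎 my_computer is great')
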
Add new words in JiebaTokenizer

You can add new words to the JiebaTokenizer to customize your tokenizer.

  from purewords.tokenizer import JiebaTokenizer

  tokenizer = JiebaTokenizer()
  tokenizer.add_word(new_word, freq, tag) # same arguments as jieba.add_word
  tokenizer.add_words(new_word_list, freq, tag) # add several words at once

  pw = purewords.PureWords(
      tokenizer=tokenizer
  )
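
For example, registering a multi-character term keeps jieba from splitting it during segmentation. A minimal sketch with hypothetical words (freq and tag follow jieba.add_word's semantics):

  tokenizer = JiebaTokenizer()
  tokenizer.add_word('聊天機器人', 100, 'n') # hypothetical new word
  tokenizer.add_words(['意圖分類', '語意理解'], 100, 'n') # hypothetical word list

  pw = purewords.PureWords(tokenizer=tokenizer)

  # the registered words should now survive as single tokens
  pw.clean_sentence('我們的聊天機器人支援意圖分類')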

Filter collection

You can customize your preprocessing pipeline in purewords.
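
Both document_filters and token_filters are passed to PureWords as shown above, so one way to customize the pipeline is to start from the bundled collections and remove the steps you do not want. A hedged sketch, assuming the collections behave like ordinary Python lists of filter objects (check the filter_collection module for the actual structure):

  import purewords
  from purewords.tokenizer import YoctolTokenizer
  from purewords.filter_collection import document_filters
  from purewords.filter_collection import token_filters

  # hypothetical customization: keep every document filter but drop the
  # last token filter
  my_document_filters = list(document_filters)
  my_token_filters = list(token_filters)[:-1]

  pw = purewords.PureWords(
      tokenizer=YoctolTokenizer(),
      document_filters=my_document_filters,
      token_filters=my_token_filters,
      max_len=200,
      min_len=1
  )

  pw.clean_sentence('This is a sentence.')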