WorksApplications / sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy
Apache License 2.0

Public Rust API #28

Open eiennohito opened 2 years ago

eiennohito commented 2 years ago

We want to design the public API so that Sudachi can be used as in the following sketch. The syntax may not be strictly valid, and all names are open for discussion.

let model = JapaneseModel::from_cfg("...")?;
let mut analyzer = model.new_analyzer();

for line in data {
  // analyze() yields the sentences found in one line of input
  for sentence in analyzer.analyze(line)? {
    for token in sentence {
      println!("{}", token.surface);
    }
  }
}

Key points of API

Because of the Python API and lifetime considerations, Model should be a thin wrapper around Arc<RealModel> or something similar.
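
A minimal sketch of what such a wrapper could look like. The field names, the `new` constructor, and `handle_count` are assumptions made for illustration, not the actual sudachi.rs types. The point is only that the analyzer owns a reference-counted handle, so it carries no lifetime parameter and can outlive the `JapaneseModel` value it was created from.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the heavy, immutable data (dictionary, config).
struct RealModel {
    name: String,
}

/// Thin public handle: cloning it only bumps a reference count.
#[derive(Clone)]
struct JapaneseModel {
    inner: Arc<RealModel>,
}

/// Analyzer that shares ownership of the model, so it needs no lifetime parameter.
struct Analyzer {
    model: Arc<RealModel>,
}

impl JapaneseModel {
    /// Hypothetical constructor; the issue proposes `from_cfg` instead.
    fn new(name: &str) -> Self {
        JapaneseModel {
            inner: Arc::new(RealModel { name: name.to_owned() }),
        }
    }

    fn new_analyzer(&self) -> Analyzer {
        Analyzer { model: Arc::clone(&self.inner) }
    }

    /// Number of live handles to the shared data (for demonstration only).
    fn handle_count(&self) -> usize {
        Arc::strong_count(&self.inner)
    }
}

impl Analyzer {
    fn model_name(&self) -> &str {
        &self.model.name
    }
}

fn main() {
    let model = JapaneseModel::new("demo");
    println!("handles before: {}", model.handle_count());
    let analyzer = model.new_analyzer();
    drop(model); // the analyzer keeps the shared data alive
    println!("{}", analyzer.model_name());
}
```

This is the shape a Python binding needs, since a Python object cannot hold a Rust borrow.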

Layering

We have a Rust API and a Python API with different lifetime considerations. The Rust API should use lifetimes to safeguard against misuse and should mostly share data through references. Python, on the other hand, cannot use Rust lifetimes and should mostly share data through Arc.

The design proposal here is to have pointer-generic internals, with thin wrappers for the API types that exist mostly to instantiate concrete types.
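
One way to read "pointer-generic" is an implementation type parameterized over anything that dereferences to the shared data, as sketched below. All names here (`Dictionary`, `AnalyzerImpl`, the two aliases, `sample_dictionary`) are hypothetical; the sketch only shows that a single implementation can back both a borrowing wrapper and an `Arc`-based one.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Hypothetical dictionary data shared by all analyzers.
struct Dictionary {
    words: Vec<String>,
}

/// Internal implementation, generic over how the dictionary is held.
struct AnalyzerImpl<D: Deref<Target = Dictionary>> {
    dict: D,
}

impl<D: Deref<Target = Dictionary>> AnalyzerImpl<D> {
    fn contains(&self, word: &str) -> bool {
        // Explicit deref: works the same for `&Dictionary` and `Arc<Dictionary>`.
        self.dict.deref().words.iter().any(|w| w == word)
    }
}

/// Rust-facing instantiation: borrows the dictionary, checked by lifetimes.
type BorrowedAnalyzer<'a> = AnalyzerImpl<&'a Dictionary>;

/// Python-facing instantiation: shares ownership, no lifetime parameter.
type SharedAnalyzer = AnalyzerImpl<Arc<Dictionary>>;

fn sample_dictionary() -> Dictionary {
    Dictionary {
        words: vec!["東京".to_owned(), "京都".to_owned()],
    }
}

fn main() {
    let dict = sample_dictionary();
    let borrowed: BorrowedAnalyzer<'_> = AnalyzerImpl { dict: &dict };
    println!("borrowed: {}", borrowed.contains("東京"));

    let shared: SharedAnalyzer = AnalyzerImpl { dict: Arc::new(sample_dictionary()) };
    println!("shared: {}", shared.contains("大阪"));
}
```

With this layout the public wrapper types would do little more than pick the pointer type.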

API Surface (Types)

eiennohito commented 2 years ago

Names and semantics should be as close to the Java version as possible. (comment from Takaoka-san)

eiennohito commented 2 years ago

TL;DR

A nice API for multiple sentences is currently blocked in stable Rust by the in-progress GATs (generic associated types) feature. See also http://lukaskalbertodt.github.io/2018/08/03/solving-the-generalized-streaming-iterator-problem-without-gats.html.
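
To illustrate the limitation: `Iterator::next` cannot return an item that borrows from the iterator itself, so a result buffer reused across sentences cannot be exposed through a plain `for` loop. The sketch below shows the usual workaround, an inherent method driven by `while let`. `SentenceStream` and `next_sentence` are hypothetical names, and splitting on `。` and on whitespace is a toy stand-in for real sentence splitting and morphological analysis.

```rust
/// Hypothetical stream that reuses one token buffer for every sentence.
struct SentenceStream<'t> {
    rest: &'t str,
    tokens: Vec<&'t str>, // cleared and refilled on each call
}

impl<'t> SentenceStream<'t> {
    fn new(text: &'t str) -> Self {
        SentenceStream { rest: text, tokens: Vec::new() }
    }

    /// The returned slice borrows from `self`; `Iterator::next` cannot
    /// express this signature without GATs.
    fn next_sentence(&mut self) -> Option<&[&'t str]> {
        let rest: &'t str = self.rest;
        let rest = rest.trim_start();
        if rest.is_empty() {
            return None;
        }
        // Toy sentence boundary: the ideographic full stop.
        let (sentence, tail) = match rest.find('。') {
            Some(pos) => rest.split_at(pos + '。'.len_utf8()),
            None => (rest, ""),
        };
        self.rest = tail;
        self.tokens.clear();
        // Toy tokenization: whitespace-separated words.
        self.tokens
            .extend(sentence.trim_end_matches('。').split_whitespace());
        Some(&self.tokens)
    }
}

fn main() {
    let mut stream = SentenceStream::new("今日 は 晴れ。明日 は 雨。");
    // `for` does not work here, so the loop is written with `while let`.
    while let Some(tokens) = stream.next_sentence() {
        println!("{}", tokens.join("|"));
    }
}
```

The borrow checker rejects holding two sentences at once, which is exactly the misuse the reused buffer needs to prevent.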

Scratch prototype impl

Want to have:

Problems:

What to do

eiennohito commented 2 years ago

Splitting API into sentence splitter / analysis

for sentence in analyzer.split_sentences(line)? {
  let result = analyzer.analyze_sentence(sentence)?;
  for token in result.tokens() {
    // process token
  }
}

eiennohito commented 2 years ago

Morpheme's part_of_speech should not return an Option of a POS array; it should panic when given an invalid POS id instead.
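
A small sketch of the two signatures being compared. `Grammar`, `try_pos`, `pos`, and `sample_grammar` are hypothetical names, not the actual sudachi.rs API. The reasoning is that POS ids come from the dictionary itself, so an invalid id indicates corrupted data or a bug rather than a condition every caller should have to handle.

```rust
/// Hypothetical POS table: each entry is the list of POS components for one id.
struct Grammar {
    pos_list: Vec<Vec<String>>,
}

impl Grammar {
    /// Fallible lookup: forces every caller to handle `None`.
    fn try_pos(&self, id: u16) -> Option<&[String]> {
        self.pos_list.get(id as usize).map(|p| p.as_slice())
    }

    /// Proposed behavior: return the POS directly and panic on an invalid id.
    fn pos(&self, id: u16) -> &[String] {
        self.try_pos(id)
            .unwrap_or_else(|| panic!("invalid POS id: {}", id))
    }
}

fn sample_grammar() -> Grammar {
    Grammar {
        pos_list: vec![
            vec!["名詞".to_owned(), "普通名詞".to_owned()],
            vec!["動詞".to_owned(), "一般".to_owned()],
        ],
    }
}

fn main() {
    let grammar = sample_grammar();
    println!("{}", grammar.pos(1).join(","));
    // grammar.pos(9) would panic, because id 9 is not in the table.
}
```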