ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.
48 stars 21 forks source link

Support a `classify_all` method #96

Open akprasad opened 6 months ago

akprasad commented 6 months ago

Problem

Chandas contains a classify method that returns a single result. But sometimes it is interesting to look at all possible results and let the user decide how to use them.

Proposed solution

Create a classify_all method that returns all padyas (vrttas + jatis) that don't have a None match.

MSSRPRAD commented 6 months ago

How about changing the classify method entirely into a classify_vrtta method that returns a single result (which now also has a quality parameter). We can have an analogous classify_jati method for Jati metres.

pub fn classify_vrtta(&self, text: impl AsRef<str>) -> MatchResult {}
pub fn classify_jati(&self, text: impl AsRef<str>) -> MatchResult {}

classify_all can then just return the best n results of Partial Matches by updating them in some data structure while iterating:

pub fn classify_all(&self, text: impl AsRef<str>, best_n: usize) -> Vec<MatchResult> {}
pub struct MatchResult {
    matched: Option<Matched>,
    match_type: Vec<MatchType>,
    aksharas: Vec<Vec<Akshara>>,
}

pub enum Matched {
    Vrtta(Vrtta),
    Jati(Jati),
}

pub enum MatchType {
    Partial(usize), // Representing Quality of match
    Prefix,
    Pada,
    Full,
}

@akprasad

akprasad commented 6 months ago

Thanks for the suggestions!

Having both classify_vrtta and classify_jati feels too complex -- an end user won't necessarily know whether the meter is a vrtta or jAti, and we should make their life easier with a simple classify endpoint so that they don't have to make two calls.

best_n is a nice idea but puts us in the position of having to decide what "best" means, but this seems dependent on the use case. classify_all without best_n is less sophisticated, but the return data is also more obvious and deterministic.

I do like the idea of putting usize and other fields on MatchType, e.g. Prefix(usize) to store the length of the prefix match. Partial is unclear to me. Perhaps MatchType should be split into a Scope (prefix, pada, full) and a Type (exact, EditDistance(usize), ...)

MSSRPRAD commented 6 months ago

Having both classify_vrtta and classify_jati feels too complex -- an end user won't necessarily know whether the meter is a vrtta or jAti, and we should make their life easier with a simple classify endpoint so that they don't have to make two calls.

Can't classify_all to be that endpoint? best_n can be defaulted to 1 (and first the Jati match is checked before vrtta match)?

best_n is a nice idea but puts us in the position of having to decide what "best" means, but this seems dependent on the use case. classify_all without best_n is less sophisticated, but the return data is also more obvious and deterministic.

I felt when the database consists of several hundreds of schemes, it would be apt to ask for the best 5 or so than a match against every single scheme in the data. (If we're providing % matches it might match against many meters)

akprasad commented 6 months ago

best_n can be defaulted to 1

Sadly Rust doesn't support default arguments, so there's no way to both have a default value and reduce friction for the user. (There are workarounds like having an Option<usize>, but then the caller needs to pass None explicitly.)

when the database consists of several hundreds of schemes

Good point!

So perhaps:

pub fn classify(&self, text: impl AsRef<str>) -> Matches {
  self.classify_all(text.as_ref(), 1)
}

pub fn classify_all(&self, text: impl AsRef<str>, n: usize) -> Matches { ... }

pub struct Match {
  // TODO: vrtta/jati, match type, etc.
}

pub struct Matches {
    matches: Vec<Match>,
    aksharas: Vec<Vec<Akshara>>,
}
MSSRPRAD commented 6 months ago

pub fn classify(&self, text: impl AsRef) -> Matches { self.classify_all(text.as_ref(), 1) }

What kind of response struct should be returned? If classify is just returning one response, should it be the one giving Pada match / Prefix Match / None(edit_distance or any other metric) in case there is not Full match?

akprasad commented 6 months ago

How about something like:

pub enum Padya {
  Vrtta(Vrtta),
  Jati(Jati),
}

// I don't like this name
pub enum MatchStrategy {
  Exact,
  EditDistance(usize)
}

pub enum MatchScope {
  None,
  Prefix,
  Pada,
  Full,
}

pub struct Match {
  pada: Padya,
  strategy: MatchStrategy,
  scope: MatchScope,
}

pub struct Matches {
  matches: Vec<Match>,
  aksharas: Vec<Vec<Akshara>>,
}

Then both methods can return a Matches. classify can return a Matches where the matches field has length at most 1.