Simmetrics / simmetrics

Similarity or Distance Metrics, e.g. Levenshtein, for Java
Apache License 2.0

How can I see the result of simplifiers/tokenizers on Strings rather than just the result? #26

Open ijabz opened 8 years ago

ijabz commented 8 years ago

So I may typically have:

StringMetric metric = with(new CosineSimilarity<String>())
        .simplify(Simplifiers.toLowerCase())
        .simplify(Simplifiers.removeDiacritics())
        .simplify(new SpecialReplacementsSimplifier())
        .tokenize(Tokenizers.whitespace())
        .build();

float result = metric.compare(s1, s2);

What I would like for debugging is an easy way to see the final step before the cosine similarity, i.e. the contents of the sets created by applying the simplifiers and then the tokenizer(s). Is this possible?

mpkorstanje commented 8 years ago

Sure. You can put a break point in CosineSimilarity.java at line 62.

Or, if you want to log what goes in, the builder relies on interfaces rather than concrete implementations, so you can wrap the metric in your own metric.

But I think you should write unit tests to validate that your SpecialReplacementsSimplifier works as it should, rather than rely on visual inspection.

MultisetMetric<String> loggingMetric = new MultisetMetric<String>() {

    final CosineSimilarity<String> cos = new CosineSimilarity<>();

    @Override
    public float compare(Multiset<String> a, Multiset<String> b) {
        System.out.println("CosineSimilarity [");
        System.out.println("a: " + a);
        System.out.println("b: " + b);
        System.out.println("]");
        return cos.compare(a,b);
    }
};

StringMetric metric = with(loggingMetric)
        .simplify(Simplifiers.toLowerCase())
        .simplify(Simplifiers.removeDiacritics())
        .simplify(new SpecialReplacementsSimplifier())
        .tokenize(Tokenizers.whitespace())
        .build();
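
The unit-test suggestion above can be sketched in plain Java without any test framework. This is only an illustration: `SimplifierCheck` and its replacement rules (turning "&" into "and", collapsing whitespace) are hypothetical stand-ins, since the thread does not show what SpecialReplacementsSimplifier actually does. A `UnaryOperator<String>` plays the role of a simplifier here; it is not the simmetrics interface.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a custom simplifier. The rules below are
// invented for illustration and are not taken from the thread.
final class SimplifierCheck {

    // Replaces "&" with "and", collapses runs of whitespace, trims the ends.
    static final UnaryOperator<String> SPECIAL_REPLACEMENTS =
            s -> s.replace("&", "and").replaceAll("\\s+", " ").trim();

    // Fails loudly when the simplifier output differs from the expectation.
    static void check(String input, String expected) {
        String actual = SPECIAL_REPLACEMENTS.apply(input);
        if (!expected.equals(actual)) {
            throw new AssertionError("for \"" + input + "\" expected \""
                    + expected + "\" but got \"" + actual + "\"");
        }
    }

    public static void main(String[] args) {
        check("Simon & Garfunkel", "Simon and Garfunkel");
        check("  a   b ", "a b");
        System.out.println("all simplifier checks passed");
    }
}
```

Testing each simplifier on its own keeps failures easy to attribute, which is harder when only the final similarity score is inspected.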

ijabz commented 7 years ago

Thanks, that works, but ideally I would like it to output the two original strings as well. Of course I can output these myself before making the compare call, but in a multithreaded system other calls may get interleaved. I wanted this to check my whole simmetrics stack; access to the tokenized sets (as you've shown me above) is needed to write unit tests anyway.

mpkorstanje commented 7 years ago

Then you shouldn't use the builder. Its design relies on being indifferent towards the individual components as long as they adhere to their interface.
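
A rough sketch of what composing the steps by hand might look like, so that the original strings, the token lists and the score travel together in one object. Everything here is an assumption for illustration: `TracedComparison`, `Trace`, `lowerCase()` and the hand-written cosine are plain-Java stand-ins, not simmetrics classes, and the handling of empty inputs is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical hand-rolled pipeline; none of these names come from simmetrics.
final class TracedComparison {

    // Holds everything one comparison produced, so it can be logged in one call.
    static final class Trace {
        final String a, b;
        final List<String> tokensA, tokensB;
        final float score;

        Trace(String a, String b, List<String> ta, List<String> tb, float score) {
            this.a = a; this.b = b; this.tokensA = ta; this.tokensB = tb; this.score = score;
        }

        @Override public String toString() {
            return "a=\"" + a + "\" -> " + tokensA
                    + ", b=\"" + b + "\" -> " + tokensB + ", score=" + score;
        }
    }

    // Example simplifier chain with a single lower-casing step.
    static List<UnaryOperator<String>> lowerCase() {
        List<UnaryOperator<String>> steps = new ArrayList<>();
        steps.add(s -> s.toLowerCase(Locale.ROOT));
        return steps;
    }

    // Applies the simplifiers in order, then splits on whitespace.
    static List<String> tokens(String s, List<UnaryOperator<String>> steps) {
        for (UnaryOperator<String> step : steps) s = step.apply(s);
        String t = s.trim();
        return t.isEmpty() ? new ArrayList<>() : Arrays.asList(t.split("\\s+"));
    }

    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : tokens) m.merge(t, 1, Integer::sum);
        return m;
    }

    // Cosine similarity over token counts; empty-input results are this sketch's choice.
    static float cosine(List<String> x, List<String> y) {
        Map<String, Integer> cx = counts(x), cy = counts(y);
        if (cx.isEmpty() && cy.isEmpty()) return 1.0f;
        if (cx.isEmpty() || cy.isEmpty()) return 0.0f;
        double dot = 0, nx = 0, ny = 0;
        for (Map.Entry<String, Integer> e : cx.entrySet()) {
            nx += e.getValue() * e.getValue();
            dot += e.getValue() * cy.getOrDefault(e.getKey(), 0);
        }
        for (int v : cy.values()) ny += v * v;
        return (float) (dot / Math.sqrt(nx * ny));
    }

    static Trace compare(String a, String b, List<UnaryOperator<String>> steps) {
        List<String> ta = tokens(a, steps);
        List<String> tb = tokens(b, steps);
        return new Trace(a, b, ta, tb, cosine(ta, tb));
    }

    public static void main(String[] args) {
        // One println per comparison keeps inputs, tokens and score in a single message.
        System.out.println(compare("Hello World", "hello there WORLD", lowerCase()));
    }
}
```

Because each `Trace` is built from local variables and emitted as a single string, the inputs and tokens of one comparison are far less likely to be split up by log lines from other threads than with several separate print calls.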

ijabz commented 7 years ago

If you say so, though it would seem quite useful to have a way of seeing the effects of a builder on some inputs without having to break down the individual steps.

mpkorstanje commented 5 years ago

What would you do with this information?