Question - Githubissues

Firstly, this library looks great!

Secondly, I'm experimenting with some ML for a little side project of mine to help me learn, and I'm wondering what I could achieve with this library. Could you point me in the right direction, as you seem extremely knowledgeable on the subject? =]

[
  'Where could i get a hot drink in Manchester?',
  'whats the best coffee shop in manchestr',
  'coolest cafe in Manc?'
]

So for my examples you can see the various ways (spelling mistakes intentional) of asking about a Cafe in a named location.

As you can see, there are lots of ways X activity can be asked about. At most I'd have, say, 6 different types of activities (e.g. going to a cafe, a park, visiting a gym), so the options there are limited, but there are lots of ways of describing them.

The second part of the query is a location, sometimes an abbreviated one. I have a list of all the possible locations and their abbreviations/aliases. How could I extract these, as well as handle spelling mistakes/typos, given that multiple locations could be mentioned in the same sentence (at most 3 - 4 locations)?

The intent classifier seemed like a good start and basic tests seemed to work, but I'm unsure how to handle the issues above, i.e. spelling (string distance perhaps?) and named locations.

Any pointers, examples etc. would be appreciated.

Thanks, Mike

Hi Mike,

Your question is not trivial. I did some research on this in the past. Now, Vasily (CCed) is continuing the research. Here is what I remember.

A. Regarding the spelling mistakes, it is possible to use a speller, such as wordsworth (https://github.com/mrmarbles/wordsworth). To connect it to limdu, you can use a normalizer (https://github.com/erelsgl/limdu#input-normalization). The normalizer is a function that takes a word and returns the same word with spelling mistakes corrected, according to the training set. In my research, using a speller had little effect on the performance, so I dropped it. The reason is probably that you need to collect a lot of training data anyway, and after you collect e.g. 1000 training sentences, the most common spelling mistakes are already in the training set. But you can try and see whether it works for you.

B. Regarding named locations, this is more complicated. The two main approaches that I know of are:

1. Rule-based: use manually-written rules to detect the locations. Replace them with a common term, such as "LOCATION". Then, use the classifier to detect the intent. Then, put the location back into the intent. This may work for small problems, but it is not very scalable.
2. Sequence classification, which is the more common approach: instead of classifying each sentence, you classify each word or sequence of words. It works well in larger applications, but requires a lot of training effort.

You can also read about "information retrieval" or "information extraction". There are several methods and one of them may fit your needs.
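
The rule-based approach could be sketched roughly as follows, given that you already have a list of locations and aliases. This is only an illustration under stated assumptions: `LOCATION_ALIASES` and `extractLocations` are hypothetical names, the alias table is made up, only single-word locations are handled, and one typo is tolerated only for names of 6 letters or more so that short aliases such as "manc" must match exactly. No limdu API is used here.

```javascript
// Hypothetical alias table: canonical location -> known names and abbreviations.
const LOCATION_ALIASES = {
  Manchester: ["manchester", "manc"],
  Liverpool: ["liverpool", "lpool"],
};

// Levenshtein (edit) distance, used to tolerate small typos in location names.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Replace every detected location with "LOCATION" and collect the canonical
// names, so the classifier only sees the location-free template.
function extractLocations(sentence, aliases = LOCATION_ALIASES) {
  const locations = [];
  const tokens = sentence
    .toLowerCase()
    .replace(/[^a-z\s]/g, " ") // drop punctuation and digits
    .split(/\s+/)
    .filter(Boolean);
  const template = tokens.map((token) => {
    for (const [canonical, names] of Object.entries(aliases)) {
      for (const name of names) {
        const allowedTypos = name.length >= 6 ? 1 : 0;
        if (levenshtein(token, name) <= allowedTypos) {
          if (!locations.includes(canonical)) locations.push(canonical);
          return "LOCATION";
        }
      }
    }
    return token;
  });
  return { template: template.join(" "), locations };
}

console.log(extractLocations("whats the best coffee shop in manchestr"));
console.log(extractLocations("coolest cafe in Manc or lpool?"));
```

The `template` string is what would be passed to the intent classifier, and the `locations` array is what would be attached to the detected intent afterwards. A sentence mentioning several locations simply yields several entries in `locations`.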

-- Erel

On Sun, Nov 15, 2015 at 12:40 AM, Michael Diarmid notifications@github.com wrote:

— Reply to this email directly or view it on GitHub https://github.com/erelsgl/limdu/issues/33.

erelsgl / limdu

Question #33