abitdodgy / gibran

Gibran is an Elixir natural language processor, and a port of WordsCounted.
http://hexdocs.pm/gibran

Soundex #10

Closed GeoffreyPS closed 7 years ago

GeoffreyPS commented 8 years ago

@abitdodgy See below for an implementation of Soundex. Let me know if I'm missing any test coverage you'd like or if there's anything else in conflict with the project.

Limitations / caveats:

At this point, the Soundex implementation here handles diacritical marks ("ñ" becomes "n"), but not special characters that expand into several letters, such as converting Eszett ("ß") into two S characters. I think we would have to include another library like Codepagex, or an interface to iconv, to handle those characters.
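To make the caveat concrete, here is a minimal sketch of American Soundex with NFD-based diacritic stripping. It is not the code from this pull request: the module name `SoundexSketch` and the function `encode/1` are made up for illustration, and it assumes Elixir 1.3 or later (for `String.to_charlist/1` and `String.pad_trailing/3`).

```elixir
defmodule SoundexSketch do
  # Hypothetical module for illustration; not the PR's implementation.
  @codes %{
    ?b => ?1, ?f => ?1, ?p => ?1, ?v => ?1,
    ?c => ?2, ?g => ?2, ?j => ?2, ?k => ?2,
    ?q => ?2, ?s => ?2, ?x => ?2, ?z => ?2,
    ?d => ?3, ?t => ?3,
    ?l => ?4,
    ?m => ?5, ?n => ?5,
    ?r => ?6
  }

  def encode(word) when is_binary(word) do
    letters =
      word
      # NFD splits "ñ" into "n" plus a combining tilde.
      |> String.normalize(:nfd)
      |> String.downcase()
      |> String.to_charlist()
      # Keep only a-z. This drops combining marks, and also drops "ß".
      |> Enum.filter(&(&1 in ?a..?z))

    case letters do
      [] ->
        ""

      [first | rest] ->
        {digits, _last} =
          Enum.reduce(rest, {[], Map.get(@codes, first)}, fn char, {acc, last} ->
            code = Map.get(@codes, char)

            cond do
              # "h" and "w" are skipped and do not separate equal codes.
              char in [?h, ?w] -> {acc, last}
              # Vowels and "y" are skipped but do separate equal codes.
              code == nil -> {acc, nil}
              # Adjacent letters with the same code are coded once.
              code == last -> {acc, last}
              true -> {[code | acc], code}
            end
          end)

        body = digits |> Enum.reverse() |> Enum.take(3) |> List.to_string()
        String.pad_trailing(String.upcase(<<first>>) <> body, 4, "0")
    end
  end
end
```

With this sketch, `SoundexSketch.encode("Peña")` and `SoundexSketch.encode("Pena")` both give `"P500"`, while `SoundexSketch.encode("Straße")` gives `"S360"` and `SoundexSketch.encode("Strasse")` gives `"S362"`, because "ß" survives normalization unchanged and is then filtered out rather than expanded to "ss".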

GeoffreyPS commented 8 years ago

I see the build failed, and I pinpointed the cause: the new Soundex module uses String.normalize/2, which was added to Elixir in version 1.2. Travis tests against versions 1.0.5 and 1.1.0, where this function doesn't exist.

Would it be appropriate to add version 1.2 (or 1.3) to Travis? This would also require bumping the Elixir requirement in mix.exs from ~> 1.0 to ~> 1.2.
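As a quick check of what the bumped requirement would accept, the snippet below uses `Version.match?/2` from the standard library. Mix writes this kind of requirement as `~> 1.2`; the list of versions is just the ones mentioned in this thread, and the snippet is only an illustration, not part of the project's mix.exs.

```elixir
# In mix.exs the requirement would read:  elixir: "~> 1.2"
# "~> 1.2" means ">= 1.2.0 and < 2.0.0".
requirement = "~> 1.2"

results =
  for version <- ["1.0.5", "1.1.0", "1.2.0", "1.3.0"] do
    {version, Version.match?(version, requirement)}
  end

# Shows which of the listed Elixir versions satisfy the requirement.
IO.inspect(results)
```

The two versions Travis currently tests (1.0.5 and 1.1.0) do not satisfy `~> 1.2`, so they would need to be replaced in the build matrix rather than kept alongside the new ones.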

abitdodgy commented 8 years ago

@GeoffreyPS hi there, any idea why travis is failing?

abitdodgy commented 8 years ago

@GeoffreyPS sorry about that, I just saw your comment. Yes, it's a good idea to upgrade the versions on Travis. I only need to see about #8 first...

abitdodgy commented 8 years ago

@GeoffreyPS and thanks for the pull request. 👍🏻

GeoffreyPS commented 8 years ago

Understood! That makes sense with HashDict vs. Map.

I might be able to work on it this weekend.

Any chance you're at ElixirConf this week?

Cheers, Geoff

abitdodgy commented 8 years ago

@GeoffreyPS I wish. I'm currently deep into the inner workings of corporate machinery, and I'm not doing much coding. More architecture and management work. Unfortunately, no time for conferences at the moment. I'm getting all nostalgic just writing this. 🤔😄

GeoffreyPS commented 8 years ago

Ah! I'm sorry to hear that. José's keynote was about the new GenStage + Flow, which pretty much enables concurrent (but not yet distributed) parallel MapReduce in an OTP-compliant way, without overloading your machine. The example he gave to help people understand the concept was tokenizing and frequency counting about 2 GB of text. It might be a good route to go if we want to scale this tokenizer to handle a large corpus quickly.

I'll look more into it since it's still experimental but nearly ready for release.
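To illustrate the map/reduce shape of that example without depending on GenStage or Flow, here is a rough sketch that swaps in `Task.async_stream/3` from the standard library (available from Elixir 1.4). The module `ParallelCountSketch`, its functions, and the splitting regex are all assumptions made for this sketch; it is neither Gibran's tokenizer nor the Flow API.

```elixir
defmodule ParallelCountSketch do
  # Hypothetical module; uses Task.async_stream/3 instead of Flow.

  # "Map" step: count the tokens in one chunk of text.
  # The regex is a simplification: split on anything that is not
  # a letter or an apostrophe.
  def count_chunk(chunk) do
    chunk
    |> String.downcase()
    |> String.split(~r/[^\p{L}']+/u, trim: true)
    |> Enum.reduce(%{}, fn token, acc ->
      Map.update(acc, token, 1, &(&1 + 1))
    end)
  end

  # "Reduce" step: count chunks concurrently, then merge the partial maps.
  # By default async_stream runs at most one task per scheduler.
  def count(chunks) do
    chunks
    |> Task.async_stream(&count_chunk/1)
    |> Enum.reduce(%{}, fn {:ok, partial}, acc ->
      Map.merge(acc, partial, fn _token, a, b -> a + b end)
    end)
  end
end
```

For example, `ParallelCountSketch.count(["the cat sat", "The cat ran"])` returns a map with `"the"` and `"cat"` counted twice and `"sat"` and `"ran"` once. Unlike Flow, this merges all partial results in a single process, so it only shows the idea and would not scale the same way on a large corpus.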

I'm going to look into updating HashDict to Maps this weekend for the Tokenizer. I'll send a PR if I can get it worked out.
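For reference, the HashDict-to-Map change is mostly mechanical for a frequency table. The sketch below is an assumption about what such code might look like, with a made-up module name `FrequencySketch`; it is not taken from Gibran's Tokenizer.

```elixir
defmodule FrequencySketch do
  # Hypothetical module showing the Map-based form.
  #
  # HashDict-era equivalent, shown only as a comment:
  #   Enum.reduce(tokens, HashDict.new, fn token, acc ->
  #     HashDict.update(acc, token, 1, &(&1 + 1))
  #   end)

  # Build a token => count map from a list of tokens.
  def frequencies(tokens) do
    Enum.reduce(tokens, %{}, fn token, acc ->
      Map.update(acc, token, 1, &(&1 + 1))
    end)
  end

  # Look up a single count, defaulting to 0 for unseen tokens.
  def count_of(frequencies, token), do: Map.get(frequencies, token, 0)
end
```

The main things to watch for in a real migration are places that pattern match on or compare the dictionary, since a map literal such as `%{"a" => 1}` can be compared directly while a HashDict could not.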

Cheers,

Geoff Smith

abitdodgy commented 8 years ago

@GeoffreyPS thanks for the briefing on ElixirConf. I'm getting a bit restless since I haven't gotten my hands dirty with code for three months.

GeoffreyPS commented 8 years ago

Thanks for the follow up. I hope things improve for you on your end.