UglyToad / PragmaticSegmenterNet

Port of PragmaticSegmenter for sentence boundary detection

Porting to Python #3

Closed nipunsadvilkar closed 4 years ago

nipunsadvilkar commented 5 years ago

Hey @EliotJones glad that you ported it to C#.

I wish to port it to Python. I would love to know your approach in porting one library to another.

What steps did you take to understand the Ruby source code, and what challenges did you face?

Also, I would love to have your feedback on how I should go about porting it to Python.

Thanks, Nipun

EliotJones commented 5 years ago

Hi,

Thanks for your question.

My advice would be to go for the minimum possible thing you can put a test around first.

The main logic for the Ruby code doing the actual segmentation is in https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/processor.rb. You don't need every function call in there for the first pass, and you can also skip anything related to cleaning/formatting the input text.

The original Pragmatic Segmenter tests are very good and will definitely help you write the port. I'd focus on getting the first one to pass before going wide and converting absolutely everything. https://github.com/diasks2/pragmatic_segmenter/blob/master/spec/pragmatic_segmenter/languages/english_spec.rb
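To make "get the first test passing" concrete, here is a minimal Python sketch. The `segment` function is a deliberately naive placeholder that splits after sentence-ending punctuation, not the Pragmatic Segmenter algorithm, and the test case is written along the lines of the first English spec; both names are assumptions for illustration only.

```python
import re


def segment(text):
    """Naive first pass: split after '.', '!' or '?' when whitespace follows.

    This is a stand-in to get one test green, not the real algorithm.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def test_simple_period_to_end_sentence():
    # Modeled on the simplest kind of case in the original spec file.
    text = "Hello World. My name is Jonas."
    assert segment(text) == ["Hello World.", "My name is Jonas."]


test_simple_period_to_end_sentence()
```

Once a case like this passes, each further spec that fails points at the next piece of the Ruby logic worth porting.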

For porting the tests I actually wrote a small command line application which used regexes to port them since the Ruby <-> C# conversion just required replacing certain keywords with the C# equivalent. You might be able to do something similar with Python.

Also, only port a single language to start with. I chose English since it's the one I'm most familiar with, but most of the languages become copy and paste once you've implemented the first one.

For the actual writing process, my method was to copy the Ruby code a method at a time and then get it to compile in C# by writing over it. Focusing on one method at a time prevented the problem from being overwhelming and kept my need to learn bits of Ruby restricted to a certain scope.

I decided to make my segmenter port use entirely static/pure functions, which made it easier for me to understand. Once you've Googled a few things, Ruby is fairly easy to understand when coming from another language, though the unless statement still trips me up! I also ended up installing Ruby and using Visual Studio Code with the Ruby plugin to debug the places where the logic didn't make sense to me.
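For anyone else tripped up by it: Ruby's unless runs its body when the condition is false, so it maps onto "if not" in Python. A small sketch, written as a pure function in the spirit described above; `ensure_terminal_period` is a hypothetical helper made up for illustration, not something taken from the Ruby or C# code.

```python
def ensure_terminal_period(text):
    """Return text with a trailing period, leaving the input untouched.

    Ruby equivalent (modifier form):
        text += "." unless text.end_with?(".")
    """
    # 'unless cond' in Ruby reads as 'if not cond' in Python.
    if not text.endswith("."):
        return text + "."
    return text


print(ensure_terminal_period("Hello World"))
print(ensure_terminal_period("Hello World."))
```

Because the function only reads its argument and returns a new string, it can be tested in isolation, which is the main appeal of the static/pure approach.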

Since the majority of the segmenter is just the application of regexes, the amount of Ruby you actually need to understand is limited. I didn't know any Ruby when I started but just Googled the syntax where it wasn't immediately obvious. The other thing that helps now, I guess, is that you can also look at the C# code, which (should) do the same thing, or generate the same outcome, as the Ruby code.

I've not done any Python for a long time, so I'm probably not the best person for Python advice, but it looks like it has a fairly decent standard library for dealing with regexes, so porting should be possible.
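As an illustration of "mostly applying regexes", here is one way this could look in Python using only the standard re module. The `Rule` type, the `<prd>` placeholder and the tiny abbreviation list are all assumptions invented for this sketch; the real rule set is far larger and is not reproduced here.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    """A regex pattern paired with its replacement (hypothetical type)."""
    pattern: str
    replacement: str


def apply_rules(text, rules):
    """Apply each rule in order and return the transformed text."""
    for rule in rules:
        text = re.sub(rule.pattern, rule.replacement, text)
    return text


PLACEHOLDER = "<prd>"  # arbitrary marker standing in for a protected period

# Illustrative rules only: hide periods after a few titles, then restore them.
PROTECT_RULES = [Rule(r"\b(Mr|Mrs|Dr)\.", r"\1" + PLACEHOLDER)]
RESTORE_RULES = [Rule(re.escape(PLACEHOLDER), ".")]


def segment(text):
    """Protect abbreviation periods, split on sentence ends, restore periods."""
    protected = apply_rules(text, PROTECT_RULES)
    parts = re.split(r"(?<=[.!?])\s+", protected.strip())
    return [apply_rules(part, RESTORE_RULES) for part in parts if part]


print(segment("Dr. Smith arrived. He sat down."))
```

Keeping rules as plain data and the functions pure means each rule can be ported and tested one at a time.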

Let me know if you run into any bits that are confusing and I'll see if I can remember, though it was a while ago now.

nipunsadvilkar commented 5 years ago

Thanks a lot for such a descriptive reply.

I have also installed a Ruby plugin for VS Code. Debug mode helps me understand methods by stepping over each line, and yes, I also found the functionality of unless in Ruby a bit confusing.

I have started development; let's see how it pans out. Thanks again, I appreciate your help.

nipunsadvilkar commented 4 years ago

Hey @EliotJones, I just wanted to let you know that I have successfully ported it to Python. I have named it pySBD - Python Sentence Boundary Disambiguation (SBD). Currently v0.1.1 only supports the English language. I will release support for other languages soon.

Do check it out and let me know. Thanks for the above inputs 😄

EliotJones commented 4 years ago

Hi @nipunsadvilkar, sorry I never got round to responding. I've had a look and it looks very impressive. I'm glad you were able to make the port successfully, and it seems like you have an active and growing community.

chopinml commented 3 years ago

Hello @EliotJones and @nipunsadvilkar ,

Thank you for your efforts to port the Ruby library to C# and Python.

Do you see any benefit in porting it to JavaScript (Node.js) as well? I also wonder two things:

1) Is the main Ruby Pragmatic Segmenter repository being updated frequently?
2) Do you still watch the main Ruby repository so that you can port changes to your ported versions?

Congrats on your effort!

EliotJones commented 3 years ago

Hi @chopinml, I don't see any reason not to port it to JS. Having packages that provide sentence detection helps every ecosystem. They're obviously in higher demand in Python than in C#, but I think there are more people using JS for things like ML/NLP than .NET.

  1. My understanding is that, outside of any essential maintenance, the main repository is not being updated frequently. That's not because it has been abandoned so much as there's not much to do and the maintainer has other things they're working on. Outside of people submitting issues with specific examples of problematic text, the library is basically "complete".
  2. I don't, for much the same reason. Unless people have examples where it fails badly, the English language doesn't change much for things like sentence detection, and I don't speak other languages, so unless people want to submit PRs the project is effectively "finished". However, it looks like @nipunsadvilkar's pySBD is more active, so it is probably making good progress on other languages and new development.