Adding Lancaster Stemmer, Vowels utility class and IDataReaderAdapter for importing data sets

angeloskath commented 11 years ago

Again, I appreciate your work but I will unfortunately have to ask you to wait a bit for me. I am in the process of moving most tests to phpunit and deciding which of my current tests are actually functional instead of unit tests and how I am going to handle them. After that, I will also reformat the code with php coding standards fixer and then I will create a develop branch.

Now regarding the pull request.

Why keep the rules in a different json file instead of keeping them in code like most Lancaster Stemmer implementations?
What is the point of the IDataReaderInterface? NlpTools has Document and TrainingSet to abstract away the actual data (training data, test data). PHP has Iterator interface. I would suggest if you want to add a json reader simply to add a JsonDocument. Although if you look at the first point, what I am really suggesting is to not have a file dependency at all.
NlpTools\Utils\Vowel should really be NlpTools\Utils\EnglishVowels. I think PorterStemmer uses the same concept of vowels so, although otherwise I would think that such a small abstraction is an overkill, I believe it could also be used elsewhere in the system.
This one is something you wouldn't possibly know (not pushed yet), I am adding the tests in the same namespace as the class that is being tested, for instance PennTreeBankTokenizerTest is now in NlpTools\Tokenizers namespace.

So to sum up, I will push the develop branch in due time. Branch off of there, apply the above changes (or argue with me :-) ) and then I 'll merge that one.

yooper commented 11 years ago

Hello,

I have no issues with you questioning my design decisions and implementation. Let me try to answer your questions:

I prefer to keep the rules in a separate file because other developers, adopters of nlp tools may have custom rulesets. By decoupling the ruleset from the algorithm implementation others can pass in their existing ruleset without needing to extend the Lancaster Stemmer class.
IDataReaderAdapter is an interface, it is a contract. I decided to implement the Lancaster ruleset in json, others might have their custom rule set in a CSV or XML file. The contract allows developers to import their ruleset regardless of how it is stored, they just need to write a concrete class, ie my JsonDataReaderAdapter, to turn the external file into a compatible data set that can be consumed by LancasterStemmer. This is just a good practice.
Yes, this should be a separate class and EnglishVowel is a good name for it. I will wrap this class with a factory class, so future implementations in other languages can be created easily.
This is a preference based issue. I will change the test class names and namespaces. I always prefer the Tests namespace.

Thanks, --Dan

On Tue, Aug 20, 2013 at 1:23 PM, Angelos Katharopoulos < notifications@github.com> wrote:

Again, I appreciate your work but I will unfortunately have to ask you to wait a bit for me. I am in the process of moving most tests to phpunit and deciding which of my current tests are actually functional instead of unit tests and how I am going to handle them. After that, I will also reformat the code with php coding standards fixerhttps://github.com/fabpot/PHP-CS-Fixerand then I will create a develop branch.

Now regarding the pull request.

Why keep the rules in a different json file instead of keeping them in code like most Lancaster Stemmer implementations?

What is the point of the IDataReaderInterface? NlpTools has Document and TrainingSet to abstract away the actual data (training data, test data). PHP has Iterator interface. I would suggest if you want to add a json reader simply to add a JsonDocument. Although if you look at the first point, what I am really suggesting is to not have a file dependency at all.

NlpTools\Utils\Vowel should really be NlpTools\Utils\EnglishVowels. I think PorterStemmer uses the same concept of vowels so, although otherwise I would think that such a small abstraction is an overkill, I believe it could also be used elsewhere in the system.

This one is something you wouldn't possibly know (not pushed yet), I am adding the tests in the same namespace as the class that is being tested, for instance PennTreeBankTokenizerTest is now in NlpTools\Tokenizers namespace.

So to sum up, I will push the develop branch in due time. Branch off of there, apply the above changes (or argue with me :-) ) and then I 'll merge that one.

— Reply to this email directly or view it on GitHubhttps://github.com/angeloskath/php-nlp-tools/pull/5#issuecomment-22961868 .

angeloskath commented 11 years ago

Hello,

I am not arguing about coupling the rules with the class. I am just saying that what the class needs is a collection of rules, so it should be passed a collection of rules. I propose a mixed parameter, either an instance of Traversable or an array. if (!(is_array($rules) || $rules instanceof \Traversable)) { throw new \InvalidArgumentException('...'); } also the default rules could be a private static variable used when $rules is null.
I know what an interface is and what its point in OOP is. What I wrote above is simply that I believe the IDataReader interface is not needed because it
- Overlaps with other NlpTools interfaces
- Should not be used in the first place because it is not what the stemmer cares about. The stemmer cares about rules. I should be able to do this $stemmer = new LancasterStemmer(array(....rules here....)); for testing and flexibility.
- I am not mentioning error handling for the implementation of JsonReaderAdapter because I believe we don't need it anyway.
Good, we might also need other useful methods (especially for other languages). This will need a bit of research but it might turn out quite useful. For instance I use almost constantly a helper for lower casing greek words because I also want to remove accent and special end of word characters.
Thanks, it is indeed a preference issue.

Just as a general direction for the library. I have no intention of having resources of any kind in the repository (except for testing). No prebuilt models, datasets, etc. That is an additional issue that I have with the external file ruleset.

Angelos

yooper commented 11 years ago

Per your requests I updated my pull request with your suggestions.

angeloskath commented 11 years ago

Thanks, I am sorry that I will have you doing some more changes.

Code changes

I suggest a protected static variable instead of the member function getRuleSet(). Also the constructor could be receiving a rule set array as a parameter and in the case of an empty array default to the static variable ruleSet.
I also suggest naming the utility classes Vowels (EnglishVowels, VowelsAbstractFactory). I feel I am nitpicking with this suggestion but since it doesn't feel right and there is the chance you agree with me, why not.

Regarding git usage

I have done quite a bad job keeping my git history clean and I would like to improve on that. I suggest keeping the develop branch synced with origin. If you want to make an addition make a feature branch and make a pull request from there. I will also be doing that locally so even though I will not be making pull requests to myself I will have commits merged with --no-ff from local private feature branches.

So regarding this pull request I suggest we close it. Then to your local repo:

Create another branch copy of your develop
Go back to develop, reset to a state previous to the current one online
Pull the online develop
Create a feature branch e.g. lancaster-stemmer
Reapply your changes and create a new pull request

yooper commented 11 years ago

I am closing this pull request and I will resubmit the PR against the develop branch

angeloskath / php-nlp-tools

Adding Lancaster Stemmer, Vowels utility class and IDataReaderAdapter for importing data sets #5

Code changes

Regarding git usage