PicNet / PicNetML

PicNetML is a .Net wrapper for the Weka project.
http://www.picnet.com.au
GNU General Public License v3.0

Integrating Stemmers #1

Open aolney opened 10 years ago

aolney commented 10 years ago

I tried a number of approaches. For example, I put snowball-20051019.jar in lib\weka\packages*.jar and re-ran ikvmc, which puts the stemmers in the same weka.dll as everything else.

Although code like this succeeds

let found = java.lang.Class.forName("org.tartarus.snowball.ext.porterStemmer")

The "discovery" code used by the stemmers fails, e.g.

let goe = weka.gui.GenericObjectEditor.getClassnames("org.tartarus.snowball.SnowballProgram");

let cd = weka.core.ClassDiscovery.find("org.tartarus.snowball.SnowballProgram","org.tartarus.snowball.ext")

Both calls discover nothing. It seems that no plugins are being loaded for GOE.
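The contrast above can be put side by side in one small diagnostic. This is only a sketch: it assumes the project references the weka.dll produced by ikvmc (with the snowball jar included) plus the IKVM runtime assemblies. The `java.class.path` printout is a guess at something worth inspecting, not a confirmed cause.

```fsharp
// Sketch only: assumes weka.dll (built by ikvmc with the snowball jar)
// and the IKVM runtime assemblies are referenced by the project.
let baseClass = "org.tartarus.snowball.SnowballProgram"
let extPackage = "org.tartarus.snowball.ext"

// Direct lookup by name (reported above to succeed).
let direct = java.lang.Class.forName(extPackage + ".porterStemmer")
printfn "Class.forName found: %s" (direct.getName())

// Package-based discovery (reported above to find nothing).
let discovered = weka.core.ClassDiscovery.find(baseClass, extPackage)
printfn "ClassDiscovery.find returned %d class name(s)" (discovered.size())

// The GOE lookup used by the stemmer wrapper.
let fromGoe = weka.gui.GenericObjectEditor.getClassnames(baseClass)
printfn "GenericObjectEditor.getClassnames returned %d class name(s)" (fromGoe.size())

// Assumption: the classpath seen by the Java side may matter for discovery,
// so it is printed here for inspection only.
printfn "java.class.path = %s" (java.lang.System.getProperty("java.class.path"))
```

If the first line prints a class name while both counts are zero, the classes are present in the assembly and only the discovery step is failing.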

Any suggestion on how to make this work properly?

For the curious, here is a workaround that uses reflection:

open System.Reflection

// Lets us set a private field in general
let dynamicSet (x: obj) propName propValue =
    let flags =
        BindingFlags.Static ||| BindingFlags.FlattenHierarchy ||| BindingFlags.Instance
        ||| BindingFlags.NonPublic ||| BindingFlags.Public
    let field = x.GetType().GetField(propName, flags)
    field.SetValue(x, propValue)

// Get a stemmer wrapper; it won't find any implementations by default
let stemmer = new weka.core.stemmers.SnowballStemmer()
// In its init function, the stemmer should have populated a list of available stemmers. We do that manually.
let options = new java.util.Vector()
options.add("porter") |> ignore
// Use reflection to set the private list of available stemmers
dynamicSet stemmer "m_Stemmers" options
// setStemmer checks this list and, if it finds "porter", connects the wrapper to the implementation
stemmer.setStemmer("porter")

gatapia commented 10 years ago

I would have to play with this to see. I have not used the weka.gui namespace, so I'm not even sure what GOE is supposed to do. I assume it sets properties on algorithms? Why not just set them manually?

Or is this some kind of auto-initialization that is not happening? Does Java Weka go through all the jars and find a list of available stemmers? Perhaps the IKVM process broke that introspection. I would have to look into the Weka source to see what's going on there. I'll see if I can look into this in the coming weeks.

aolney commented 10 years ago

The process is somewhat explained here, though I think the doc is out of date:

http://weka.sourceforge.net/doc.dev/weka/core/stemmers/SnowballStemmer.html

Basically there is a "wrapper" for the stemmer that is included in weka, but the implementation is in another jar.

Somehow, during initialization, the dynamic loader is supposed to find the base class implementation for all the stemmers. That's this line:

let goe = weka.gui.GenericObjectEditor.getClassnames("org.tartarus.snowball.SnowballProgram");

GOE apparently uses a config file called GenericObjectEditor.props. I pulled this out of the weka.jar distributed with PicNetML and put it in the bin/Debug of my application. This reduces some Intellitrace error messages, which suggests to me that the config is being found.

The GOE.props file has a section for stemmers but not for the implementation, so I've tried it as is and with the following lines added:

org.tartarus.snowball.SnowballProgram=\
 org.tartarus.snowball.ext.porterStemmer

Similarly I've extracted GenericPropertiesCreator.props and put it in the bin/Debug folder and tried making changes there, including

UseDynamic=false OR UseDynamic=true

and uncommenting

org.tartarus.snowball.SnowballProgram=\
 org.tartarus.snowball.ext

Turning off dynamic discovery should have enabled this line to work:

let cd = weka.core.ClassDiscovery.find("org.tartarus.snowball.SnowballProgram","org.tartarus.snowball.ext")

but it doesn't seem to.

I installed a trial of Reflector.Net Pro, which lets me "step into sources" on dlls. I thought maybe I could step into the line that initialized the stemmer to see what's going wrong.

This partly works: only about 50% of the sources decompile cleanly and the rest show up as IL. However, it does seem that the GenericObjectEditor calls a PluginManager, and the PluginManager never has anything loaded. This suggests to me that either the properties file isn't being read, or the plugin loading isn't happening even though the properties file is read.
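One way to separate those two possibilities might be to ask Weka directly what it reads from the props file. The sketch below assumes weka.dll and the IKVM runtime are referenced, and it relies on `weka.core.Utils.readProperties` resolving the resource the same way GOE does, which is an assumption on my part.

```fsharp
// Sketch only: assumes weka.dll and the IKVM runtime assemblies are referenced.
// Assumption: Utils.readProperties resolves the props file the same way GOE does.
let props = weka.core.Utils.readProperties("weka/gui/GenericObjectEditor.props")
let key = "org.tartarus.snowball.SnowballProgram"

printfn "Loaded %d propert(y/ies) from GenericObjectEditor.props" (props.size())

// getProperty returns null when the key is absent.
match props.getProperty(key) with
| null -> printfn "%s: no entry found" key
| value -> printfn "%s = %s" key value
```

If the added entry shows up here but `getClassnames` still returns nothing, the file is being read and the problem is more likely in the plugin loading step.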

This has another implication, which is that "any" plugin will fail to load. This makes me wonder if any of the other "packages" that have been distributed with PicNetML, such as SMOTE, are being properly discovered.

Could you recommend a test case for me to try with these? That would help establish if this is something stemmer specific.

Here's a case that suggests there might be something wrong with the stemmer jar specifically, though it is merely suggestive:

http://stackoverflow.com/questions/17238184/weka-and-snowball-dont-work-when-exported-in-jar

gatapia commented 10 years ago

SMOTE, liblinear, etc. as included work fine. However, when a new DLL is integrated into weka.dll I rerun the code generator. Whatever the issue with stemmers is, I'm sure it can be addressed there (at the code gen stage). Perhaps I will just use your workaround code to initialise stemmers I find through reflection. This will have to wait, though. I'm currently away for the holidays. I'll look into it soon.

aolney commented 10 years ago

I took a look at the code generation, and it's not clear to me how re-running it would solve the problem. It seems that of the existing packages, liblinear-1.92 is closest b/c it has a non weka-rooted classpath (it is rooted in de). However I'm not sure these packages are loaded by weka the same way as the stemmers.

Right now I'm wondering if the problem is in PicNetML or in ikvm. I suppose the best test scenario would be to implement StringToWordVector using the weka.dll produced by ikvm directly (without PicNetML) and see if the problem still appears.
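A minimal version of that test might look like the following. It is a sketch under stated assumptions: only the ikvmc-generated weka.dll and the IKVM runtime are referenced (no PicNetML), and the word "running" is just an illustrative input.

```fsharp
// Sketch only: references the ikvmc-generated weka.dll directly, without PicNetML.
let stemmer = new weka.core.stemmers.SnowballStemmer()

// List whatever stemmers the wrapper discovered during initialization.
let names = weka.core.stemmers.SnowballStemmer.listStemmers()
while names.hasMoreElements() do
    printfn "available stemmer: %O" (names.nextElement())

// Try to connect the wrapper to the porter implementation, with no reflection workaround.
stemmer.setStemmer("porter")
printfn "selected stemmer: %s" (stemmer.getStemmer())
printfn "stem of 'running': %s" (stemmer.stem("running"))

// Plug the stemmer into the filter the same way an application would.
let filter = new weka.filters.unsupervised.attribute.StringToWordVector()
filter.setStemmer(stemmer)
```

If no stemmers are listed and the word comes back unchanged here too, the problem would seem to sit in the ikvm-produced weka.dll rather than in PicNetML.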

aolney commented 10 years ago

Another follow up -- do you have a reference or doc for some of your design decisions in creating this wrapper?

I'd like to find out more about how you did the code gen.

On 12/21/2013 02:27 PM, Guido Tapia wrote:

SMOTE, liblinear, etc. as included work fine. However, when a new DLL is integrated into weka.dll I rerun the code generator. Whatever the issue with stemmers is, I'm sure it can be addressed there (at the code gen stage). Perhaps I will just use your workaround code to initialise stemmers I find through reflection. This will have to wait, though. I'm currently away for the holidays. I'll look into it soon.
