burrsettles / dualist

Interactive machine learning for text analysis

Null pointer error due to empty alphabet size? #1

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. create series of textual lines, e.g.
NTR|the old man died suddenly.
NTR|Just a test.
NEG|boah, what utter bullocks!
NTR|maybe yes, maybe no.
POS|great!!
NTR|oh, I am surprised.
NTR|another test sentence to serve as training instance
NEG|bah, that totally sucks BIG TIME!!!

or (unlabeled)

the old man died suddenly.
Just a test.
boah, what utter bullocks!
maybe yes, maybe no.
great!!
oh, I am surprised.
another test sentence to serve as training instance
NEG|bah, that totally sucks BIG TIME!!!

2. launch Dualist, go to Web GUI
3. set parameters as shown in attached screen capture

What is the expected output? What do you see instead?

Execution exception
NullPointerException occured : null

In /app/controllers/Application.java (around line 101)

97:         Cache.set(session.getId()+"-startTime", (System.currentTimeMillis()/1000), "90mn" );
98:
99:         Cache.set(session.getId()+"-explore", true, "90mn");
100:
101:         Logger.info("|featureSet|=%s", dataAlphabet.size());
102:         Logger.info("|labelSet|=%s", labelAlphabet.size());
103:         Logger.info("|dataSet|=%s", ilist.size());
104:         Logger.info("User: %s", username);
105:
106:         clearResult();
107:         logResult("%% |featureSet|=" + dataAlphabet.size());
This exception has been logged with id 67nj9836a

What version of the product are you using? On what operating system?
Dualist 0.1 on MacOS 10.6.7.

Please provide any additional information below.

I also tried with various other separators ('#', '|', '||', ' ', '\t') - the 
result is unchanged. The error appears to be caused by an empty set of labels, 
but I've entered them using a comma as separator as suggested.

Original issue reported on code.google.com by jlleid...@gmail.com on 20 Sep 2011 at 12:33


GoogleCodeExporter commented 9 years ago
Hi!

It looks to me like what you want is to use the "Simple Lines" pipe (where each 
line is treated as a bag-of-words), rather than "Entities" (where you can 
specify arbitrary features and their values). Entities takes input of the form:

    noun phrase <tab> feature1||value1 <tab> feature2||value2 ... etc.

for example:

    directory listing   line_of_X||5    it_'s_X||5  site_is_X||3 ...
    Quigley name_is_X||5    X_graduated_with||3 X_'s_view||3 ...
    twin    you_use_X||3    wages_for_X||27 versions_of_X||3 ...
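
To make the format concrete, here is a minimal sketch of how a line in that shape could be split into a name and its feature/value pairs. It is an illustration only, not DUALIST's actual pipe: the class `EntityLineParser` and its methods are hypothetical, and it assumes fields are separated by single tab characters and values are numeric.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of DUALIST): parses one "Entities"-style line. */
final class EntityLineParser {
    private EntityLineParser() {}

    /** Returns the noun phrase, i.e. everything before the first tab. */
    static String parseName(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    /** Returns feature -> value for every "feature||value" field after the name. */
    static Map<String, Double> parseFeatures(String line) {
        String[] fields = line.split("\t");
        Map<String, Double> features = new LinkedHashMap<>();
        // fields[0] is the noun phrase; the remaining fields are feature||value pairs.
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            if (field.isEmpty()) {
                continue; // tolerate doubled tabs
            }
            int sep = field.lastIndexOf("||");
            if (sep < 0) {
                throw new IllegalArgumentException("Missing '||' in field: " + field);
            }
            String feature = field.substring(0, sep);
            double value = Double.parseDouble(field.substring(sep + 2));
            features.put(feature, value);
        }
        return features;
    }
}
```

Under these assumptions, a plain sentence such as `NTR|the old man died suddenly.` has no tab-separated feature fields, so a parser like this one would return an empty feature map for it. That is consistent with an empty feature set, though whether it is the exact cause of the exception above is a guess.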

Furthermore, if the data is labeled, you need to separate instances into 
FOLDERS with the corresponding labels, as per the README.txt instructions. The 
individual lines do not contain labels, only the text itself. The labels come 
from the folder structure.
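
As a rough sketch of that conversion, the following groups `LABEL|text` lines by label and writes the bare text into one folder per label. The class `LabelFolderSplitter`, the file name `lines.txt`, and the one-file-per-folder layout are all assumptions made for illustration; the README.txt remains the authority on the layout DUALIST expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of DUALIST): splits "LABEL|text" lines by label. */
final class LabelFolderSplitter {
    private LabelFolderSplitter() {}

    /** Groups the text of each labeled line under its label; unlabeled lines are skipped. */
    static Map<String, List<String>> groupByLabel(List<String> lines) {
        Map<String, List<String>> byLabel = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf('|');
            if (sep <= 0) {
                continue; // no label prefix on this line
            }
            String label = line.substring(0, sep);
            String text = line.substring(sep + 1);
            byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(text);
        }
        return byLabel;
    }

    /** Writes outDir/LABEL/lines.txt for each label (file name is an assumption). */
    static void writeFolders(Map<String, List<String>> byLabel, Path outDir) throws IOException {
        for (Map.Entry<String, List<String>> entry : byLabel.entrySet()) {
            Path dir = Files.createDirectories(outDir.resolve(entry.getKey()));
            Files.write(dir.resolve("lines.txt"), entry.getValue(), StandardCharsets.UTF_8);
        }
    }
}
```

With the sample data from the report, this would produce folders named `NTR`, `NEG`, and `POS`, each holding only the sentence text.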

In exploration mode, you do not have to have these labels/folders. They are 
ignored and the set of labels entered in the "class labels" box is used.

In experiment mode, the labels/folders are used for evaluation of the annotator 
and the learned model.

Note that for the moment there's no way to "jumpstart" DUALIST with mixed 
labeled and unlabeled documents. It's something sort of planned for the future, 
but not off the ground yet.

Hope that helps!

Original comment by burrsett...@gmail.com on 20 Sep 2011 at 3:22

GoogleCodeExporter commented 9 years ago
Thanks a lot, Burr! Now the lines pipe gets me further.

If I use _only_ labeled data in exploration mode and ZIP it up using the 
folder structure described in the README.txt (example file attached), I get 
the following:

--------------------------------------------------------------
Execution exception
ArrayIndexOutOfBoundsException occured : 22

In /app/controllers/Application.java (around line 321)

317:             Logger.info("PASSIVE QUERYING");
318:             logResult(timeSoFar+"\tPASSIVE");
319:             queryInstances = Queries.randomInstances(unlabeledSet, numInstances );
320:             //         queryFeatures = Queries.randomFeaturesPerLabel(labeledFeatures, unlabeledSet, 50);
321:             queryFeatures = Queries.commonFeaturesPerLabel(labeledFeatures, unlabeledSet, 100);
322:         }
323:         // otherwise, query actively
324:         else {
325:             Logger.info("ACTIVE QUERYING");
326:             logResult(timeSoFar+"\tACTIVE");
327:             queryInstances = Queries.queryInstances(nbModel, unlabeledSet, numInstances, "entropy" );
--------------------------------------------------------------

(My assumption is that the folders that represent class names need to be there 
but are ignored as you say above.)

Do you have a "hello world" example (with associated GUI settings) that runs 
that you could share?

Original comment by jlleid...@gmail.com on 20 Sep 2011 at 7:25


GoogleCodeExporter commented 9 years ago
Ah! This is because I didn't envision someone testing DUALIST with such a small 
corpus. There are apparently only 22 words in your test documents, and the 
method that tries to select features for labeling tries to find the top 100 
features per label (this is hard-coded).

A fix for this is now on the TODO list.

In the meantime, just try a dummy data set with a vocabulary greater than 100.
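
For illustration, the kind of guard that avoids this class of error is to clamp the requested count to the number of features actually available. The sketch below is not DUALIST's code: the class `TopFeatures` and the method `topK` are hypothetical, and the counts map stands in for whatever per-label statistics the real method uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of DUALIST): picks the k most frequent features safely. */
final class TopFeatures {
    private TopFeatures() {}

    /** Returns up to k feature names, most frequent first; never reads past the vocabulary. */
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count, largest first; the sort is stable, so ties keep map order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        // Clamp k so a small vocabulary (e.g. 22 words) cannot overrun the list.
        int limit = Math.min(Math.max(k, 0), entries.size());
        List<String> top = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

Calling `topK(counts, 100)` on a 22-word vocabulary would then return all 22 features instead of throwing.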

Original comment by burrsett...@gmail.com on 20 Sep 2011 at 8:19