Open GoogleCodeExporter opened 9 years ago
Hi!
It looks to me like what you want is to use the "Simple Lines" pipe (where each
line is treated as a bag-of-words), rather than "Entities" (where you can
specify arbitrary features and their values). Entities takes input of the form:
noun phrase <tab> feature1||value1 <tab> feature2||value2 ... etc.
for example:
directory listing line_of_X||5 it_'s_X||5 site_is_X||3 ...
Quigley name_is_X||5 X_graduated_with||3 X_'s_view||3 ...
twin you_use_X||3 wages_for_X||27 versions_of_X||3 ...
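
To make the "Entities" layout concrete, here is a minimal Java sketch that splits one such line into the noun phrase and its feature/value pairs. It is only an illustration of the format described above, not DUALIST's own parser: the class name EntityLine and its fields are hypothetical, and treating the values as numbers is an assumption based on the examples. Note that the tab separators in the examples above appear as plain spaces in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of DUALIST): parses one line of the form
//   noun phrase <tab> feature1||value1 <tab> feature2||value2 ...
final class EntityLine {
    final String nounPhrase;
    final Map<String, Double> features;

    private EntityLine(String nounPhrase, Map<String, Double> features) {
        this.nounPhrase = nounPhrase;
        this.features = features;
    }

    static EntityLine parse(String line) {
        String[] fields = line.split("\t");
        String phrase = fields[0].trim();
        if (phrase.isEmpty()) {
            throw new IllegalArgumentException("Missing noun phrase in line: " + line);
        }
        // LinkedHashMap keeps the features in the order they appear on the line.
        Map<String, Double> features = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            if (field.isEmpty()) {
                continue; // tolerate stray tabs
            }
            // Split on the LAST "||" so that feature names may contain '|'.
            int sep = field.lastIndexOf("||");
            if (sep < 0) {
                throw new IllegalArgumentException("Missing '||' in field: " + field);
            }
            String name = field.substring(0, sep);
            // Assumption: values are numeric, as in the examples above.
            double value = Double.parseDouble(field.substring(sep + 2));
            features.put(name, value);
        }
        return new EntityLine(phrase, features);
    }
}
```

For example, EntityLine.parse("Quigley\tname_is_X||5\tX_graduated_with||3") yields the noun phrase "Quigley" with two features.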
Furthermore, if the data is labeled, you need to separate instances into
FOLDERS with the corresponding labels, as per the README.txt instructions. The
individual lines do not contain labels, only the text itself. The labels come
from the folder structure.
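
As a rough illustration of "labels come from the folder structure", the Java sketch below walks a corpus directory and assigns each non-empty line the name of the folder its file sits in. The exact layout DUALIST expects is defined in README.txt and is not reproduced in this thread, so the one-level "label folder contains text files" layout, the class name FolderLabels, and its methods are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of DUALIST): derives labels from folder names.
final class FolderLabels {
    static final class Labeled {
        final String label;
        final String text;

        Labeled(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    // Assumption: the label is the name of the file's immediate parent folder.
    static String labelOf(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("File is not inside a label folder: " + file);
        }
        return parent.getFileName().toString();
    }

    // Reads every regular file under root; each non-empty line is one instance.
    static List<Labeled> read(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Labeled> instances = new ArrayList<>();
        for (Path file : files) {
            String label = labelOf(file);
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String text = line.trim();
                if (!text.isEmpty()) {
                    instances.add(new Labeled(label, text));
                }
            }
        }
        return instances;
    }
}
```

The point of the sketch is that the lines themselves carry no label; moving a file to a different folder is what changes the label of its lines.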
In exploration mode, you do not have to have these labels/folders. They are
ignored and the set of labels entered in the "class labels" box is used.
In experiment mode, the labels/folders are used for evaluation of the annotator
and the learned model.
Note that for the moment there's no way to "jumpstart" DUALIST with mixed
labeled and unlabeled documents. It's something sort of planned for the future,
but not off the ground yet.
Hope that helps!
Original comment by burrsett...@gmail.com
on 20 Sep 2011 at 3:22
Thanks a lot, Burr! Now the lines pipe gets me further.
If I'm using _only_ labeled data in exploration mode, and if I ZIP it up using
the folder structure described in the README.txt (attached the example file),
then I get the following:
--------------------------------------------------------------
Execution exception
ArrayIndexOutOfBoundsException occured : 22
In /app/controllers/Application.java (around line 321)
317: Logger.info("PASSIVE QUERYING");
318: logResult(timeSoFar+"\tPASSIVE");
319: queryInstances = Queries.randomInstances(unlabeledSet, numInstances );
320: // queryFeatures = Queries.randomFeaturesPerLabel(labeledFeatures, unlabeledSet, 50);
321: queryFeatures = Queries.commonFeaturesPerLabel(labeledFeatures, unlabeledSet, 100);
322: }
323: // otherwise, query actively
324: else {
325: Logger.info("ACTIVE QUERYING");
326: logResult(timeSoFar+"\tACTIVE");
327: queryInstances = Queries.queryInstances(nbModel, unlabeledSet, numInstances, "entropy" );
--------------------------------------------------------------
(My assumption is that the folders that represent class names need to be there
but are ignored as you say above.)
Do you have a "hello world" example (with associated GUI settings) that runs
that you could share?
Original comment by jlleid...@gmail.com
on 20 Sep 2011 at 7:25
Ah! This is because I didn't envision someone testing DUALIST with such a small
corpus. There are apparently only 22 distinct words in your test documents, and
the method that selects features for labeling tries to find the top 100
features per label (this is hard-coded), so it runs off the end of the
vocabulary at index 22.
A fix for this is now on the TODO list.
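
One possible shape for such a fix, sketched in Java under stated assumptions: clamp the requested number of features to the number actually available before indexing. This is not the DUALIST code (the internals of Queries.commonFeaturesPerLabel are not shown in this thread); the class TopFeatures and its topK method are hypothetical and only demonstrate the bounds check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not DUALIST's implementation): top-k selection that
// never asks for more features than the vocabulary contains.
final class TopFeatures {
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken alphabetically for a stable order.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // The clamp: with 22 features and k = 100, only 22 are returned
        // instead of indexing past the end of the list.
        int limit = Math.min(Math.max(k, 0), entries.size());
        List<String> top = new ArrayList<>(limit);
        for (int i = 0; i < limit; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With a 22-word vocabulary, topK(counts, 100) returns all 22 features rather than throwing.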
In the meantime, just try a dummy data set with a vocabulary greater than 100.
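
If it helps, a dummy data set of that kind can be generated rather than written by hand. The Java sketch below produces lines of synthetic, letters-only tokens that cycle through a vocabulary of a chosen size. Everything in it (the class DummyCorpus, the "tok" prefix, the parameters) is made up for illustration, and the lines would still need to be saved into the folder structure described in README.txt.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical generator (not part of DUALIST) for a toy corpus whose
// vocabulary is larger than the hard-coded limit of 100 features.
final class DummyCorpus {
    // Maps an index to a unique letters-only token, e.g. 0 -> "toka", 26 -> "tokab".
    static String word(int n) {
        StringBuilder sb = new StringBuilder("tok");
        do {
            sb.append((char) ('a' + n % 26));
            n /= 26;
        } while (n > 0);
        return sb.toString();
    }

    // Builds numLines lines of wordsPerLine tokens, cycling through vocabSize
    // distinct tokens. Every token appears at least once as long as
    // numLines * wordsPerLine >= vocabSize.
    static List<String> lines(int numLines, int wordsPerLine, int vocabSize) {
        List<String> out = new ArrayList<>(numLines);
        int next = 0;
        for (int i = 0; i < numLines; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < wordsPerLine; j++) {
                if (j > 0) {
                    sb.append(' ');
                }
                sb.append(word(next % vocabSize));
                next++;
            }
            out.add(sb.toString());
        }
        return out;
    }

    // Counts distinct whitespace-separated tokens, to check the corpus before use.
    static int vocabularySize(List<String> lines) {
        Set<String> vocab = new HashSet<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    vocab.add(token);
                }
            }
        }
        return vocab.size();
    }
}
```

For instance, DummyCorpus.lines(50, 10, 150) gives 50 lines over a 150-word vocabulary, comfortably above the limit of 100.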
Original comment by burrsett...@gmail.com
on 20 Sep 2011 at 8:19
Original issue reported on code.google.com by
jlleid...@gmail.com
on 20 Sep 2011 at 12:33