AnantLabs / dkpro-tc

Automatically exported from code.google.com/p/dkpro-tc

CRFSuite CRFSuiteOutcomeIDReport #227

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
When I run the BrownPosDemoCRFSuite example, the id2outcome.txt file contains
three labels that are not actually in the label set: "Prediction", "#Gold", and
"". They are picked up from the "predictions.txt" file produced during testing.
The resulting id2outcome.txt file looks like this:

#ID=PREDICTION;GOLDSTANDARD
#labels  Prediction NPg JJ RB PPS TO DT RP RBR DOD JJT NR HV JJR NP NN VBN VB pct PPO BE HVD #Gold DTS MD WDT VBZ DTI AT BEZ IN ABX CS VBG VBD BEDZ QL NNS PPSS CC CD BER BEN AP WRB HVZ PPg
#Wed Dec 17 13:59:39 CET 2014

The error occurs in the method CRFSuiteOutcomeIDReport.getGoldAndPredictions(),
where the labels are collected from predictions.txt and subsequently assigned
to numerical IDs (method: createMappingLabel2Number(..)).
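
To illustrate the kind of filtering that would keep the header tokens and the empty string out of the label set, here is a minimal sketch. It is not the actual DKPro TC code: the class `PredictionLabelCollector` and its method are hypothetical, and it assumes that predictions.txt holds whitespace-separated gold/prediction pairs, a header line starting with `#`, and blank lines between sequences. It would also skip any real label line that begins with `#`, which is assumed not to occur.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of DKPro TC.
final class PredictionLabelCollector {

    /**
     * Collects the distinct labels from gold/prediction lines.
     * Assumption: header lines start with '#', and blank lines only
     * separate sequences, so neither contributes a label.
     */
    static Set<String> collectLabels(List<String> lines) {
        Set<String> labels = new TreeSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip sequence separators and the header line.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Every remaining token on the line is a label (gold or predicted).
            for (String token : trimmed.split("\\s+")) {
                labels.add(token);
            }
        }
        return labels;
    }
}
```

With such a filter in place, a header like "#Gold Prediction" and the empty separator lines would no longer end up in the `#labels` line.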

I was also wondering whether the numerical IDs in id2outcome.txt should be
identical to those in outcome-mapping.txt. With the current method, this may
not be guaranteed.

Original issue reported on code.google.com by christia...@googlemail.com on 17 Dec 2014 at 4:34

GoogleCodeExporter commented 9 years ago
Hi,
the 'headline' labels slipped into the label list. This is indeed a bug. 

The change to numerical values was implemented to accommodate the new
evaluation module, which expects numerical values in id2outcome.txt.
I was not aware that another file already assigns numerical values to the
labels separately.
It is probably best to read the outcome-mapping file instead of compiling a
separate mapping in the report.

As far as I can see from the demo, the outcome mapping for train and test is
(always?) the same. So it wouldn't matter which file I read, right?

Original comment by Tobias.H...@gmail.com on 18 Dec 2014 at 8:13

GoogleCodeExporter commented 9 years ago
I'm not really sure about the TC policy for determining this mapping. It seems
that the class names are just sorted, similar to the method
SmallContingencyTables.classNamesToMapping(..). Is that right?

If so, it might also make sense to sort the labels in the id2outcome.txt file
to avoid confusing users, or, maybe even better, to include the mapping in the
outcome file.
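
A sorted mapping of this kind could look roughly like the sketch below. This is only an illustration of the idea, not the behavior of SmallContingencyTables.classNamesToMapping(..), which is not shown in this thread: the class `SortedLabelMapping` is hypothetical, and starting the IDs at 0 is an assumption.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of DKPro TC.
final class SortedLabelMapping {

    /**
     * Assigns consecutive IDs to the distinct labels in lexicographic order.
     * Assumption: IDs start at 0; the real start index may differ.
     */
    static Map<String, Integer> fromLabels(Collection<String> labels) {
        // LinkedHashMap keeps the sorted insertion order when iterating.
        Map<String, Integer> mapping = new LinkedHashMap<>();
        int nextId = 0;
        // TreeSet removes duplicates and sorts the labels.
        for (String label : new TreeSet<>(labels)) {
            mapping.put(label, nextId++);
        }
        return mapping;
    }
}
```

Because the IDs depend only on the sorted set of labels, two files built from the same label set get the same mapping, while a different label set in train and test would still shift the IDs.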

Original comment by christia...@googlemail.com on 18 Dec 2014 at 8:20

GoogleCodeExporter commented 9 years ago
I wonder where else such a mapping is done. This looks like code duplication to
me: the feature extraction does it, the machine learning adapter does it, and
the evaluation module does it too.
Maybe this mapping should move somewhere else, as it becomes more important
with the new evaluation module?

Original comment by Tobias.H...@gmail.com on 18 Dec 2014 at 11:06

GoogleCodeExporter commented 9 years ago
The classlabel-to-number mapping in the id2outcome.txt (TestTask) and in 
outcome-mapping.txt (ExtractFeaturesTask) should be independent. There is no 
guarantee that they return the same mapping. Furthermore, the 
outcome-mapping.txt produced during training and testing can be different if 
there is a different set of classlabels in the train and test set.

I'm not sure what exactly outcome-mapping.txt is used for. Maybe Torsten can 
help here. 

I would opt not to mix the two mappings. Rather, I would suggest making the
mapping in id2outcome.txt explicit, i.e., instead of

#ID=PREDICTION;GOLDSTANDARD
#labels NPg JJ RB PPS TO ...

we should have 

#ID=PREDICTION;GOLDSTANDARD
#labels 1=NPg 2=JJ 3=RB 4=PPS 5=TO

This needs to be fixed within the evaluation module (I'll open a separate 
issue).
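
As a rough sketch of what writing and reading such an explicit header could look like: the class `ExplicitLabelHeader` below is hypothetical and not part of DKPro TC or the evaluation module. It assumes 1-based IDs and space-separated `id=label` entries, as in the example above, and that labels contain no whitespace.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of DKPro TC.
final class ExplicitLabelHeader {

    static final String PREFIX = "#labels";

    /** Builds a header line such as "#labels 1=NPg 2=JJ" (IDs start at 1). */
    static String format(List<String> labels) {
        StringJoiner joiner = new StringJoiner(" ");
        joiner.add(PREFIX);
        for (int i = 0; i < labels.size(); i++) {
            joiner.add((i + 1) + "=" + labels.get(i));
        }
        return joiner.toString();
    }

    /** Parses a header line back into an ID-to-label map. */
    static Map<Integer, String> parse(String headerLine) {
        if (!headerLine.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a #labels line: " + headerLine);
        }
        Map<Integer, String> mapping = new LinkedHashMap<>();
        String rest = headerLine.substring(PREFIX.length()).trim();
        if (rest.isEmpty()) {
            return mapping;
        }
        for (String entry : rest.split("\\s+")) {
            // Split at the first '=' so labels may themselves contain '='.
            int sep = entry.indexOf('=');
            if (sep < 0) {
                throw new IllegalArgumentException("Missing '=' in entry: " + entry);
            }
            mapping.put(Integer.parseInt(entry.substring(0, sep)), entry.substring(sep + 1));
        }
        return mapping;
    }
}
```

Storing the ID next to each label means a reader of id2outcome.txt does not have to rely on the position of a label in the header or on any other mapping file.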

The problem originally addressed in this issue is a bug in one of the CRFSuite 
reports and should be fixed there.

Original comment by daxenber...@gmail.com on 18 Dec 2014 at 3:02

GoogleCodeExporter commented 9 years ago
The last time I touched this, labels were still strings. So I cannot really 
help here. 

Original comment by torsten....@gmail.com on 18 Dec 2014 at 3:07

GoogleCodeExporter commented 9 years ago
OK, I'll filter out the two wrong labels and update the report as suggested in #4.

Original comment by Tobias.H...@gmail.com on 18 Dec 2014 at 3:33

GoogleCodeExporter commented 9 years ago

Original comment by Tobias.H...@gmail.com on 18 Dec 2014 at 4:11