dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core
Other
196 stars 67 forks source link

Stanford NER trainer raises helpful exception when no data received #1395

Closed alaindesilets closed 5 years ago

alaindesilets commented 5 years ago

I have a pipeline that trains Stanford NER. When I run it, I get the following stack trace:

java.lang.NullPointerException
    at org.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizerTrainer.collectionProcessComplete(StanfordNamedEntityRecognizerTrainer.java:252)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.collectionProcessComplete(PrimitiveAnalysisEngine_impl.java:343)
    at org.apache.uima.fit.util.LifeCycleUtil.collectionProcessComplete(LifeCycleUtil.java:49)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:230)
    at ca.nrc.nlp.dkpro.named_entities.NERTrainingPipeline.process(NERTrainingPipeline.java:98)
    at ca.nrc.nlp.dkpro.named_entities.NERTrainingPipelineTest.

The null pointer is attribute tempData, which is initialized in process() (at the same time as attribute out). But judging from the stacktrace, it seems that tempData is invoked before process().

A solution would be to have getters getOut() and getTempData() which would return out and tempData respectively, after making sure that they have been initialized if they are null at the moment where the getter is invoked.

I have implemented this approach and it seems to be working (unit tests for stanford still running)

reckart commented 5 years ago

I'd say this can only happen if process() was never called - i.e. if the StanfordNamedEntityRecognizerTrainer is used in a pipeline which processes zero documents. It doesn't make sense to train a model on zero documents. What do you think should be the expected behavior in such a situation? Do you think the pipeline should still complete without an error?

alaindesilets commented 5 years ago

On Tue, Jul 16, 2019 at 3:40 PM Richard Eckart de Castilho < notifications@github.com> wrote:

I'd say this can only happen if process() was never called - i.e. if the StanfordNamedEntityRecognizerTrainer is used in a pipeline which processes zero documents. It doesn't make sense to train a model on zero documents. What do you think should be the expected behavior in such a situation? Do you think the pipeline should still complete without an error?

Hum... that might be what is happening. I'll take a look.

If that is the case, then I would expect the trainer to raise an exception saying that no training documents were processed. The exception raised is just too unhelpful for trouble shooting the problem.

Alternatively, the pipeline could exit gracefully and just not produce a model at all.

Alain

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4DVZNPHD7MWTXPX2STP7YP3VA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2B5RBA#issuecomment-511957124, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4AJERH4GHTUETVICZ3P7YP3VANCNFSM4IEEDZCA .

reckart commented 5 years ago

I slightly tend towards checking if tempData==null in collectionProcessComplete() (meaning that process() was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.

alaindesilets commented 5 years ago

Makes sense. But first let me make sure that the null is indeed caused by the fact that no training documents were found. I'll get back to you on that.

On Wed, Jul 17, 2019 at 5:33 AM Richard Eckart de Castilho < notifications@github.com> wrote:

I slightly tend towards checking if tempData==null in collectionProcessComplete() (meaning that process() was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4BXQFBWG2O2LGGRXXLP73RO3A5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2DTZ2Y#issuecomment-512179435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4ELSVZEUGPUCWQ3RR3P73RO3ANCNFSM4IEEDZCA .

alaindesilets commented 5 years ago

I am trying to write a unit test that invokes the StanfordNamedEntityRecognizerTrainer on an empty collection of training documents. But I can't figure out how to create an empty dataset.

In StanfordNamedEntityRecognizerTrainerTest, datasets are created by invoking

DatasetFactory loader = new DatasetFactory(DkproTestContext. getCacheFolder());

ds = loader.load("germeval2014-de");

I tried these, but they raise exceptions:

ds = new Dataset(); // Cannot instantiate Dataset

ds = loader.load(null); // Unknown dataset null

ds = loader.load(""); // Unknown dataset []

ds = loader.load(); // load requires an argument

I looked in dkpro-core-api-datasets-asl to see if I could add a new empty dataset to the factory, but it looks pretty complicated to do.

Any advice on how I might achieve that?

Thx.

On Wed, Jul 17, 2019 at 6:55 AM Alain Désilets alaindesilets0@gmail.com wrote:

Makes sense. But first let me make sure that the null is indeed caused by the fact that no training documents were found. I'll get back to you on that.

On Wed, Jul 17, 2019 at 5:33 AM Richard Eckart de Castilho < notifications@github.com> wrote:

I slightly tend towards checking if tempData==null in collectionProcessComplete() (meaning that process() was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4BXQFBWG2O2LGGRXXLP73RO3A5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2DTZ2Y#issuecomment-512179435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4ELSVZEUGPUCWQ3RR3P73RO3ANCNFSM4IEEDZCA .

reckart commented 5 years ago

The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)) and then calls collectionProcessComplete() on it.

alaindesilets commented 5 years ago

I was going more for a public API-level test that ensures that the class will fail gracefully if you try to train it on an empty collection.

What you propose is a lower level test that checks that the class fails gracefully if collectionProcessComplete gets invoked before process().

I think a public API level test is more readable and its intent is clearer so I would like to stick with that approach.

Surely there must be an easier way to create a Dataset instance than going through all that DatasetFactory, yml files, etc... stuff? If not, maybe I should implement one?

Alain

On Wed, Jul 17, 2019 at 5:00 PM Richard Eckart de Castilho < notifications@github.com> wrote:

The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)) and then calls collectionProcessComplete() on it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4CCGIMXLASHNQGZ3STP76CADA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2GSMYA#issuecomment-512566880, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4F4MPXURGQM32LC643P76CADANCNFSM4IEEDZCA .

alaindesilets commented 5 years ago

OK, I found a way to do this. I create a temporary empty folder and then pass it to the Conll reader. By doing it that way, I am able to reproduce the bug, which confirms that this is how tempData is accessed without process() being invoked first.

I will now change the code to make it raise a more helpful exception in those circumstances.

On Wed, Jul 17, 2019 at 6:05 PM Alain Désilets alaindesilets0@gmail.com wrote:

I was going more for a public API-level test that ensures that the class will fail gracefully if you try to train it on an empty collection.

What you propose is a lower level test that checks that the class fails gracefully if collectionProcessComplete gets invoked before process().

I think a public API level test is more readable and its intent is clearer so I would like to stick with that approach.

Surely there must be an easier way to create a Dataset instance than going through all that DatasetFactory, yml files, etc... stuff? If not, maybe I should implement one?

Alain

On Wed, Jul 17, 2019 at 5:00 PM Richard Eckart de Castilho < notifications@github.com> wrote:

The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)) and then calls collectionProcessComplete() on it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4CCGIMXLASHNQGZ3STP76CADA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2GSMYA#issuecomment-512566880, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4F4MPXURGQM32LC643P76CADANCNFSM4IEEDZCA .