Closed alaindesilets closed 5 years ago
I'd say this can only happen if process() was never called - i.e. if the StanfordNamedEntityRecognizerTrainer is used in a pipeline which processes zero documents. It doesn't make sense to train a model on zero documents. What do you think should be the expected behavior in such a situation? Do you think the pipeline should still complete without an error?
On Tue, Jul 16, 2019 at 3:40 PM Richard Eckart de Castilho < notifications@github.com> wrote:
I'd say this can only happen if process() was never called - i.e. if the StanfordNamedEntityRecognizerTrainer is used in a pipeline which processes zero documents. It doesn't make sense to train a model on zero documents. What do you think should be the expected behavior in such a situation? Do you think the pipeline should still complete without an error?
Hum... that might be what is happening. I'll take a look.
If that is the case, then I would expect the trainer to raise an exception saying that no training documents were processed. The exception raised is just too unhelpful for trouble shooting the problem.
Alternatively, the pipeline could exit gracefully and just not produce a model at all.
Alain
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4DVZNPHD7MWTXPX2STP7YP3VA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2B5RBA#issuecomment-511957124, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4AJERH4GHTUETVICZ3P7YP3VANCNFSM4IEEDZCA .
I slightly tend towards checking if tempData==null
in collectionProcessComplete()
(meaning that process()
was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.
Makes sense. But first let me make sure that the null is indeed caused by the fact that no training documents were found. I'll get back to you on that.
On Wed, Jul 17, 2019 at 5:33 AM Richard Eckart de Castilho < notifications@github.com> wrote:
I slightly tend towards checking if tempData==null in collectionProcessComplete() (meaning that process() was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4BXQFBWG2O2LGGRXXLP73RO3A5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2DTZ2Y#issuecomment-512179435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4ELSVZEUGPUCWQ3RR3P73RO3ANCNFSM4IEEDZCA .
I am trying to write a unit test that invokes the StanfordNamedEntityRecognizerTrainer on an empty collection of training documents. But I can't figure out how to create an empty dataset.
In StanfordNamedEntityRecognizerTrainerTest, datasets are created by invoking
DatasetFactory loader = new DatasetFactory(DkproTestContext. getCacheFolder());
ds = loader.load("germeval2014-de");
I tried these, but they raise exceptions:
ds = new Dataset(); // Cannot instantiate Dataset
ds = loader.load(null); // Unknown dataset null
ds = loader.load(""); // Unknown dataset []
ds = loader.load(); // load requires an argument
I looked in dkpro-core-api-datasets-asl to see if I could add a new empty dataset to the factory, but it looks pretty complicated to do.
Any advice on how I might achieve that?
Thx.
On Wed, Jul 17, 2019 at 6:55 AM Alain Désilets alaindesilets0@gmail.com wrote:
Makes sense. But first let me make sure that the null is indeed caused by the fact that no training documents were found. I'll get back to you on that.
On Wed, Jul 17, 2019 at 5:33 AM Richard Eckart de Castilho < notifications@github.com> wrote:
I slightly tend towards checking if tempData==null in collectionProcessComplete() (meaning that process() was never called) and throwing an exception with a more meaningful message like "Unable to train a model, no documents have been processed". The reason for my preference is that the user would probably expect the model file to be present once the pipeline is complete - and if it is not there, the user-code might fail. Then the user would go around to search for the problem. We could save that time throwing the exception and thus advising the user of the issue.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4BXQFBWG2O2LGGRXXLP73RO3A5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2DTZ2Y#issuecomment-512179435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4ELSVZEUGPUCWQ3RR3P73RO3ANCNFSM4IEEDZCA .
The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)
) and then calls collectionProcessComplete()
on it.
I was going more for a public API-level test that ensures that the class will fail gracefully if you try to train it on an empty collection.
What you propose is a lower level test that checks that the class fails gracefully if collectionProcessComplete gets invoked before process().
I think a public API level test is more readable and its intent is clearer so I would like to stick with that approach.
Surely there must be an easier way to create a Dataset instance than going through all that DatasetFactory, yml files, etc... stuff? If not, maybe I should implement one?
Alain
On Wed, Jul 17, 2019 at 5:00 PM Richard Eckart de Castilho < notifications@github.com> wrote:
The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)) and then calls collectionProcessComplete() on it.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4CCGIMXLASHNQGZ3STP76CADA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2GSMYA#issuecomment-512566880, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4F4MPXURGQM32LC643P76CADANCNFSM4IEEDZCA .
OK, I found a way to do this. I create a temporary empty folder and then pass it to the Conll reader. By doing it that way, I am able to reproduce the bug, which confirms that this is how tempData is accessed without process() being invoked first.
I will now change the code to make it raise a more helpful exception in those circumstances.
On Wed, Jul 17, 2019 at 6:05 PM Alain Désilets alaindesilets0@gmail.com wrote:
I was going more for a public API-level test that ensures that the class will fail gracefully if you try to train it on an empty collection.
What you propose is a lower level test that checks that the class fails gracefully if collectionProcessComplete gets invoked before process().
I think a public API level test is more readable and its intent is clearer so I would like to stick with that approach.
Surely there must be an easier way to create a Dataset instance than going through all that DatasetFactory, yml files, etc... stuff? If not, maybe I should implement one?
Alain
On Wed, Jul 17, 2019 at 5:00 PM Richard Eckart de Castilho < notifications@github.com> wrote:
The most straight-foward way would be to set up a unit test which instantiates the component (e.g. using uimaFIT's createEngine(...)) and then calls collectionProcessComplete() on it.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dkpro/dkpro-core/issues/1395?email_source=notifications&email_token=AAIMA4CCGIMXLASHNQGZ3STP76CADA5CNFSM4IEEDZCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2GSMYA#issuecomment-512566880, or mute the thread https://github.com/notifications/unsubscribe-auth/AAIMA4F4MPXURGQM32LC643P76CADANCNFSM4IEEDZCA .
I have a pipeline that trains Stanford NER. When I run it, I get the following stack trace:
The null pointer is attribute
tempData
, which is initialized inprocess()
(at the same time as attributeout
). But judging from the stacktrace, it seems thattempData
is invoked beforeprocess()
.A solution would be to have getters
getOut()
andgetTempData()
which would returnout
andtempData
respectively, after making sure that they have been initialized if they are null at the moment where the getter is invoked.I have implemented this approach and it seems to be working (unit tests for stanford still running)