Faster precision recall

emanjavacas / pie

A fully-fledge PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.

MIT License

22 stars 10 forks source link

Faster precision recall #32

Closed PonteIneptique closed 5 years ago

PonteIneptique commented 5 years ago

This one possible improvement for #29.

Basically, in sklearn.metrics.classifications, both precision_score and recall_score are using precision_recall_fscore_support. This mean the function was used twice for the same result in

https://github.com/emanjavacas/pie/blob/9fa672c04659b2c2219cb704027f48ba703c28b7/pie/models/scorer.py#L37-L38

PS: I added a change to the cli for evaluation because it was boring me to death to have to deal with this copy past everytime. Now pie eval loads path from --settings when none others are provided.

PonteIneptique commented 5 years ago

This will not completely fix #29, but is one good step.

PonteIneptique commented 5 years ago

I just removed a secondary loop as well. In the case of the Latin corpora, this would re-iterate over 200k tokens :)

PonteIneptique commented 5 years ago

I just spotted another places for improvement...

PonteIneptique commented 5 years ago

So, Scorer().get_scores() can be called in the two places: in train which is not important because it is a the end of training and in Scorer.print_summary.

https://github.com/emanjavacas/pie/blob/9fa672c04659b2c2219cb704027f48ba703c28b7/pie/models/scorer.py#L262

Scorer.print_summary itself is called in trainer.run_check() https://github.com/emanjavacas/pie/blob/9fa672c04659b2c2219cb704027f48ba703c28b7/pie/trainer.py#L267

But in the same function, 10 lines later, we also got get_scores() called https://github.com/emanjavacas/pie/blob/9fa672c04659b2c2219cb704027f48ba703c28b7/pie/trainer.py#L272

This mean get_score is actually called twice per Trainer.run_check() per task.

PonteIneptique commented 5 years ago

@emanjavacas We could use a multiprocess.Pool here as well, such as

pool = multiprocessing.Pool(int(len(self.tasks) / 2))
 # Dict are now ordered, so we should not worry about it
stored_scores = zip(summary.keys(), pool.map(lambda x: x.get_scores(), summary.values()))

But it can be another discussion.

emanjavacas commented 5 years ago

Is this ready to merge?

PonteIneptique commented 5 years ago

This is, multiprocessing will come later :)

emanjavacas commented 5 years ago

Alright. Thanks!