In this PR, I've added a separate inference script and dataset/task-specific post-processing functions.
New
An eval.py script is implemented; it can be used for inference after fine-tuning a model.
Inference configurations are placed under the generation_conf folder. The location of the fine-tuned model must be specified inside the conf file being used. In the future, additional items such as beam size can be added to the generation confs.
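A generation conf might look roughly like the following (field names here are illustrative, not the actual schema used in this PR):

```yaml
# Hypothetical generation conf sketch; real field names may differ.
model_path: /path/to/finetuned/model   # location of the fine-tuned model (required)
task: sts                              # selects the task-specific post-processing

# Possible future additions:
# beam_size: 4
# max_new_tokens: 128
```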
Added scikit-learn to requirements.txt to compute metrics such as accuracy.
Added post-processing functions and evaluation metrics for STS and NLI.
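As a rough illustration of what task-specific post-processing looks like (function names and label sets here are hypothetical, not the exact ones in this PR): NLI outputs are generated label strings that must be mapped back to canonical labels before computing accuracy with scikit-learn, while STS outputs are similarity scores parsed from text and compared with Pearson correlation.

```python
# Hypothetical sketch of dataset/task-specific post-processing and metrics.
from sklearn.metrics import accuracy_score
from scipy.stats import pearsonr


def postprocess_nli(texts):
    """Map generated label strings to canonical NLI label ids (-1 = unparseable)."""
    label_map = {"entailment": 0, "neutral": 1, "contradiction": 2}
    return [label_map.get(t.strip().lower(), -1) for t in texts]


def compute_nli_metrics(predictions, references):
    """Accuracy over post-processed NLI labels."""
    preds = postprocess_nli(predictions)
    refs = postprocess_nli(references)
    return {"accuracy": accuracy_score(refs, preds)}


def compute_sts_metrics(predictions, references):
    """Pearson correlation over similarity scores parsed from generated text."""
    preds = [float(p) for p in predictions]
    refs = [float(r) for r in references]
    return {"pearson": pearsonr(preds, refs)[0]}
```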
Changes
Added saving of the best model at the end of fine-tuning; previously only the last 3 epoch checkpoints were kept, not the best model.
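The idea amounts to tracking which epoch had the best validation score and saving that checkpoint at the end, rather than relying on the rolling window of the last 3 epochs. A simplified sketch of that tracking logic (not the actual code in finetune.py):

```python
# Simplified sketch of "save best model at the end" tracking logic.
def best_epoch(epoch_scores, lower_is_better=True):
    """Return the index of the epoch with the best validation score."""
    best_idx, best_score = None, None
    for idx, score in enumerate(epoch_scores):
        improved = best_score is None or (
            score < best_score if lower_is_better else score > best_score
        )
        if improved:
            best_idx, best_score = idx, score
    return best_idx

# At the end of training, the checkpoint from best_epoch(...) is saved
# as the final model, so it survives even if it falls outside the
# last-3-epochs checkpoint window.
```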
Removed compute_metrics from the finetune.py script to avoid duplication; the Evaluator's compute_metrics function is used instead.
Tested with 10 train-val-test samples for STS, NLI and summarization.