This PR adds a retrieval-based supervised summarizer implemented with a LightGBM ranker. `sadedegel.dataset.annotated` supplies (sentence, relevance) pairs to train the ranker. Evaluation uses leave-one-out cross-validation because of the small number of documents (~100). Hyperparameter optimization with Optuna is also implemented for a user-specified summarization length or choice of embedding type.
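The leave-one-out evaluation above can be sketched in pure Python. This is an illustrative stand-in, not the PR's actual code: `leave_one_out`, `train_fn`, and `score_fn` are hypothetical names, and the toy training/scoring functions replace the real LGBMRanker training and relevance scoring.

```python
def leave_one_out(documents, train_fn, score_fn):
    """Train on all documents but one, evaluate on the held-out one."""
    scores = []
    for i, held_out in enumerate(documents):
        train_docs = documents[:i] + documents[i + 1:]  # drop fold i
        model = train_fn(train_docs)
        scores.append(score_fn(model, held_out))
    return sum(scores) / len(scores)

# Toy usage: "training" just counts documents, "scoring" returns that count.
docs = [["s1", "s2"], ["s3"], ["s4", "s5", "s6"]]
mean_score = leave_one_out(
    docs,
    train_fn=lambda train: len(train),         # dummy "model"
    score_fn=lambda model, doc: float(model),  # dummy score
)
```

With ~100 annotated documents this means ~100 training runs per evaluation, which is affordable for a LightGBM ranker on sentence-level features.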
**`test_supervised.py`**
- Implement a test for initializing the ranker with lazy loading of the appropriate model.
- Test re-loading of the model when the embedding type is switched.
- Test summary output with a specified sentence length.
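The lazy-loading behavior under test can be illustrated with a minimal stand-in class. `LazyRanker` and `set_vector_type` are hypothetical names for this sketch; the real `SupervisedSentenceRanker` deserializes a joblib model, while here a load counter replaces actual deserialization.

```python
class LazyRanker:
    """Minimal stand-in for the lazy-loading pattern being tested."""
    load_count = 0  # class-level counter standing in for joblib.load calls

    def __init__(self, vector_type="bert_128k_cased"):
        self.vector_type = vector_type
        self._model = None  # nothing loaded until first use

    @property
    def model(self):
        if self._model is None:
            LazyRanker.load_count += 1  # pretend joblib.load happens here
            self._model = f"model[{self.vector_type}]"
        return self._model

    def set_vector_type(self, vector_type):
        if vector_type != self.vector_type:
            self.vector_type = vector_type
            self._model = None  # force a reload on next access

ranker = LazyRanker()
_ = ranker.model               # first access triggers the (mock) load
_ = ranker.model               # cached: no second load
ranker.set_vector_type("tfidf")  # switching embedding invalidates cache
_ = ranker.model               # reload for the new embedding type
```

The tests then assert on load counts and on which model ends up attached after the embedding switch.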
**`supervised.py`**
- Implement the `SupervisedSentenceRanker` class as a child of `ExtractiveSummarizer`.
- The embedding generation phase converts string input into the doc-sentence representation required by `LGBMRanker`. Decouple embedding generation for transformer-based and BoW-based representations from the `predict` method.
- Implement a tuner class, `RankerOptimizer`, for users who need a ranker optimized for a given `summarization_percentage` or a different embedding (`vector_type`). It inherits from `SupervisedSentenceRanker` to reuse its embedding extraction methods.
- `_prepare_dataset` uses the extraction methods to build the dataset in the format required by `LGBMRanker`.
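The dataset format `LGBMRanker` requires is a flat feature matrix plus a `group` array giving the number of consecutive rows per query (here, per document). A hedged sketch of that preparation step, with `prepare_ranking_dataset` as a hypothetical name and toy 2-dimensional "embeddings" in place of real sentence vectors:

```python
def prepare_ranking_dataset(docs):
    """Flatten per-document (embedding, relevance) pairs into X, y, group.

    `docs` is a list of documents; each document is a list of
    (sentence_embedding, relevance_score) pairs.
    """
    X, y, group = [], [], []
    for doc in docs:
        for embedding, relevance in doc:
            X.append(embedding)
            y.append(relevance)
        group.append(len(doc))  # how many rows belong to this document
    return X, y, group

# Toy input: two documents with 2 and 3 sentences respectively.
docs = [
    [([0.1, 0.2], 1), ([0.3, 0.1], 0)],
    [([0.5, 0.5], 2), ([0.0, 0.4], 0), ([0.2, 0.2], 1)],
]
X, y, group = prepare_ranking_dataset(docs)
# With lightgbm installed, training would then look like:
#   lightgbm.LGBMRanker().fit(X, y, group=group)
```

The `group` array is what lets the ranker learn orderings within each document rather than across the whole corpus.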
**`util/supervised_tuning.py`**
- Implement components for optimizing the ranker.
- Implement logging and parsing of the best trial's parameters.
- Implement the objective function for Optuna with sampling of the parameter space.
- Implement a callback for live status updates instead of Optuna's verbose output.
- Implement fitting and saving of the model with the best hyperparameters.
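The tuning loop follows a common objective/sample/best-trial pattern. The PR uses Optuna; the sketch below is a schematic stand-in using stdlib random search so it is self-contained. All names (`sample_params`, `objective`, `tune`) and the parameter space (`learning_rate`, `num_leaves`, `n_estimators` are typical LightGBM knobs) are illustrative, not the PR's actual search space, and the objective is a toy function in place of training an `LGBMRanker` and scoring it by cross-validation.

```python
import random

def sample_params(rng):
    # Illustrative LightGBM-style parameter space.
    return {
        "learning_rate": 10 ** rng.uniform(-3, -1),
        "num_leaves": rng.randrange(16, 128),
        "n_estimators": rng.randrange(50, 500),
    }

def objective(params):
    # Stand-in for training a ranker and returning a CV score;
    # a deterministic toy function keeps the sketch runnable.
    return -abs(params["learning_rate"] - 0.05) - params["num_leaves"] / 1e4

def tune(n_trials=20, seed=42):
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for trial in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
            # Analogous to the live-status callback: report only improvements.
            print(f"trial {trial}: new best score {score:.4f}")
    return best_params, best_score

best_params, best_score = tune()
```

In the real module, the final step refits a ranker with `best_params` and serializes it, rather than just returning the dictionary.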
**`README.md`**
- Update with usage of all summarizers.
- Add usage of the supervised sentence ranker and tuner.
- Add scores for the ranker.
**`model/ranker_bert_128k_cased.joblib`**
- Add the default model for the ranker.
- Custom rankers trained by the user via `RankerOptimizer` are serialized to `~/.sadedegel_data/models`.
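Resolving that serialization directory can be sketched as follows. The helper name `ranker_model_path` and the file-naming convention are assumptions for illustration; only the `~/.sadedegel_data/models` location comes from the PR description.

```python
import os

def ranker_model_path(vector_type="bert_128k_cased"):
    """Build the expected path of a serialized ranker model (illustrative)."""
    base = os.path.join(os.path.expanduser("~"), ".sadedegel_data", "models")
    return os.path.join(base, f"ranker_{vector_type}.joblib")

path = ranker_model_path()
# Loading would then be:  model = joblib.load(path)  (requires joblib)
```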