CLARIN-PL / embeddings

Embeddings: State-of-the-art Text Representations for Natural Language Processing tasks — an initial version of the library, focused on the Polish language
https://clarin-pl.github.io/embeddings/
MIT License

FlairCNNDocumentEmbeddings IndexError #136

Closed djaniak closed 2 years ago

djaniak commented 2 years ago

Error spotted in LockedDropout while performing the hyperparameter search on static embeddings.

Stack trace:

```python
Traceback (most recent call last):
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/optuna/study/_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "/home/djaniak/embeddings/embeddings/pipeline/hps_pipeline.py", line 123, in objective
    results = pipeline.run()
  File "/home/djaniak/embeddings/embeddings/pipeline/evaluation_pipeline.py", line 50, in run
    model_result = self.model.execute(loaded_data)
  File "/home/djaniak/embeddings/embeddings/model/flair_model.py", line 28, in execute
    return self.task.fit_predict(data, self.predict_subset)
  File "/home/djaniak/embeddings/embeddings/task/flair_task/flair_task.py", line 70, in fit_predict
    self.fit(data)
  File "/home/djaniak/embeddings/embeddings/task/flair_task/flair_task.py", line 39, in fit
    log: Dict[Any, Any] = self.trainer.train(
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/trainers/trainer.py", line 467, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/nn/model.py", line 489, in forward_loss
    scores, labels = self.forward_pass(sentences)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/models/text_classification_model.py", line 60, in forward_pass
    self.document_embeddings.embed(sentences)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/embeddings/document.py", line 912, in _add_embeddings_internal
    outputs = self.locked_dropout(outputs)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/djaniak/anaconda3/envs/clarinpl-embeddings/lib/python3.9/site-packages/flair/nn/dropout.py", line 22, in forward
    m = x.data.new(x.size(0), 1, x.size(2)).bernoulli_(1 - self.dropout_rate)
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)
```
Code to reproduce:

```python
import pprint
from pathlib import Path

import typer

from embeddings.data.data_loader import HuggingFaceDataLoader
from embeddings.data.dataset import HuggingFaceDataset
from embeddings.defaults import RESULTS_PATH
from embeddings.embedding.flair_embedding import FlairDocumentCNNEmbeddings
from embeddings.embedding.static.embedding import (
    AutoStaticWordEmbedding,
)
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from embeddings.evaluator.text_classification_evaluator import TextClassificationEvaluator
from embeddings.model.flair_model import FlairModel
from embeddings.pipeline.standard_pipeline import StandardPipeline
from embeddings.task.flair_task.text_classification import TextClassification
from embeddings.transformation.flair_transformation.classification_corpus_transformation import (
    ClassificationCorpusTransformation,
)

app = typer.Typer()


def run(
    embedding_name: str = typer.Option(
        "clarin-pl/fastText-kgr10", help="Hugging Face embedding model name or path."
    ),
    dataset_name: str = typer.Option(
        "clarin-pl/polemo2-official", help="Hugging Face dataset name or path."
    ),
    input_column_name: str = typer.Option(
        "text", help="Column name that contains text to classify."
    ),
    target_column_name: str = typer.Option(
        "target", help="Column name that contains label for classification."
    ),
    root: str = typer.Option(RESULTS_PATH.joinpath("document_classification")),
) -> None:
    typer.echo(pprint.pformat(locals()))

    output_path = Path(root, embedding_name, dataset_name)
    output_path.mkdir(parents=True, exist_ok=True)

    load_dataset_kwargs = {
        "train_domains": ["hotels", "medicine"],
        "dev_domains": ["hotels", "medicine"],
        "test_domains": ["hotels", "medicine"],
        "text_cfg": "text",
    }
    load_model_kwargs = {
        "word_dropout": 0.15000000000000002,
        "reproject_words": False,
        "locked_dropout": 0.1,
        "kernels": ((200, 4), (200, 5), (200, 6)),
        "dropout": 0.45,
    }
    task_train_kwargs = {
        "param_selection_mode": True,
        "max_epochs": 30,
        "mini_batch_size": 53,
        "learning_rate": 0.056229461940061544,
        "save_final_model": False,
    }

    dataset = HuggingFaceDataset(dataset_name, **load_dataset_kwargs if load_dataset_kwargs else {})
    data_loader = HuggingFaceDataLoader()
    transformation = ClassificationCorpusTransformation(input_column_name, target_column_name)
    config = KGR10FastTextConfig(dimension=100)
    word_embedding = AutoStaticWordEmbedding.from_config(config)
    embedding = FlairDocumentCNNEmbeddings(word_embedding, **load_model_kwargs)
    task = TextClassification(
        output_path, task_model_kwargs={}, task_train_kwargs=task_train_kwargs
    )
    model = FlairModel(embedding, task)
    evaluator = TextClassificationEvaluator()
    pipeline = StandardPipeline(dataset, data_loader, transformation, model, evaluator)
    result = pipeline.run()
    typer.echo(pprint.pformat(result))


typer.run(run)
```
djaniak commented 2 years ago

The problem is the implementation of LockedDropout in the flair library: the CNN head outputs a tensor of shape `batch_size x cnn_pool_size`, which is 2-dimensional, but the implementation of LockedDropout expects a 3-dimensional input.
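The mismatch can be reproduced outside the pipeline with a minimal sketch of the LockedDropout forward pass (simplified from the line shown in the traceback; the function name here is ours, not flair's):

```python
import torch


def locked_dropout(x: torch.Tensor, rate: float = 0.5) -> torch.Tensor:
    # Samples one mask per batch element of shape (batch, 1, hidden) and
    # broadcasts it over the sequence dimension -- hence x.size(2) assumes
    # a 3-D input, exactly as in flair/nn/dropout.py.
    mask = x.data.new(x.size(0), 1, x.size(2)).bernoulli_(1 - rate) / (1 - rate)
    return mask * x


rnn_out = torch.randn(8, 20, 128)  # (batch, seq_len, hidden): what RNN embeddings produce
cnn_out = torch.randn(8, 600)      # (batch, features): what the CNN document embedding produces

locked_dropout(rnn_out)            # works: mask broadcasts over seq_len
try:
    locked_dropout(cnn_out)        # fails: there is no dimension 2 on a 2-D tensor
except IndexError as e:
    print(e)  # Dimension out of range (expected to be in range of [-2, 1], but got 2)
```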

A temporary workaround is to not use LockedDropout with FlairDocumentCNNEmbeddings and wait for an upstream fix.
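In the reproduction script above, that workaround amounts to excluding `locked_dropout` from the hyperparameter space, e.g. by pinning it to `0.0` (assumption: flair only routes the output through a LockedDropout module when the rate is positive, so a zero rate never reaches the failing code path):

```python
# Workaround sketch: same kwargs as in the reproduction script, but with
# locked_dropout pinned to 0.0 so the 2-D CNN output never passes through
# LockedDropout.
load_model_kwargs = {
    "word_dropout": 0.15,
    "reproject_words": False,
    "locked_dropout": 0.0,  # assumption: rate 0.0 disables the module entirely
    "kernels": ((200, 4), (200, 5), (200, 6)),
    "dropout": 0.45,
}
```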

djaniak commented 2 years ago
flair/embeddings/document ![image](https://user-images.githubusercontent.com/26749468/147564568-c1752f26-e91f-48bb-bea6-1b6cefe84f76.png)
LockedDropout ![image](https://user-images.githubusercontent.com/26749468/147564190-f5524123-4225-41a4-8508-ef3bf89d9772.png)
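For reference, one possible shape-aware fix would build the mask from the input's actual dimensionality instead of hard-coding three dimensions. This is only a sketch of the idea, not flair's actual patch:

```python
import torch


def locked_dropout_fixed(x: torch.Tensor, rate: float = 0.5,
                         training: bool = True) -> torch.Tensor:
    """Sketch of a shape-aware LockedDropout (hypothetical, not the upstream fix)."""
    if not training or rate == 0.0:
        return x
    if x.dim() == 2:
        # CNN document embeddings: (batch, features) -- plain per-feature mask.
        mask_shape = (x.size(0), x.size(1))
    else:
        # RNN outputs: (batch, seq_len, hidden) -- one mask shared across timesteps.
        mask_shape = (x.size(0), 1, x.size(2))
    mask = x.new_empty(mask_shape).bernoulli_(1 - rate) / (1 - rate)
    return mask * x


print(locked_dropout_fixed(torch.randn(4, 600)).shape)     # torch.Size([4, 600])
print(locked_dropout_fixed(torch.randn(4, 20, 128)).shape) # torch.Size([4, 20, 128])
```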