Lightning-Universe / lightning-flash

Your PyTorch AI Factory - Flash enables you to easily configure and run complex AI recipes for over 15 tasks across 7 data domains
https://lightning-flash.readthedocs.io
Apache License 2.0

Inconsistency in F1 metric between manual eval and Trainer.test() run #400

Closed: lillekemiker closed this issue 3 years ago

lillekemiker commented 3 years ago

🐛 Bug

When training a multi-label image classifier as described in the docs (https://lightning-flash.readthedocs.io/en/latest/reference/multi_label_classification.html) with the script below,

import os.path as osp
from typing import List, Tuple

import pandas as pd
from torchmetrics import F1

import flash
from flash.core.classification import Labels
from flash.core.data.utils import download_data
from flash.image import ImageClassificationData, ImageClassifier
from flash.image.classification.data import ImageClassificationPreprocess

# 1. Download the data
# This is a subset of the movie poster genre prediction data set from the paper
# “Movie Genre Classification based on Poster Images with Deep Neural Networks” by Wei-Ta Chu and Hung-Jui Guo.
# Please consider citing their paper if you use it. More here: https://www.cs.ccu.edu.tw/~wtchu/projects/MoviePoster/
download_data("https://pl-flash-data.s3.amazonaws.com/movie_posters.zip", "data/")

# 2. Load the data
genres = ["Action", "Romance", "Crime", "Thriller", "Adventure"]

def load_data(data: str, root: str = 'data/movie_posters') -> Tuple[List[str], List[List[int]]]:
    metadata = pd.read_csv(osp.join(root, data, "metadata.csv"))
    return ([osp.join(root, data, row['Id'] + ".jpg") for _, row in metadata.iterrows()],
            [[int(row[genre]) for genre in genres] for _, row in metadata.iterrows()])

train_files, train_targets = load_data('train')
test_files, test_targets = load_data('test')

datamodule = ImageClassificationData.from_files(
    train_files=train_files,
    train_targets=train_targets,
    test_files=test_files,
    test_targets=test_targets,
    val_split=0.1,  # Use 10% of the training data for validation.
    image_size=(128, 128),
)

# 3. Build the model
model = ImageClassifier(
    backbone="resnet18",
    num_classes=len(genres),
    multi_label=True,
    metrics=F1(num_classes=len(genres)),
)

# 4. Create the trainer. Train for 10 epochs.
trainer = flash.Trainer(max_epochs=10)

# 5. Train the model
trainer.finetune(model, datamodule=datamodule, strategy="freeze")

# 6. Predict what's on a few images!
# Serialize predictions as labels, low threshold to see more predictions.
model.serializer = Labels(genres, multi_label=True, threshold=0.25)

predictions = model.predict([
    "data/movie_posters/predict/tt0085318.jpg",
    "data/movie_posters/predict/tt0089461.jpg",
    "data/movie_posters/predict/tt0097179.jpg",
])

print(predictions)

# 7. Save it!
trainer.save_checkpoint("image_classification_multi_label_model.pt")

I get different F1 metrics for the test set depending on how I run the evaluation:

# Run test with trainer:

trainer.test(model, datamodule=datamodule)

# stdout:
# {'test_binary_cross_entropy_with_logits': 0.5449734330177307,
# 'test_f1': 0.46086955070495605}

# Run test manually:

import torch

from flash.core.data.data_source import DefaultDataKeys

metric = F1(num_classes=len(genres))

for batch in datamodule.test_dataloader():
    # DefaultDataKeys identifies the input and target entries in each batch dict
    image_tensor = batch[DefaultDataKeys.INPUT]
    target = batch[DefaultDataKeys.TARGET]
    with torch.no_grad():
        y_hat = model(image_tensor)
    prediction = model.to_metrics_format(y_hat)
    metric(prediction, target)

print(metric.compute())

# stdout:
# tensor(0.3891)

To Reproduce

Steps to reproduce the behavior:

  1. Copy paste the example training code from the link above
  2. Add the test evaluation code above
  3. Save and run the script
  4. Compare the two reported F1 values

Expected behavior

The two F1 metrics should be identical

Environment

Additional context

None

SkafteNicki commented 3 years ago

Hi @lillekemiker, I finally figured out what the issue is. The difference is that a default transform is applied to the batches from datamodule.test_dataloader() when calling the trainer.test method (in the case of classification it is a standard normalization: https://github.com/PyTorchLightning/lightning-flash/blob/da684414f09cac8a65d412814491124343c8b416/flash/image/classification/transforms.py#L60), but it is not applied when you call datamodule.test_dataloader() outside the trainer object.

@tchaton is this the expected behaviour?
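
For reference, a rough sketch of how the manual loop from the report could apply the same normalization by hand, assuming the default transform uses the standard ImageNet statistics referenced in the file linked above (the exact values and transform keys may differ between Flash versions). It reuses model, datamodule, and genres from the script in the report:

import torch
from torchmetrics import F1

from flash.core.data.data_source import DefaultDataKeys

# ImageNet mean/std, as assumed for the default normalization linked above.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

metric = F1(num_classes=len(genres))

for batch in datamodule.test_dataloader():
    # Apply the normalization the Trainer would otherwise inject on device.
    image_tensor = (batch[DefaultDataKeys.INPUT] - mean) / std
    target = batch[DefaultDataKeys.TARGET]
    with torch.no_grad():
        y_hat = model(image_tensor)
    metric(model.to_metrics_format(y_hat), target)

print(metric.compute())  # should now be much closer to the trainer.test() value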

lillekemiker commented 3 years ago

Thank you!! This has been driving me crazy. I don't know if it is expected behavior, but considering that transforms are defined at the datamodule level, I definitely expected them to be applied. How would I even turn them on outside the trainer?

ethanwharris commented 3 years ago

Hi @SkafteNicki @lillekemiker - this is definitely something we could do better. The challenge is that some of our transforms are applied in the dataloader and some are applied in the model, so at runtime we inject the transforms into the dataloader / model. There are a few options:

Interested to hear your thoughts :smiley:

lillekemiker commented 3 years ago

What is the argument against having the dataloader handle all transforms without magic runtime injections?

ethanwharris commented 3 years ago

@lillekemiker because we support transforms on device. So if the user provides per_sample_transform_on_device or per_batch_transform_on_device they have to be injected into the model rather than the dataloader.

lillekemiker commented 3 years ago

I can see the logic behind this design decision, but I don't think I agree with it. As an abstraction, the dataloader should handle transforms. Technically, the dataloader often runs in multiple worker threads per GPU, and because you don't want multiple threads using the same GPU, you moved the GPU transforms into the model's realm. Or at least I assume this is the reasoning; am I missing anything? As a design decision, I think it is more important to keep the dataloader/model contact surface clean. Couldn't GPU transforms be applied in the same thread as the model, before handing the batch off to the model rather than after? A GPU thread lock is another option, but that might come with a performance penalty.

As such, transforms could live in the model, too, and simply be part of the model. I don't see anything conceptually wrong with that.
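
To illustrate the alternative being suggested (purely a sketch, not Flash code): keep CPU-side transforms in the dataloader and run the device transforms in the same place the model runs, for example by wrapping the two together:

import torch
from torch import nn

class TransformThenModel(nn.Module):
    """Applies a per-batch device transform right before the wrapped model."""

    def __init__(self, device_transform: nn.Module, model: nn.Module):
        super().__init__()
        self.device_transform = device_transform
        self.model = model

    def forward(self, batch: torch.Tensor) -> torch.Tensor:
        # The batch is already on the GPU here, so the transform and the
        # forward pass run on the same device, in the same thread.
        return self.model(self.device_transform(batch))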

lillekemiker commented 3 years ago

One major issue with the current approach is that, depending on whether or not Kornia is installed, the tensor normalization may happen in per_batch_transform_on_device or in post_tensor_transform, and so may or may not have been applied in the manual case without a Trainer.
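
A quick way to check where the normalization ends up on a given install, assuming the default_transforms helper in the file linked above (its exact signature may differ between Flash versions):

from flash.image.classification.transforms import default_transforms

transforms = default_transforms(image_size=(128, 128))
print(transforms.keys())
# With Kornia installed the normalization is expected under
# 'per_batch_transform_on_device'; without it, under 'post_tensor_transform'.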

SkafteNicki commented 3 years ago

IMO, if this were bare-bones Lightning, I would expect a batch to look the same regardless of whether it is inside the Lightning Trainer or outside, because Lightning is "just reorganised PyTorch code". However, I am not completely sure whether this should also be the case in Flash; it depends entirely on the design philosophy of Flash. Since it sits at a higher level of abstraction than Lightning, I am fine with this not being supported. @ethanwharris is it possible to extract the data pipeline such that it would be possible to do something like

model(pipeline(batch))

lillekemiker commented 3 years ago

I may also be a bit unclear about Flash's design philosophy. Either way, though, if the dataloader is exposed to the end user (i.e. me), then I would expect it to either always apply all transforms or never apply any, and to be consistent in its behavior. If the dataloader is internal and I'm never supposed to see it, then I guess I shouldn't really care :)

As for the design philosophy, my personal vote would be for Flash to be highly modular and built in a way that makes it easy to dismantle and replace parts of it with custom code. That way it provides a quick and easy baseline model with very little coding, but more importantly, once you need to move beyond the baseline model, you don't have to start over and rewrite the whole thing in Lightning yourself; you just replace the parts that need replacing. If that is not what you are going for, though, that is also completely fair. Black-box, off-the-shelf deep learning solutions definitely have their place too.

ethanwharris commented 3 years ago

@SkafteNicki that could be possible. I think the issue with what we have now is that we inject the transforms into the correct places inside the trainer. What we could definitely do is expose a method like:

dataloader = datamodule.train_dataloader()
model, dataloader = datamodule.inject_transforms(model, dataloader)

If we documented that as the recommended way to use the flash datamodules without a lightning Trainer, I guess that would address some of the issues here? @tchaton Interested to hear your thoughts
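
For example, with such a (hypothetical) inject_transforms method, the manual evaluation from the original report might look like this:

metric = F1(num_classes=len(genres))

dataloader = datamodule.test_dataloader()
# Hypothetical API from the proposal above: injects the dataloader-side and
# model-side transforms just like the Trainer would.
model, dataloader = datamodule.inject_transforms(model, dataloader)

for batch in dataloader:
    target = batch[DefaultDataKeys.TARGET]
    with torch.no_grad():
        y_hat = model(batch[DefaultDataKeys.INPUT])
    metric(model.to_metrics_format(y_hat), target)

print(metric.compute())  # would then be expected to match trainer.test()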

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.