DeepTrackAI / DeepTrack2

Bm/migrate to torch #199

Closed BenjaminMidtvedt closed 6 months ago

BenjaminMidtvedt commented 9 months ago

This PR contains the changes for the next major release. The focus of this release is PyTorch / Deeplay integration, performance optimization, and a more rigorous way of utilizing static data from disk.

The breaking changes

Image

Pipelines no longer return Image objects by default. We made this change mainly to improve performance, but also because it was prohibitively difficult to keep Image compatible with other libraries.

If you do not call .get_properties() or access the .properties attribute of a pipeline's output, this change will not affect you.

If you do need Image, you can call pipeline.store_properties(). This will make the pipeline act just like in the prior release!
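
For example (a minimal sketch; the pipeline itself is hypothetical):

import deeptrack as dt

pipeline = dt.LoadImage("image.png") >> dt.NormalizeMinMax()
pipeline.store_properties()          # opt back in to Image outputs

image = pipeline()                   # now an Image, not a bare array
properties = image.get_properties()  # property access works as before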

Tensorflow

Tensorflow has been removed as a dependency. Maintaining a tensorflow dependency is no longer feasible, since recent tensorflow releases do not support current Python versions on Windows at all. Moreover, due to layout changes in tensorflow, uninstalling or changing tensorflow versions can leave the package system in an unrepairable state. As such, I suggest we leave tensorflow installation to the user, which will rightly direct the inevitable complaints to tensorflow instead of us.

Moreover, we will not actively support any tensorflow version newer than 2.10.

Instead of tensorflow, deeplay!

Moving forward, we will use the torch-based library deeplay for our models. We are still in the process of making the transition, so some functionality is still missing. However, in the long run, we expect deeplay to provide a much more flexible and powerful base on which to construct neural networks.

What's new?

Global changes

Many submodules are now lazy loaded, meaning they are only initialized when actually needed; this applies in particular to the modules containing tensorflow code. The main benefits are faster import times, and that heavy optional dependencies like tensorflow are only required if you actually use the functionality that depends on them.
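
As a rough illustration of the mechanism (a minimal sketch using PEP 562 module-level __getattr__; deeptrack's actual implementation may differ):

import importlib

# In a package's __init__.py: delay importing heavy submodules until first access.
_LAZY_SUBMODULES = {"models", "pytorch"}

def __getattr__(name):
    if name in _LAZY_SUBMODULES:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")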

Sources

Sources are a new way to operate on static datasets. They aim to solve a few common problems. Consider the following common pipeline:

import glob
import itertools

import deeptrack as dt

train_paths = glob.glob("train/*")
test_paths = glob.glob("test/*")

train_image = dt.LoadImage(itertools.cycle(train_paths)) >> dt.NormalizeMinMax()
test_image = dt.LoadImage(itertools.cycle(test_paths)) >> dt.NormalizeMinMax()

# Reuse each loaded image four times, applying random flips each time.
augmented_train_image = dt.Reuse(train_image, 4) >> dt.FlipLR() >> dt.FlipUD()

Though very simple, this approach (which is the recommended one) is actually very limited: the pipeline carries hidden state in the itertools.cycle iterators, there is no way to access a specific sample by index, and train and test need near-duplicate pipeline definitions.

All to say, the current approach is not optimized for static datasets.

Introducing Sources

Sources are, in brief, a way to separate the variables of a pipeline from the definition of the pipeline. The aim is to make the pipeline (as far as possible) functional, that is, dependent only on the direct input to the pipeline: pipeline(source).

As an introduction, here is the above pipeline using the new syntax:

train_paths = dt.sources.ImageFolder(root="train")
test_paths = dt.sources.ImageFolder(root="test")

# product expands each path into one source per combination of the given values.
train_sources = train_paths.product(flip_lr=[False, True], flip_ud=[False, True])
test_sources = test_paths.product(flip_lr=[False], flip_ud=[False])
sources = dt.sources.Sources(train_sources, test_sources)

# The pipeline reads its variables from the source instead of holding its own state.
pipeline = dt.LoadImage(sources.path) >> dt.FlipLR(sources.flip_lr) >> dt.FlipUD(sources.flip_ud)

To evaluate the pipeline, we now simply do one of:

x = pipeline(train_sources[320])    # evaluate a single source by index
# or
for source in test_sources[20:40]:  # iterate over a slice of sources
    image = pipeline(source)

If we want to iterate over paths rather than augmentations, we can do:

for source in train_paths:
    image = pipeline(source)

It should be clear that this solves all the issues from the current implementation. Moreover, this separation of logic allows for far more complexity, since we can define interesting ways of operating on sources that would not be possible on Features directly.

There are a few more points I'll mention briefly. I've included a sources.NumpyRNG implementation, a source that can be used for seeded pipeline evaluation; each source index has a unique seed.
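
The idea behind per-index seeding, sketched directly with numpy (the actual sources.NumpyRNG API may differ):

import numpy as np

def rng_for_index(index, base_seed=0):
    # Derive a unique but reproducible generator for each source index.
    return np.random.default_rng(np.random.SeedSequence([base_seed, index]))

noise = rng_for_index(320).normal(size=(64, 64))  # identical every time index 320 is evaluated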

Finally, there is a known bug: when using non-deterministic features such as Gaussian, the pipeline will produce the same noise pattern for every source unless you call .update() between evaluations. I have not yet decided how to solve this.

Pytorch integration

All pytorch code is currently in the lazy-loaded pytorch submodule. In the future, we might import some of it into the global namespace. Currently, we have two classes in pytorch:

pytorch.Dataset

Subclass of torch.utils.data.Dataset, which takes a deeptrack pipeline and creates a dataset compatible with standard DataLoaders. You either need to specify a length for the dataset or, preferably, a Source. Continuing the example from above:

train_dataset = dt.pytorch.Dataset(pipeline, train_sources)
test_dataset = dt.pytorch.Dataset(pipeline, test_sources)
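
The datasets can then be fed to standard loaders (a minimal sketch; the batch size is arbitrary, and the pipeline output is assumed to have been converted to tensors, see ToTensor below):

import torch

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8)

for batch in train_loader:
    ...  # standard training step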

pytorch.ToTensor

A convenience Feature that can (and should) be appended to the end of pipelines to convert the output to pytorch tensors. It also supports setting the dtype, which is important since the numpy default is float64 while pytorch (usually) expects float32.
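
For example (a sketch; the name of the dtype argument is an assumption based on the description above):

import torch
import deeptrack as dt

# Hypothetical keyword: ToTensor is documented to support setting the dtype.
tensor_pipeline = pipeline >> dt.pytorch.ToTensor(dtype=torch.float32)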

deeplay

The deeplay library, if installed, can be accessed as deeptrack.deeplay.

BenjaminMidtvedt commented 9 months ago

Should close #172

giovannivolpe commented 9 months ago

Nice work!

Small comments:

  1. I like the sources. I agree that the noise behavior should be updated automatically.
  2. ToTensor is also good. Also good that it automatically converts to float32 (it should be the default)

About the speed enhancement, does it mean that the .get_property() method cannot be used internally, or in general? Because we are relying on it in some of the examples. If the plan is to remove it, we should avoid it completely. We can discuss the details in person.

BenjaminMidtvedt commented 9 months ago

> Nice work!
>
> Small comments:
>
> 1. I like the sources. I agree that the noise behavior should be updated automatically.
> 2. ToTensor is also good. Also good that it automatically converts to float32 (it should be the default)
>
> About the speed enhancement, does it mean that the .get_property() method cannot be used internally, or in general? Because we are relying on it in some of the examples. If the plan is to remove it, we should avoid it completely. We can discuss the details in person.

You're right. torch has an API, torch.get_default_dtype(), that returns the dtype expected by default. We can use that as the default out of ToTensor.

In general, sadly. Ideally, I would remove it entirely, but that might be too much of a breaking change.

giovannivolpe commented 9 months ago

@BenjaminMidtvedt I get this warning:

DeepTrack-2.0/deeptrack/scatterers.py:100: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if upsample is not 1:  # noqa: F632

I guess "# noqa: F632" should be removed from line 100 of scatterers.py?

BenjaminMidtvedt commented 9 months ago

@giovannivolpe It's a frustrating warning, because it is actually important that it is an "is not" instead of a "!=". If upsample is an array, "!=" results in another array instead of True or False.
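
For example, with numpy (upsample here stands in for an array-valued parameter):

import numpy as np

upsample = np.array([1, 2, 4])

print(upsample != 1)      # [False  True  True] -- an array, not a single bool
print(upsample is not 1)  # True -- a plain bool, which an if-statement needs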

BenjaminMidtvedt commented 9 months ago

The noqa comment is there to silence the linter about this same issue.