Pytorch training

ahmed-alhindawi commented 4 years ago

Good evening all,

Welcome to amateur rebasing hour. In an effort to keep the pytorch_training branch up to date with the recent pull requests, I've made several rebases/commits/pushes, and I have no idea how they've stacked up so high. The end result is that they all work (I think by chance!).

Either way, this branch does several things:

Generate left/right eye patches using our new pipeline (new face detector and new eye patch extraction) into the inpainted folder of the rt_gene dataset (called left_new and right_new).
Provide a way to generate an h5 file, optionally heavily augmented, that contains the entire dataset with labels in a training-friendly format. This is akin to prepare_dataset.m. I've left the head_pose label as the one generated during data collection rather than the one from the pipeline; the difference per subject is ~5 degrees.
Provide several models that are in line with the rt_gene model, with Resnet18, Resnet50, VGG16, MobilenetV2, Shufflenet, MNAS, and ResneXt-50 backends. I'm currently training them and will upload their accuracies and how they compare to the pre-trained tensorflow model.
Provide training code that is optimised for rt_gene training: a batch size suitable for a Tesla V100 and a learning rate based on a learning_rate_finder (in pytorch.utils if interested). This uses pytorch_lightning to standardise the training loop, which unfortunately is now a dependency if you wish to train a pytorch model.
For inference, this branch supplies two pathways, one for tensorflow and one for pytorch, in a way that doesn't contaminate the global namespace; i.e. it doesn't load both of them.
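
As a rough illustration of the last point, loading only the requested framework can be done by deferring the import until the backend is chosen. This is a minimal sketch, not the code in this branch: the function name `load_backend` and the module paths in `BACKEND_MODULES` are hypothetical placeholders.

```python
import importlib

# Hypothetical mapping from backend name to the module that implements it.
# These paths are placeholders and do not reflect the actual rt_gene layout.
BACKEND_MODULES = {
    "tensorflow": "my_project.estimate_gaze_tensorflow",
    "pytorch": "my_project.estimate_gaze_pytorch",
}


def load_backend(name, modules=BACKEND_MODULES):
    """Import and return only the module for the requested backend."""
    try:
        module_path = modules[name]
    except KeyError:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {sorted(modules)}"
        ) from None
    # The import runs here, at call time, so the module of the other
    # backend (and the framework it depends on) is never imported.
    return importlib.import_module(module_path)
```

Because nothing is imported at module level, a user who only asks for the "pytorch" entry never triggers the tensorflow import, and vice versa.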

I have tested this code and it runs both the tensorflow and the pytorch paths correctly. Let me know what you think :)

Fix #46

ahmed-alhindawi commented 4 years ago

Some notes on model inference across different backends. This is purely model inference, i.e. running the same patches/headpose that are already on the GPU over 5000 instances and averaging the inference frequency. Memory usage is from nvidia-smi. The test can be found in gaze_estimation_models_pytorch.py.

Backend	Frequency	Memory usage (MiB)
Resnet-50	70Hz	1591
ResneXt-50	45Hz	1651
Resnet-18	160Hz	1219
VGG-16	175Hz	1857
MobilenetV2	75Hz	1259
MNAS	80Hz	1233
Shufflenet	60Hz	1119

It seems that VGG-16 and Resnet-18 are roughly equal in terms of inference time, but Resnet-18 has lower memory usage. I will update on accuracy following some more training.
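
For readers who want to reproduce this kind of measurement, the timing loop might look roughly like the sketch below. It is not the test in gaze_estimation_models_pytorch.py; the function name `measure_frequency`, the warm-up runs, and the `sync` hook are assumptions added for illustration.

```python
import time


def measure_frequency(infer, n_iters=5000, warmup=10, sync=None):
    """Return the average number of `infer` calls per second.

    `infer` is any zero-argument callable that runs one forward pass.
    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished (an assumption; needed for asynchronous GPUs).
    """
    # Warm-up runs (an addition of this sketch) keep one-off setup costs
    # out of the timed section.
    for _ in range(warmup):
        infer()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        infer()
    if sync is not None:
        sync()  # wait for pending work before stopping the clock
    elapsed = time.perf_counter() - start
    return n_iters / elapsed if elapsed > 0 else float("inf")
```

With PyTorch on a GPU, one would typically pass `sync=torch.cuda.synchronize` and run the forward pass under `torch.no_grad()`, since CUDA kernels are launched asynchronously and the loop would otherwise only time the launches.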

Tobias-Fischer commented 4 years ago

Some more remarks:

"Generate Left/Eye right patches using our new pipeline with new face detector/new eye patches into the inpainted folder of the rt_gene dataset (called left_new and right_new)" -> I think this again needs documentation; are these patches then used for training RT-GENE?

Other questions:

Is the inference now faster when running the gaze estimation and blink estimation at the same time?
In the future, do you plan a PyTorch version of RT-BENE?
Can the training be run on a "normal" GPU? What is the minimum requirement? Would it train on something like a 1070?
When using the PyTorch backend, can we get rid of the tensorflow dependency?

ahmed-alhindawi commented 4 years ago

> Some more remarks:
>
> "Generate Left/Eye right patches using our new pipeline with new face detector/new eye patches into the inpainted folder of the rt_gene dataset (called left_new and right_new)" -> I think this again needs documentation; are these patches then used for training RT-GENE?

Yes, they are. There is a possibility of merging the H5 dataset generation with this, but it would be convoluted and not very modular. GenerateEyePatchesDataset.py uses the new face detector and landmark extractor to extract the eye patches into left_new and right_new per inpainted subject. Those two folders per subject are then used alongside label_combined.txt to generate the H5 dataset using GenerateRTGeneH5Dataset.py.

The reason I did this is that we lose some data with the new patch extraction technique, around 0.5 - 1% of the data; i.e. the new pipeline doesn't think there is an eye patch there but the old pipeline did. I wanted to give the user/trainer the option of using the new pipeline dataset, which has fewer samples, or the older dataset, which has more. I've documented the stages required to get the training underway in the README.md.
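
To check how many samples the new extraction drops for a given subject, one could compare the file names in the old and new patch folders. The sketch below assumes, purely for illustration, that old and new patches share file names and are stored as PNG files; `missing_patches` and its parameters are hypothetical and not part of the branch.

```python
from pathlib import Path


def missing_patches(old_dir, new_dir, pattern="*.png"):
    """Return (names missing from new_dir, fraction of old samples lost).

    Assumes a patch keeps the same file name in both folders, which may
    not hold for the real dataset layout.
    """
    old_names = {p.name for p in Path(old_dir).glob(pattern)}
    new_names = {p.name for p in Path(new_dir).glob(pattern)}
    missing = sorted(old_names - new_names)
    fraction = len(missing) / len(old_names) if old_names else 0.0
    return missing, fraction
```

A returned fraction of about 0.005 to 0.01 would correspond to the 0.5 - 1% loss mentioned above.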

> Is the inference now faster when running the gaze estimation and blink estimation at the same time?

Not sure yet, still working on the models. Getting VGG-16 to train takes a long time compared to Resnet...

> In the future, do you plan a PyTorch version of RT-BENE?

Yes.

> Can the training be run on a "normal" GPU? What is the minimum requirement? Would it train on something like a 1070?

Oh yes, it trains fine, just with a smaller batch size, that's all.

> When using the PyTorch backend, can we get rid of the tensorflow dependency?

Yes, the pipeline (besides the blink estimation) wouldn't require tensorflow and thus can be removed as a dependency.

ahmed-alhindawi commented 4 years ago

> Before merging, I think we should briefly mention in the appropriate README files that there are two ways of doing training/inference now.

Agreed, but can we hold off until I have the models fully trained and in deployable storage? I don't want a user to think they can run on pytorch and then not have any trained models.

Tobias-Fischer commented 4 years ago

Looks pretty much ready to merge now. I agree that it's best to wait until the models are trained. Many thanks again!

Tobias-Fischer commented 4 years ago

Ahhh one thing: Do you have a script that does k-fold evaluation, too? Something equivalent to https://github.com/Tobias-Fischer/rt_gene/blob/pytorch_training/rt_gene_model_training/tensorflow/evaluate_model.py?

ahmed-alhindawi commented 4 years ago

> Ahhh one thing: Do you have a script that does k-fold evaluation, too? Something equivalent to https://github.com/Tobias-Fischer/rt_gene/blob/pytorch_training/rt_gene_model_training/tensorflow/evaluate_model.py?

Nope - will create one as soon as I can.
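
For context, the core of a k-fold evaluation over subject groups can be sketched in a few lines. This is a generic illustration and not the eventual script; the function names and the grouping of subjects into folds are hypothetical and do not reflect the dataset's actual fold assignment.

```python
import statistics


def k_fold_splits(groups):
    """Yield (train_subjects, test_subjects), holding out one group per fold."""
    for i, held_out in enumerate(groups):
        train = [s for j, g in enumerate(groups) if j != i for s in g]
        yield train, list(held_out)


def summarise(fold_errors):
    """Return the mean and population standard deviation of per-fold errors."""
    return statistics.mean(fold_errors), statistics.pstdev(fold_errors)


# Example with made-up subject IDs: every subject is tested exactly once.
example_groups = [[1, 2], [3], [4, 5]]
example_splits = list(k_fold_splits(example_groups))
```

Each fold would train (or load) a model on the training subjects, compute the error on the held-out subjects, and pass the per-fold errors to `summarise`.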

Tobias-Fischer commented 4 years ago

This PR fixes #46

ahmed-alhindawi commented 4 years ago

Sorry it's taken me a while; each model takes several days to train, but now we have 4 models (VGG) and thus the pytorch branch is now at feature parity with the tensorflow one.

Tobias-Fischer commented 4 years ago

Finally merged - many thanks @ahmed-alhindawi! Great work.

Tobias-Fischer / rt_gene

Pytorch training #63