dlibml / darknet

Darknet related stuff: ports of YOLO models to dlib
Boost Software License 1.0

Feature suggestions #3

Open pfeatherstone opened 3 years ago

pfeatherstone commented 3 years ago

Just putting some suggestions out there. Maybe we could organize these into projects.

pfeatherstone commented 3 years ago

Oh wait, I see: the visitor revealed a potential latency with tag layers.

pfeatherstone commented 3 years ago

Yeah I don't think saying -O3 will optimise away all the templates is accurate.

pfeatherstone commented 3 years ago

Need to run the model through a proper C++ profiler to quantify and qualify these latencies. I had no luck with gperf last time, and Google's Orbit just didn't build for me.

arrufat commented 3 years ago

I think this was the first time that I had to wrap an input layer with a tag. In the other cases, I never experienced any VRAM fluctuation. But I never tested adding 100 tags in a row to a network and comparing the performance...

davisking commented 3 years ago

All the templates don't have any runtime component, so lots of templates isn't causing runtime slowness. However, if you add a tag layer on top of an input layer you are asking that tag layer to make a copy of the inputs so that you can later call get_output() on the tag, which gives you a copy of the input tensor. That copy costs something. Without the tag no copy of the input tensor is made or retained in any form, and so it's faster. But you then can't go asking the network later what the inputs were.
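For illustration, a minimal sketch (a toy classifier, not one of the YOLO networks from this repo) of the two cases; the only difference between the two aliases is the tag1 wrapped around the input:

#include <dlib/dnn.h>

using namespace dlib;

// With tag1 wrapped around the input, the tag layer keeps a copy of the
// input tensor so it can be retrieved later, which costs an extra copy.
using tagged_net = loss_multiclass_log<fc<10, relu<fc<32, tag1<input<matrix<unsigned char>>>>>>>;

// Without the tag, no copy of the input tensor is made or retained.
using plain_net = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<unsigned char>>>>>>;

int main()
{
    tagged_net net;
    matrix<unsigned char> img = zeros_matrix<unsigned char>(28, 28);
    net(img);
    // Only possible because tag1 retained a copy of the input tensor:
    const tensor& input_copy = layer<tag1>(net).get_output();
    (void)input_copy;
}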

arrufat commented 3 years ago

Ah, it makes sense, thank you :)

arrufat commented 3 years ago

@pfeatherstone another thing worth trying would be to implement a nearest-neighbor up-sampling layer. (Darknet uses this kind of interpolation, but dlib uses bilinear interpolation, which certainly does not help in terms of speed)
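For reference, a minimal sketch of what nearest-neighbour interpolation does (on a single-channel dlib::matrix, not a proper layer over a tensor): a real layer would do this per channel in the forward pass and route each gradient back to the nearest source cell in the backward pass.

#include <dlib/matrix.h>

// 2x nearest-neighbour up-sampling: every destination pixel copies the
// closest source pixel, with no interpolation arithmetic at all.
dlib::matrix<float> upsample_nn_2x(const dlib::matrix<float>& src)
{
    dlib::matrix<float> dst(src.nr() * 2, src.nc() * 2);
    for (long r = 0; r < dst.nr(); ++r)
        for (long c = 0; c < dst.nc(); ++c)
            dst(r, c) = src(r / 2, c / 2);
    return dst;
}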

pfeatherstone commented 3 years ago

Yes! And it would match darknet's output more closely, which is desirable

pfeatherstone commented 3 years ago

But there are only 3 up sampling layers. So I doubt it will make much of a difference in terms of performance

pfeatherstone commented 3 years ago

Maybe it's this business with tag layers that's causing latencies.

davisking commented 3 years ago

Only tag layers at the input have any runtime component. So adding tags elsewhere will not change the speed.

pfeatherstone commented 3 years ago

Then I dunno where the latencies are.

pfeatherstone commented 2 years ago

I think some of the issues above have been addressed now, right, with the introduction of dlib::fuse_layers()? I don't know if it's worth doing more benchmarks to see if dlib is creeping up on darknet/PyTorch performance.

I don't know if this is the right place to suggest stuff, maybe a discussion group on the main dlib page would be more appropriate. But, on the subject of feature suggestions, did anyone have a look at pix2seq? It's a really cool new approach to object detection. It's using transformers like DETR but in a more NLP way. Really cool. I really like these intuitive objective functions like bipartite matching as well. They are a bit less faffy than YOLO losses.

This kind of stuff would require transformers to be added to dlib. That seems like a big piece of work. You need stuff like positional embeddings, tokenizers, multi-head attention, etc. But if it did make its way in, that would be cool.

arrufat commented 2 years ago

I implemented the YOLOR-P6 architecture (very similar to the YOLOv4-CSP family) except for the implicit addition and multiplication layers (which I didn't implement) and the speed seems comparable to that of PyTorch (from the official repository). I get about 50 FPS on PyTorch and 45 for dlib with an input size of 640×640 on a 1080Ti. Differences might be from the bilinear interpolation in the upsampling layers and the very first convolutional layer, which YOLOR has replaced with a reorg layer.

I did have a look at DETR, but haven't read the pix2seq paper yet. Having transformers in dlib is something I'd like to do at some point, but it's not my main focus these days, sadly enough.

pfeatherstone commented 2 years ago

I think the bipartite matching loss described in DETR wouldn't be too hard and would be a great addition to dlib. It's a great way to do direct set prediction. The main component is the Hungarian algorithm, which dlib already has implemented: dlib::max_cost_assignment(). You have to be careful when using it, though: I think it only makes sense when your model has global attention.
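For reference, a hedged sketch (not a DETR implementation) of the matching step using dlib::max_cost_assignment(). It assumes you have already built a square matrix of pairwise scores (e.g. IoU, padded with a constant if the number of predictions and ground truths differ); max_cost_assignment() maximizes an integer-valued cost, hence the scaling and rounding.

#include <dlib/matrix.h>
#include <dlib/optimization.h>
#include <vector>

// Match predictions to ground-truth boxes given a square matrix of pairwise
// scores, where higher means a better match.
std::vector<long> match_predictions(const dlib::matrix<double>& scores)
{
    // max_cost_assignment() wants integer costs, so scale and round.
    const dlib::matrix<long> cost = dlib::matrix_cast<long>(dlib::round(scores * 1000));
    // assignment[i] is the ground-truth index assigned to prediction i.
    return dlib::max_cost_assignment(cost);
}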

arrufat commented 2 years ago

@pfeatherstone check this out! I've added matching definitions for all YOLOv5 models in dlibml/dnn. The only thing that's missing is the SILU activation (which is quite easy to add). I will probably make a PR to dlib one of these days.

I still think dlib is unrivaled in terms of defining neural networks compactly.

pfeatherstone commented 2 years ago

@arrufat yeah that looks cool! Really great work. I mean, I still prefer the PyTorch syntax to be honest. I think MNN from Alibaba has a similar style of definition, and TensorFlow 2 has gone that way too. I find it easier to follow and to reason about. Also, the super templated architecture makes it take far too long to compile. I find it's great for deployment but not great for research. But I'm a big fan of dlib in general and I'm all for adding stuff that will attract more users and ultimately make dlib better with more PRs. I still think you're the biggest dlib DNN user at the moment.

pfeatherstone commented 2 years ago

You know how YOLOv5 had a competition for exports? You could add a dlib export using a visitor or something. Lol

pfeatherstone commented 2 years ago

I wonder if C++20 modules will help with dlib DNN compilation. My understanding is that templates are only parsed once or something like that. I could be totally wrong though. Like, at the moment, a convolutional layer is probably compiled several hundred times because of all the templates. So maybe compilation will improve with future standards.

arrufat commented 2 years ago

You know how YOLOv5 had a competition for exports? You could add a dlib export using a visitor or something. Lol

Oh, I didn't know that. But having interoperability with other frameworks is something I'm really interested in. I made that dot visitor to make sure I could parse the network correctly and see where all the tensors come from and go to, but my ultimate (long-term) goal would be a visitor to export to ONNX. However, I still need to spend some time reading the ONNX format to be able to do it.

arrufat commented 2 years ago

I wonder if C++20 modules will help with dlib DNN compilation. My understanding is that templates are only parsed once or something like that. I could be totally wrong though. Like, at the moment, a convolutional layer is probably compiled several hundred times because of all the templates. So maybe compilation will improve with future standards.

Yes, I've also wondered how modules could improve dlib's dnn toolkit. It would be fun to release dlib v20 which required C++20.

pfeatherstone commented 2 years ago

Or maybe directly have an ONNX inference engine in dlib

pfeatherstone commented 2 years ago

That way you could train in PyTorch or TensorFlow and infer with dlib, which has fewer dependencies.

arrufat commented 2 years ago

Or maybe directly have an ONNX inference engine in dlib

I thought about using oneDNN instead. I think just replacing the CPU implementation of the convolution (when oneDNN is available) would be a huge improvement.

pfeatherstone commented 2 years ago

I wonder if C++20 modules will help with dlib DNN compilation. My understanding is that templates are only parsed once or something like that. I could be totally wrong though. Like, at the moment, a convolutional layer is probably compiled several hundred times because of all the templates. So maybe compilation will improve with future standards.

Yes, I've also wondered how modules could improve dlib's dnn toolkit. It would be fun to release dlib v20 which required C++20.

Maybe there could be a dnn2 namespace which uses modules and is optionally compiled in CMake. CMake would have to support modules though, which I don't think it does.

pfeatherstone commented 2 years ago

I was reading up on the Eigen library. It has a tensor module, originally written by the TensorFlow team, which supports all the usual tensor ops including broadcasting, lazy evaluation, etc., AND it supports CUDA and SYCL. Eigen is optimised for pretty much every CPU as well. I think leveraging Eigen would require a new high-level API; I don't think it would fit well in the current dnn module.

davisking commented 2 years ago

Heh, I am not a fan of Eigen to put it mildly.

pfeatherstone commented 2 years ago

@davisking really? I think I'm aware of most of your concerns but I think they are largely dealt with now. It has a cost model, similar to dlib's in dlib::matrix, so it knows when to insert temporaries, and likewise inserts temporaries for ops like multiplication and convolution. So it seems pretty safe now.

pfeatherstone commented 2 years ago

I could be wrong. There is still a section on when to use eval() and noalias(), but I think in most cases you're fine. I think you have to know when to use eval() in ops like sum() or maximum() in long expressions. Not because it yields incorrect results, but because it could be slow. I've used it in a few places and it was totally fine and actually blazing fast, particularly on ARM architectures where other frameworks aren't so optimised.

pfeatherstone commented 2 years ago

@davisking I would be interested in reading your thoughts on all the dlib DNN stuff, even if you're biased being its author. Do you still use it in anger, or do you use other frameworks? Do you have a roadmap for where it could go, or a new API if you had the time to work on it? What are your thoughts on the compilation problem with the current API? Does it bother you or not at all? My current blocker with the API is that writing a new loss function or module is not so trivial, whereas in torch it's painfully simple. The main reason is that it has really good autograd support. Do you foresee autograd being a good addition to dlib to help with that? JAX has another really nice approach. Do you have any thoughts on that? A general brain dump would be a good read if you had the time.

davisking commented 2 years ago

On Eigen

Eigen doesn't really know when to insert temporaries. Go run temp = temp.transpose() or any variety of expressions that involve the same matrix on the left and right hand sides of the = and see what happens. It is fairly magical to know exactly when the result will be garbage or not garbage. So that's one thing. I view silently yielding incorrect results extremely negatively. Especially when it shouldn't be hard to do it right (dlib does it right for instance and last I looked without any performance impact).
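For illustration, a minimal Eigen sketch (not from the thread) of the aliasing case described above, together with the usual workarounds:

#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::MatrixXd a(2, 2);
    a << 1, 2,
         3, 4;

    // Aliasing bug: the lazy transpose writes into 'a' while still reading
    // from it, so elements are overwritten before they are used.
    // a = a.transpose();

    // Workarounds: force a temporary with eval(), or use the in-place API.
    a = a.transpose().eval();
    // a.transposeInPlace();

    std::cout << a << std::endl;
    return 0;
}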

I think all this roots back to the really horrific mess that is Eigen's codebase. The whole thing where all matrix expressions have to be member functions of the mega matrix class is crazy. It makes it prohibitively hard for Eigen to do things well (like aliasing). It also makes it basically impossible for third parties to write their own matrix expressions. Again, that's something that should be easy. But the design of Eigen flagrantly violates basic software engineering practices like the Open Closed principle. In particular, it should be possible to make new matrix expressions without modifying Eigen, but it isn't (I know about the macro you can edit that causes another file to be pasted into the middle of the Eigen matrix class to sort of allow a third party to "extend" Eigen, that is a gnarly hack).

Another thing is the benchmarks claiming Eigen is fast are largely specious. Eigen is most definitely not faster at doing things like matrix multiplication than BLAS libraries like the IntelMKL. But IDK, maybe it has an ARM pathway that's useful for some users. I haven't used it there. I have used Eigen in real code though (because other people than me initially started those codebases, obviously in the cases where I was at the start of things we didn't use Eigen ;) ) but admittedly not on an ARM. So I can't speak to that.

There are other problems I've had with Eigen that are escaping me at the moment. Odd behavior of some of the linear algebra tooling I think. I also think the Eigen documentation is bad. It is not at all easy to find out what methods are available and how exactly to use them, what their contracts are, and so forth. Yes I know about the quick reference guide and all that. But from seeing dlib, you should have a sense for what I consider a proper documented API contract, and Eigen certainly does not live up to those expectations ;)

On Deep Learning in dlib

Heh, yeah so I use torch at work and I think torch is great. I'm a big advocate of torch. It's really well put together. Frankly at the time I made the DNN tooling in dlib I was optimizing for slightly the wrong things (obviously). In retrospect I should have made something that was more runtime defined. Basically I would make a lightweight version of torch with a nicer C++ API (the torch C++ API is ok though, not great, but it doesn't offend me, which is saying a lot). I think torch has won the battle for the deep learning tooling space. I've also used tensorflow a bunch and my opinions on tensorflow are wildly negative. I have a bet with a friend that at some point Google will switch to torch. We will see if I win that bet :)

Anyway, I'm not sure what I'm going to do with dlib in the future. I'm way too absorbed in working on self driving right now. Any spare time I have goes into that. But generally speaking, I will certainly return to more open source development at some point. Although I'm not sure it would be to polish deep learning software. Who knows. The stuff that I enjoy is generally more hardcore numerical optimization and statistics kind of stuff. Deep learning is many things but it is not that. But we will see where the ML field goes. I'm still holding out hope that something better than deep learning will surface, since deep learning is at some level a really dumb brute force technique. And the world of ML is much more vast than DL. It's just that in some industries applications of DL are ascendant right now. But the slowness of DL parameter estimation, the inefficient use of data, and the wild computational expense of it is galling.

Anyway, who knows. I am not much for having a plan :)

pfeatherstone commented 2 years ago

Thanks for your reply. Interesting to read.

Eigen

I see your points. Whenever I use the same matrix on the LHS and RHS, I basically wrap the whole RHS with .eval() and then the whole problem goes away (I'm fairly sure). Otherwise I use a different matrix on the LHS. With regards to the implementation details, I don't really care because client code shouldn't really concern itself with that. So long as the API is good, which I believe it is, then I'm happy. But I see where you're coming from. The fact it supports multiple architectures, even GPU now, supports tensors of arbitrary rank, and all of that is header-only is a huge bonus I think. It's the closest thing to a C++ numpy library. I know there is xtensor but it's quite slow. Like adding two arrays is still faster using std::transform with std::plus compared to using the + operator on two xtensor arrays. So not a fan.

Torch

Yeah I'm a massive fan. However I don't like the C++ library as it is a massive dependency and the CMake integration doesn't always work for me. I prefer training in PyTorch, exporting to ONNX and inferring with onnxruntime. It's still a large library but compiling it from source is trivial and you can statically link everything. I really hope the PNNX converter in the NCNN library gets better as that is a really easy library to build and integrate. I'm a fan of that one too.

Deep learning

Yeah I agree, I don't like the fact you need to show a classifier a picture of a dog millions of times for it to understand what a dog is. It's stupid and obviously not right. I think transfer learning is going to be more heavily used in the future, and unsupervised or semi-supervised learning on unlabelled datasets will likely be key. We can't spend our lives labelling stuff. I think improvements in optimisers will also be required. Maybe using Hessians (second-order terms) will be cool. But I hear it's infeasible due to memory. Maybe in 10 years' time when NVIDIA ships GPUs with 1 TB of memory it won't be a problem. Hopefully we will be able to train useful models on a handful of samples, not billions. It's a problem at work. Clients ask: "Can we do this?". The answer is yes, but we can't/don't want to spend months labelling. Also, you would think labelling could be done by cheap labour (students, grads, etc.) but they rarely do a good job, don't pay attention, and then we have to curate it several times. It's a shit, time-consuming job that no one wants to do.

For me, I tend to work with high-quality synthetic data, which is naturally auto-labelled and gives us something to train on. Surprisingly, it works in the real world too.

Closing thoughts:

I'm also waiting for something better to come along...

pfeatherstone commented 2 years ago

Oh and finally, with regards to future dlib works, attracting more users and making it better, it seems that most of dlib's success is due to the face recognition stuff. A lot of the GitHub issues relate to python bindings and face recognition, implying that's what a lot of users are using it for. So it seems that what would attract more users isn't more utility code, but application code. For example, if the COCO SOTA model was trained in dlib, guaranteed you would get another 10k stars within a year. I know you shouldn't open source capability for the sake of it, nor should open source be a free software dev service, but it's something to consider when trying to get more users. But maybe that's of absolutely no concern. But usually more users means more peer review, more PRs, better libraries, better code. So if anybody out there has trained a really cool model with dlib, consider publishing and open sourcing.

davisking commented 2 years ago

Have you used Eigen GPU code? I would not use Eigen on the GPU. Like, what does Eigen do on the GPU?

Linear algebra is really a different thing than doing stuff with N-D arrays. Like numpy is fine for N-D array stuff, but I think numpy kinda sucks for linear algebra. I think Eigen is better, kinda, than numpy for linear algebra. But like I said, there are a bunch of issues. Anyway, you are talking about a bunch of different things. Like dlib::matrix is all about linear algebra, which fundamentally makes it a thing that talks about vectors and linear operators (all finite linear operators can be represented as a dlib::matrix, so within the context of linear algebra you don't want more than two dimensions). Heh, and don't get me started on tensors. We all say tensor now in the deep learning space, but before tensorflow "tensor" had a really specific meaning, which is a multi-linear map (see https://en.wikipedia.org/wiki/Tensor), which is not at all what all these tensor libraries are dealing in.

Anyway, you really should care about the code quality behind Eigen. It for instance prevents you from using stuff like the IntelMKL (or whatever other high quality BLAS and LAPACK backend you might like). So any large scale linear algebra done on Eigen ends up being terribly slow. Like go try computing an SVD with Eigen on a matrix that's 10000x10000 and see how long it takes. Then build dlib against the IntelMKL and call dlib::svd() and see the difference. It will not be small, especially if done on a powerful computer with a lot of cores. And as an aside, you should paste here the code you had to write to do an SVD in Eigen. The APIs for doing matrix decompositions in Eigen are crazy.
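A hedged sketch (not from the thread) of the comparison being suggested, assuming Eigen's BDCSVD and a dlib build linked against a fast LAPACK such as the Intel MKL; wrap each call in your favourite timer:

#include <Eigen/Dense>
#include <dlib/matrix.h>

void svd_both(long n)
{
    // Eigen: BDCSVD is its SVD intended for large matrices.
    const Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::BDCSVD<Eigen::MatrixXd> esvd(a, Eigen::ComputeThinU | Eigen::ComputeThinV);

    // dlib: dlib::svd() dispatches to BLAS/LAPACK when dlib is built with one.
    dlib::matrix<double> m = dlib::randm(n, n), u, w, v;
    dlib::svd(m, u, w, v);   // m == u*w*trans(v)
}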

Anyway, I don't get the impression you are after linear algebra though. Like my complaints about numpy and Eigen are that they are crappy linear algebra libraries. But numpy is hardly trying to be a linear algebra library. It's trying to be a big vectorized "do random things with arrays" library in python. Which is a very python thing since you can't write loops in python, since python is too slow. So you need a library that makes it easy to write "vectorized" code in python that's able to do mostly whatever you want. And for that numpy is great. That is, it's great within the limitations of something that has to be rendered to the user as a python API and is 99% not about linear algebra.

I think you should try torch in C++. It's great. Save a torchscript model and build libtorch as a shared library. Works great.

You can use better solvers now that use the hessian and all kinds of other clever approaches that don't require huge amounts of RAM. The most textbook is LBFGS, but there are others that are more powerful and still don't blow out a computer's RAM. There are whole textbooks in numerical optimization packed full of powerful methods that can do it. The problem is they result in models that are badly overfit. There is some regularizing property of SGD-like solvers that is often hand waved at but not really understood by anyone. The fact that SGD barely works as a solver helps a lot in avoiding overfitting. This has been known for decades, and during the early 2000s was one of the negs against neural networks. Since clearly these networks don't even know what they are optimizing. Since if you write down the nominal objective function used by deep models, and go minimize it with a real solver (i.e. any solver you might find in a book like "Numerical Optimization" by Jorge Nocedal and Stephen Wright, say) you will find it quickly locates some low training error solution but when applied to test data it's just awful.
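For reference, a minimal sketch (not from the thread) of dlib's LBFGS solver on a toy objective; the objective and gradient below are stand-ins:

#include <dlib/optimization.h>

int main()
{
    using column_vector = dlib::matrix<double, 0, 1>;

    // Toy smooth objective: f(x) = ||x||^2, minimized at the origin.
    auto objective = [](const column_vector& x) { return dlib::dot(x, x); };
    auto gradient  = [](const column_vector& x) { return column_vector(2 * x); };

    column_vector x(3);
    x = 3.0, -2.0, 7.0;

    // LBFGS keeps a short history of gradients (here 10) instead of a full
    // Hessian, so memory use stays modest even with many parameters.
    dlib::find_min(dlib::lbfgs_search_strategy(10),
                   dlib::objective_delta_stop_strategy(1e-9),
                   objective, gradient, x, -1);
    // x is now approximately the zero vector.
}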

But yeah, I don't know what the next thing will be. It's not going to be DL though. Like you said, the data labeling requirement is crazy. Heh, when my kid was 3 years old I took him to the library one time and they were doing this thing where they take an overhead projector and put objects on it, casting their shadows onto the wall. Then they were asking the little kids what it was. So they would put a screw driver on it and the kids look at the pattern of shadow on the wall and go "that's a screw driver!". Or even more astoundingly (from a DL computer-vision perspective), they put a stylized stencil of an owl or dinosaur or whatever on it, that gets mapped to this grayscale rendering onto the wall, and the children, who have never in their lives seen anything like that kind of visual input, nevertheless all immediately go "that's an owl!". That kind of thing is far out of reach of current computer vision, much to my chagrin.

Another problem with DL is the inference process is mad slow. Like if you wanted to make a CNN that was equivalent to running HOG followed by a linear filter (i.e. what get_frontal_face_detector() does) it would be profoundly slower to execute on a computer because of how convs are implemented. Like go look at the code for HOG filtering and then go look at code that does convs and relu. The wasted compute is just crazy.

davisking commented 2 years ago

Oh and finally, with regards to future dlib works, attracting more users and making it better, it seems that most of dlib's success is due to the face recognition stuff.

Heh, yeah, and that's just meant to be an example program/docs. There are a lot of quiet users though who use other things in dlib. That's been the case for a long time. But yeah, the face stuff gets the most clicks.

A lot of the GitHub issues relate to python bindings and face recognition, implying that's what a lot of users are using it for.

Maybe. That's what a lot of people who are new to programming are using dlib for. But there are a lot of professionals using dlib too and they don't have those problems. 99% of those issues are of the form "I don't know how to install visual studio and I'm not sure what a compiler is". Which is understandable for someone who is just getting into programming and has only interacted with python. That's never been the target audience of dlib (at least not for me anyway).

So it seems that what would attract more users isn't more utility code, but application code. For example, if the COCO SOTA model was trained in dlib, guaranteed you would get another 10k stars within a year. I know you shouldn't open source capability for the sake of it, nor should open source be a free software dev service, but it's something to consider when trying to get more users. But maybe that's of absolutely no concern. But usually more users means more peer review, more PRs, better libraries, better code. So if anybody out there has trained a really cool model with dlib, consider publishing and open sourcing.

I don't care about getting more users. Never have. If I had I would have tried to push dlib's linear algebra library, which predates Eigen, and Eigen likely wouldn't exist at all :)

What I want is more useful free tools for people to use. And in particular, more widely applicable general purpose powerful tools. Like for me, the stuff I like the most about dlib are the linear algebra and numerical optimization tools. Those are very powerful general tools that lots of people use on a very wide range of problems. And importantly it lets them solve problems they otherwise would just not be able to solve. They are also durable time tested things. Like LBFGS is going to be with humans for a long time. Some algorithms are just really great and just the answer to a range of problems. One of the things that bothers me about deep learning is I constantly think it's going to be replaced by something fundamentally better. But BFGS, FFTs, SVDs, QP solvers, and stuff like that are forever :). In the distant future, when humans have space ships on the other side of the galaxy, those ships will have code in them for doing matrix decompositions and solving QPs. I very much doubt they will have some CNN/SGD stuff in them.

davisking commented 2 years ago

Like here are two cool things I would like to put in dlib when I get some time. I would like a really general purpose sequential quadratic programming solver. That would be super great. And I would also like more optimal control tooling. In particular, there is the MPC stuff in dlib now, but that could be fleshed out a lot. And for instance, solving an MPC problem in real time using an actual optimizer is hard and slow. But for a non-trivial set of domains, you can actually just precompute all the inputs and outputs that could ever happen and store them in some appropriate table. Or in some other compact form.

This is where someone says "train a DNN to mimic the MPC solver". Which I am not a fan of. What's good is if there is a clear mathematical proof that bounds the error between what an optimal MPC solver would yield and what the super fast and compact table (or whatever it is) outputs. This is easy in some cases where you can literally just have an interpolated lookup table. But it can be more complicated. Like it's important that some large powerful machine that's being controlled not have some weird mode it can get into where it goes crazy. Which is what you would get with some hand wavy "the DNN is replicating it x% of the time" kind of thing.

arrufat commented 2 years ago

As always, Davis, thank you so much for providing us with these insights. The problem I have with dlib is that it has spoiled me: whenever I use some other library, I expect the same amount of thought put into their APIs and code quality. So, I try to add that missing function I needed to dlib. Moreover, working on ML using C++, I hardly ever need any dependency besides dlib: it has all the needed tools, and the missing ones are easy to build using existing functionality. In Python, it's a completely different story…

Currently, we use deep learning at work a lot, so that's why I contribute almost exclusively to that part of the library. If you think I am cluttering the dnn part, please, let me know, I understand that you might not want all the PRs I submit, no hard feelings :).

I am also curious to see where ML leads us. Hopefully, next time I won't arrive late to the party (like what happened with Deep Learning in dlib) and I will be able to contribute to the future of ML since the beginning :)

pfeatherstone commented 2 years ago

I do enjoy reading software rants :) I'm going to read about LBFGS as you mentioned it a couple times and therefore must be useful. With regards to torch, I've tried building it from source a couple times and failed. It's also super hard to cross compile. I tried building it in qemu and on a RPI4 and failed. I'll try again. Honestly, I think onnxruntime is easier and faster (than PyTorch. Don't know about libtorch). Your comment about your child being an expert computer vision algorithm is telling. I do wonder if that is transfer learning on unsupervised data though. Learn from experience... But don't know. It could be he/she is an expert solver already :)

arrufat commented 2 years ago

I do enjoy reading software rants :)

Oh, so I am not the only one, then :P

I've used LBFGS from dlib in the past to find the optimal temperature scaling parameter of a softmax for model calibration, and it's outstandingly fast (basically, I reimplemented this in dlib, and it was night and day). At that point I contributed several features I needed.

davisking commented 2 years ago

Currently, we use deep learning at work a lot, so that's why I contribute almost exclusively to that part of the library. If you think I am cluttering the dnn part, please, let me know, I understand that you might not want all the PRs I submit, no hard feelings :).

Na, your PRs are great. Keep them coming :)

And honestly, if I was not busy with other stuff right now I would probably add a runtime auto-diff thing to dlib so that DNNs could be defined at runtime (i.e. without the template stuff that makes compile times long and admittedly is not as simple to use as a simple runtime auto-diff tool). I considered doing that in the beginning, but at the time (many years ago now) didn't think people would be interested in writing the wildly diverse networks that are common now. So something more narrow seemed sensible. But in retrospect a runtime auto-diff API would totally have been better. Like imagine some kind of generic API that uses type erasure to stack function calls on top of each other and keeps back pointers for traversal, while at the same time trivially leaving all the intermediate outputs as named variables in user code. So basically pytorch, but with a nicer C++ API and easy to build :)
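For illustration, a hedged, minimal sketch (not dlib code, and not a proposed API) of the "back pointers plus type erasure" idea for scalar values: each node stores pointers to its inputs and a type-erased backward function (here just std::function), so the graph can be traversed and differentiated at runtime.

#include <functional>
#include <iostream>
#include <memory>
#include <set>
#include <vector>

struct node
{
    double value = 0;
    double grad = 0;
    std::vector<std::shared_ptr<node>> parents;          // back pointers
    std::function<void(node&)> backward = [](node&) {};  // type-erased op
};
using var = std::shared_ptr<node>;

var constant(double v) { auto n = std::make_shared<node>(); n->value = v; return n; }

var operator*(var a, var b)
{
    auto n = std::make_shared<node>();
    n->value = a->value * b->value;
    n->parents = {a, b};
    n->backward = [a, b](node& self) {
        a->grad += b->value * self.grad;
        b->grad += a->value * self.grad;
    };
    return n;
}

var operator+(var a, var b)
{
    auto n = std::make_shared<node>();
    n->value = a->value + b->value;
    n->parents = {a, b};
    n->backward = [a, b](node& self) {
        a->grad += self.grad;
        b->grad += self.grad;
    };
    return n;
}

void backprop(const var& out)
{
    // Topologically order the graph so each node's backward runs only after
    // all of its gradient contributions have been accumulated.
    std::vector<var> order;
    std::set<node*> seen;
    std::function<void(const var&)> build = [&](const var& n) {
        if (!seen.insert(n.get()).second) return;
        for (auto& p : n->parents) build(p);
        order.push_back(n);
    };
    build(out);
    out->grad = 1;
    for (auto it = order.rbegin(); it != order.rend(); ++it)
        (*it)->backward(**it);
}

int main()
{
    auto x = constant(3), y = constant(4);
    auto z = x * y + x;                              // z = x*y + x
    backprop(z);
    std::cout << x->grad << " " << y->grad << "\n";  // prints 5 3
}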

I am also curious to see where ML leads us. Hopefully, next time I won't arrive late to the party (like what happened with Deep Learning in dlib) and I will be able to contribute to the future of ML since the beginning :)

The party never stops.

davisking commented 2 years ago

I do enjoy reading software rants :) I'm going to read about LBFGS as you mentioned it a couple times and therefore must be useful. With regards to torch, I've tried building it from source a couple times and failed. It's also super hard to cross compile. I tried building it in qemu and on a RPI4 and failed. I'll try again. Honestly, I think onnxruntime is easier and faster (than PyTorch. Don't know about libtorch). Your comment about your child being an expert computer vision algorithm is telling. I do wonder if that is transfer learning on unsupervised data though. Learn from experience... But don't know. It could be he/she is an expert solver already :)

Heh, yeah, building torch on a RPI4 is probably a pain.

Yeah, it's totally that humans build up some kind of world model over many years. And that model is able to generalize in this amazing way when presented with tiny increments of data.

davisking commented 2 years ago

Oh and here is one example of why it matters that matrix expressions be user definable. So in https://github.com/davisking/dlib/blob/master/dlib/optimization/optimization_solve_qp2_using_smo.h#L111 there is a QP solver. That's the QP solver used interior to some of the SVM codes. Many SVM training codes would run out of RAM (and just generally be slow) if they naively used that solver. That's because the Q matrix is normally huge. Number of training samples by number of training samples.

So what SVM solvers (the ones that use kernels) tend to do is they use an LRU cache to store only some of the columns of Q in memory and lazily compute new ones as needed. Keeping only some max number in memory at a time. But that solver doesn't know anything about that. It's just a vanilla QP solver that minds its own business and does QP solving. Importantly though you can see it references Q as a matrix expression.

So now client code, like https://github.com/davisking/dlib/blob/master/dlib/svm/svm_nu_trainer.h#L185, can just do symmetric_matrix_cache<float>(diagm(y)*kernel_matrix(kernel_function,x)*diagm(y)) to make Q, which stacks two custom matrix expressions together to make an LRU-cached kernel matrix. And it all just works even though none of these individual functions know about each other. If you look at other QP solvers for SVMs they are crazy. Super long and tangled together. But having good matrix expression support makes it all way cleaner.
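For illustration, a hedged, self-contained sketch of that composition with toy data and an RBF kernel (the kernel parameter and the 100 MB cache size below are made up); the point is that Q is never materialized as a full N x N matrix:

#include <dlib/matrix.h>
#include <dlib/svm.h>
#include <vector>

int main()
{
    using sample_type = dlib::matrix<double, 2, 1>;
    using kernel_type = dlib::radial_basis_kernel<sample_type>;

    // Toy training data and labels.
    sample_type a, b, c;
    a = 1, 2;  b = 3, 4;  c = 5, 6;
    const std::vector<sample_type> x = {a, b, c};
    dlib::matrix<double, 0, 1> y(3);
    y = +1, -1, +1;

    // Three matrix expressions stacked together: entries of Q are computed
    // on demand and kept in an LRU cache, so the full kernel matrix is never
    // built. A QP solver can reference the expression in exactly this form.
    const float q01 = dlib::symmetric_matrix_cache<float>(
        dlib::diagm(y) * dlib::kernel_matrix(kernel_type(0.1), x) * dlib::diagm(y),
        100 /* cache size in megabytes */)(0, 1);
    (void)q01;
}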

arrufat commented 2 years ago

That's precisely what I was referring to by you spoiling us with dlib.

pfeatherstone commented 2 years ago

@davisking talking about type erasure, I was thinking of adding a general purpose type erasure module similar to boost-ext TE or Dyno, but targeting C++11. What do you think? Hopefully it won't require a concept map like what the Dyno library does and will be succinct like the TE library. I've got it working with C++14 but think it should be possible to have a C++11 version. So you could have polymorphic objects on the stack without inheritance. This is really useful in API design: you avoid loads of smart pointers and inheritance and can use good old C++ objects.
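For reference, a minimal sketch of the classic type-erasure pattern (Sean Parent style, not the boost-ext TE / Dyno approach and not a proposed dlib API): user types need no common base class, and the inheritance is hidden inside the wrapper. A real module would likely use small-buffer storage on the stack instead of a shared_ptr.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A type-erased wrapper: anything streamable to std::cout fits, with no base
// class required on the user's side.
class printable
{
public:
    template <typename T>
    printable(T obj) : self(std::make_shared<model<T>>(std::move(obj))) {}

    void print() const { self->print_(); }

private:
    struct concept_
    {
        virtual ~concept_() = default;
        virtual void print_() const = 0;
    };

    template <typename T>
    struct model : concept_
    {
        model(T obj) : data(std::move(obj)) {}
        void print_() const override { std::cout << data << "\n"; }
        T data;
    };

    std::shared_ptr<const concept_> self;
};

int main()
{
    // int and std::string share no base class, yet both fit in printable.
    const std::vector<printable> items = {42, std::string("hello")};
    for (const auto& p : items) p.print();
}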

pfeatherstone commented 2 years ago

@davisking jumping back to DL, have you tried https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html with your torch models? I haven't. If so, what is your experience with it? There is also https://github.com/hjmshi/PyTorch-LBFGS

pfeatherstone commented 2 years ago

@arrufat just realised, you've completed all the suggestions in the original post. Great work!

arrufat commented 2 years ago

Oh, I didn't notice, it just kind of happened… Thank you!

pfeatherstone commented 2 years ago

I think I would add:

Just for future reference. I might pull my finger out at some point and do some of these. No idea how you do the backprop of bipartite matching because I'm so used to autodiff doing everything for me.

pfeatherstone commented 2 years ago

-rw-r--r--  1 adria 1.9M Jan 18 23:43 libyolov3.a
-rw-r--r--  1 adria 4.8M Jan 18 23:43 libyolov4.a
-rw-r--r--  1 adria 5.2M Jan 18 23:43 libyolov4_sam_mish.a
-rw-r--r--  1 adria  15M Jan 18 23:45 libyolov4x_mish.a

Figure out why a single model compiles to 15MB of machine code. I would have thought gcc/clang would optimise better than that.

arrufat commented 2 years ago

Figure out why a single model compiles to 15MB of machine code. I would have thought gcc/clang would optimise better than that.

I think, in part, it's because of the name mangling. I've made a simple executable:

#include "detection/yolov5.h"

int main()
{
    yolov5::train_type_l net;
}

The executable takes about 2.7 MiB (if I strip it, it takes 2 MiB). Then I keep adding some calls after the network declaration:

#include "detection/yolov5.h"

int main()
{
    yolov5::train_type_l net;
    auto loss = net.loss_details();
    auto subnet = net.subnet();
    auto input = net.input_layer();
    const auto& t = net.get_final_data_gradient();
    net.set_gradient_inputs_to_zero();
    net.clean();
}

Now the executable is 3.2 MiB (if I strip it, it takes 2.1 MiB). If you inspect the binary file with objdump, you'll see the name mangling of these functions, which look something like this (this is for the clean method, as you can see at the very end of the line, but you can find other methods/functions as well):

   2b975:   e8 06 aa 03 00          call   66380 <_ZN4dlib9add_layerINS_4sig_ENS0_INS_4con_ILl1ELl1ELl1ELi1ELi1ELi0ELi0EEENS0_INS_11leaky_relu_ENS0_INS_3bn_ILNS_10layer_modeE0EEENS0_INS2_ILl1024ELl1ELl1ELi1ELi1ELi0ELi0EEENS0_INS_7concat_IJNS_4tag8ENS_4tag9EEEENS_13add_tag_layerILm9ENS0_IS4_NS0_IS7_NS0_INS2_ILl512ELl1ELl1ELi1ELi1ELi0ELi0EEENS_14add_skip_layerINS_4tag7ENSD_ILm8ENS_6repeatILm3EN6yolov53defINS_10leaky_reluENS_6bn_conELl1ELl1ELl1ELl1EE13bottleneck_x8ENS0_IS4_NS0_IS7_NS0_ISE_NSD_ILm7ENS0_INS9_IJNS_4tag1ENS_4tag5EEEENSD_ILm1ENS0_IS4_NS0_IS7_NS0_INS2_ILl512ELl3ELl3ELi2ELi2ELi1ELi1EEENSF_INS_4tag2ENSD_ILm4004ENS0_IS1_NS0_IS3_NSD_ILm2ENS0_IS4_NS0_IS7_NS0_ISE_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_INS2_ILl256ELl1ELl1ELi1ELi1ELi0ELi0EEENSF_ISG_NSD_ILm8ENSH_ILm3ENSM_13bottleneck_x4ENS0_IS4_NS0_IS7_NS0_IST_NSD_ILm7ENS0_INS9_IJSO_NS_4tag4EEEENSD_ILm1ENS0_IS4_NS0_IS7_NS0_INS2_ILl256ELl3ELl3ELi2ELi2ELi1ELi1EEENSF_ISS_NSD_ILm4003ENS0_IS1_NS0_IS3_NSD_ILm2ENS0_IS4_NS0_IS7_NS0_IST_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_INS2_ILl128ELl1ELl1ELi1ELi1ELi0ELi0EEENSF_ISG_NSD_ILm8ENSH_ILm3ENSM_13bottleneck_x2ENS0_IS4_NS0_IS7_NS0_ISY_NSD_ILm7ENS0_INS9_IJSO_NSI_5ptag3EEEENSD_ILm1ENS0_INS_9upsample_ILi2ELi2EEENSD_ILm4ENS0_IS4_NS0_IS7_NS0_IST_NS0_IS4_NS0_IS7_NS0_ISE_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_IST_NSF_ISG_NSD_ILm8ENSH_ILm3ESU_NS0_IS4_NS0_IS7_NS0_IST_NSD_ILm7ENS0_INS9_IJSO_NSI_5ptag4EEEENSD_ILm1ENS0_IS13_NSD_ILm5ENS0_IS4_NS0_IS7_NS0_ISE_NS0_IS4_NS0_IS7_NS0_IS8_NS0_INS9_IJSO_SS_NS_4tag3ESV_EEENSD_ILm4ENS0_INS_9max_pool_ILl5ELl5ELi1ELi1ELi2ELi2EEENSD_ILm3ENS0_IS19_NSD_ILm2ENS0_IS19_NSD_ILm1ENS0_IS4_NS0_IS7_NS0_ISE_NSD_ILm7005ENS0_IS4_NS0_IS7_NS0_IS8_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_ISE_NSF_ISG_NSD_ILm8ENSH_ILm3ENSM_16resbottleneck_x8ENS0_IS4_NS0_IS7_NS0_ISE_NSD_ILm7ENS0_IS4_NS0_IS7_NS0_INS2_ILl1024ELl3ELl3ELi2ELi2ELi1ELi1EEENSD_ILm7004ENS0_IS4_NS0_IS7_NS0_ISE_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_IST_NSF_ISG_NSD_ILm8ENSH_ILm9ENSM_16resbottleneck_x4ENS0_IS4_NS0_IS7_NS0_IST_NSD_ILm7ENS0_IS4_NS0_IS7_NS0_ISR_NSD_ILm7003ENS0_IS4_NS0_IS7_NS0_IST_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_ISY_NSF_ISG_NSD_ILm8ENSH_ILm6ENSM_16resbottleneck_x2ENS0_IS4_NS0_IS7_NS0_ISY_NSD_ILm7ENS0_IS4_NS0_IS7_NS0_ISX_NS0_IS4_NS0_IS7_NS0_ISY_NS0_ISC_NSD_ILm9ENS0_IS4_NS0_IS7_NS0_INS2_ILl64ELl1ELl1ELi1ELi1ELi0ELi0EEENSF_ISG_NSD_ILm8ENSH_ILm3ENSM_16resbottleneck_x1ENS0_IS4_NS0_IS7_NS0_IS1E_NSD_ILm7ENS0_IS4_NS0_IS7_NS0_INS2_ILl128ELl3ELl3ELi2ELi2ELi1ELi1EEENS0_IS4_NS0_IS7_NS0_INS2_ILl64ELl6ELl6ELi2ELi2ELi2ELi2EEENS_15input_rgb_imageEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEEEvEEEEvEEvEEvEEvEEvEEvEEvEEvEEvEEvE5cleanEv>

So, as we keep calling methods on the network or functions that take it as a parameter (such as operator<<), their mangled names keep adding to the size of the binary file (since the types of these networks are really long). As a result, serializing, printing the network, changing the number of filters in a convolution, calling a visitor, or accessing a particular layer with layer<idx>(net) will make the binary file larger and larger.

I guess it's just the way it is…

EDIT: if I use yolov5::train_type_n, that is, the smallest YOLOv5 variant, the size doesn't change, since these networks all have the same definition in my implementation: the only things that change are the width (filters per convolutional layer) and the depth (the count in each repeat layer, which you can see in the mangled name). This means they compile equally fast and generate the same binary size.

So, concluding, one way to reduce the binary size is to use repeat layers as much as possible. My YOLOv4x model could be improved in that sense. On the other hand, I think the YOLOv5 implementation is as good as it gets.
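For reference, a hedged sketch of the repeat-layer idea with a toy block (not one of the YOLO definitions): the repeated block is named once and reused, instead of being spelled out N times in the network type.

#include <dlib/dnn.h>

// One building block, parameterized on its input.
template <typename SUBNET>
using block = dlib::relu<dlib::con<32, 3, 3, 1, 1, SUBNET>>;

// repeat<8, block, ...> stacks 8 copies of block while keeping the block
// type written out only once, which keeps the mangled names shorter.
using net_type = dlib::loss_multiclass_log<
    dlib::fc<10,
    dlib::repeat<8, block,
    dlib::input<dlib::matrix<unsigned char>>>>>;

int main()
{
    net_type net;
}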