Contribute, refactoring suggestions, modernize C#, DataSet, DataLoader

nietras commented 2 years ago

Hi @NiklasGustafsson, I would like to contribute to TensorSharp as we are looking at using it to replace our CNTK usage in our full end-to-end machine learning pipelines written in C#. I have been looking at this example repo in that regard, which is a great starting place. I've mainly work with image models so I've been looking at the CIFAR10 example. I understand that these examples have been created quickly and that they are bare-bones, I would like to improve them. :)

For example, I have a few issues with the Readers like CIFAR10Reader and how they both randomize data by pre-defining randomized batches, which is not normally how you would do this, you create unique random batches for each epoch. Similarly, an epoch would usually (nothing is standardized here and you really can do anything you'd like so this is just IMHO) be defined by iterating our the samples of the dataset once, not by adding transforms after and hence multiplying by that like:

        public IEnumerable<(Tensor, Tensor)> Data()
        {
            for (var i = 0; i < data.Count; i++) {
                yield return (data[i], labels[i]);

                foreach (var tfrm in _transforms) {
                    yield return (tfrm.forward(data[i]), labels[i]);
                }
            }
        }

Also you wouldn't "transform" or augment the data if it is test data, of course you can then just not set the transforms.

Anyway, I was thinking my first contribution could be to refactor the readers to and implement concepts similar to pytorch DataSet and DataLoader. I have worked with this API but am not an expert, nor I am necessarily a fan of the python APIs, but it seems you'd like to have TorchSharp be similar to pytorch, so basing it on that makes sense. Would that be of any interest?

Before doing this I would very much like to migrate this example repo to .NET 6 and C# 10 too and follow standard C# code guidelines and use modern language features, to really make the examples shine with regards to C#. Since performance is my passion I'd also like the examples to at least minimally try to be efficient about what happens, even in cases where it does not matter so much.

Just an example below, I would replace the below with a proper Fischer-Yates shuffle, that is easy to implement.

Enumerable.Range(0, count).OrderBy(c => rnd.Next()).ToArray();

Reproducibility is important too, so all random stuff should be seeded.

Sorry, I am sure you know all this, but I wanted to at least ask whether such changes are of interest first? If you guys agree with them?

To recap I propose:

Migrate to .NET 6 and C# 10
Refactor readers based on a DataSet and DataLoader concept (rough draft)
- Address minor various issues as part of this

And we can take it from there.

NiklasGustafsson commented 2 years ago

@nietras -- I think that would be great!

I'm taking this coming week off, but let's discuss the week after.

A couple of initial reactions:

It would be great to get help here.
Yes, the data loading in the examples was a hack to get to the part where we demonstrate building models, training, etc. That's why I kept them in the examples rather than in torchvision, where they belong.
As a design principle, we should stick close to Python when it makes sense for the "transportability" of code, but we should stay true to .NET when possible. Just something to keep in mind when designing this -- we don't need it to look like PyTorch just for the sake of it, only if it makes transporting code easier for users. For example, I think it is more important that the data readers compose well with .NET iteration constructs than follow the PyTorch iterator designs.
The API should be designed to work properly and beautifully across C# and F#.
When designing the API/implementation, we should consider torchtext, as well.

I'm reluctant to agree to moving things to .NET 6 as a dependency. Lots of users will be slow to update to the latest and greatest, and we need breadth. Just FYI -- I've had people push for .NET Standard 2.0 as the baseline for TorchSharp, but I thought that was going too far back.

Also, if you contribute, please introduce yourself to the community here: Welcome to TorchSharp Discussions!. It's always nice for users to know the folks who helped produce the stuff they're using.

GeorgeS2019 commented 2 years ago

@NiklasGustafsson => @nietras is an ONNX veteran. His project on ONNXSharp and repos

@nietras We have ongoing discussion related to ONNX. Hope you find that of interest to you and you can share your opinion.

NiklasGustafsson commented 2 years ago

That's great!

nietras commented 2 years ago

@NiklasGustafsson if you are interested we could do an introductory Teams meeting when you're back from vacation. I have a lot of questions and ideas too.

As @GeorgeS2019 mentions I have a very strong interest in TorchSharp getting support for reading/writing onnx files, without that we can't really use it at my work. I wouldn't call myself an onnx veteran though 😅 I am an ML veteran for sure, though. With lots of real world experience. We use onnx runtime for inference in dotnet apps. So everything's basically dotnet. ❤️

nietras commented 2 years ago

@NiklasGustafsson regarding dotnet 6, I think what examples target and what the underlying library targets are two separate things. I can understand having the library target netstandard2.0 and multi target higher too, we still have netfx apps at my work. Although for a library not yet in 1.0 there should be some leeway.

An example repo though should focus on showing the library and platform at its best I think, as it would look now. It's a getting started repo after all, and people getting started should be able to install latest runtime to try out.

For dataloader etc. IAsyncEnumerable seems like a good match for example.

Overall, is multi targeting an option in library ? Only making some APIs available for higher targets? Span APIs are a must for example.

NiklasGustafsson commented 2 years ago

Span<> and System.Range are the essential reasons not to go too far back in compatibility, IMO.

NiklasGustafsson commented 2 years ago

@NiklasGustafsson if you are interested we could do an introductory Teams meeting when you're back from vacation. I have a lot of questions and ideas too.

Sure, that sounds good. My email address is available in my GH profile. Send me a request for early next week, sometime. I’m back in the office on the 29th, and on Pacific Time.

nietras commented 2 years ago

Span<> and System.Range are

As you are probably aware you can still use Span by using the System.Memory package with portable Span. Additionally, I have extensive experience with the issues of netstandard2.0 and netfx and why you really need to target net45 too in that case, e.g. if using from net461. You can multitarget your way around a lot of this, only have some APIs available on some targets and so on. Requires more work for sure, but it is possible.

Also I lend a small helping hand in the design and implementation of both Span and Unsafe in dotnet, so lots of insights there.

And thanks I'll get in touch.

nietras commented 2 years ago

Closing as simply an introduction. First one line PR at https://github.com/dotnet/TorchSharp/pull/461

dotnet / TorchSharpExamples

Contribute, refactoring suggestions, modernize C#, DataSet, DataLoader #12