szha opened this issue 5 years ago
@szha Really great proposal, and we may want to add some items for 2.0 too. Is there a timeline for 2.0?
Is there a plan to create a branch for the 1.x version and have master reflect 2.0, or to create a branch for the 2.0 version and keep master on 1.x for now?
@pengzhao-intel a tentative target date is by end of Q1 2020.
@zachgk we will create a branch for 2.0. Initially we will keep master to be 1.x and have 2.0 in a new branch. After 1.6 release we will revisit how to make the 2.0 branch the master.
Just a quick cheer for the new MXNet website... it's way more awesome and beautiful than I expected. Minor bugs are still there, though; for example, most of the links in the tutorials are broken. Anyway, great work so far.
Is there any plan to simplify building the C and C++ APIs for MXNet 2.0? It is hard (or very hard) to build a working version of MXNet with the C++ API on different platforms (Windows, Linux, macOS); every new release of MXNet may or may not break something, and we need to spend many hours figuring out how to make it work.
I am happy with the Python API, but not all tasks are suitable for Python. Almost every deep learning tool is based on C and C++, yet almost every one of them is difficult to use, or only partially works, from C and C++.
@stereomatchingkiss good point. What are you using c/c++ api for?
Maybe you could open a post asking users what they expect from a C or C++ API. I guess most of them only need the API to perform inference rather than training (Python does a great job at that); this should help you shrink the size of the libs and make the code less complicated.
@stereomatchingkiss Isn't that a bit what the amalgamation part was for? A simplified inference interface. The last time I used amalgamation (some years ago), it was often broken by updates and not really maintained.
The status of MXNet 2.0 project is tracked at: https://github.com/apache/incubator-mxnet/projects/18. The status for each project will be updated by the contributor who's driving it. If you have more projects that you intend to drive please first discuss here.
Once 1.6 release is complete, we will create a branch for MXNet 1.x for future releases and start using master branch for 2.0 development.
Should we create a new branch for 2.0? I think we are also planning for 1.7.0: https://github.com/apache/incubator-mxnet/issues/16864
In the past we always kept development on the master branch, thus how about branching out 1.7.0 release branch and keeping development on master?
+1 for using the master branch for 2.0 development. I think we need at least 3 branches:
That's what I had in mind. The v1.7.x branch doesn't have to be created until code freeze for 1.7.0
3.1. C-API Clean-up C-API is the foundational API in MXNet that all language bindings depend on.
@szha I'm looking at item 3.1.2. Could you please explain the scope of the C-API? Do you mean the APIs that sit in the src/c_api/ folder?
@TaoLv one promising direction that the community is converging to is the interface based on packed function (motivation as described by @tqchen in https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-567604444). What this means to the project is that the existing c API will be updated to follow the packed function interface.
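To make the idea concrete: a packed-function convention collapses the many specialized C entry points into one generic, name-keyed calling convention. The sketch below is illustrative Python only; the names and type codes are made up and are not the actual MXNet/TVM interface.

```python
# Illustrative sketch of a packed-function calling convention -- all
# names and type codes here are hypothetical, not real MXNet/TVM APIs.

def make_packed(fn):
    """Wrap fn behind a generic convention: arguments arrive as a
    list of (type_code, value) pairs instead of typed C parameters."""
    def packed(args):
        # A real implementation would validate/convert by type code.
        return fn(*[value for _type_code, value in args])
    return packed

# A registry keyed by name replaces one exported C symbol per function.
REGISTRY = {
    "add": make_packed(lambda a, b: a + b),
    "concat": make_packed(lambda a, b: a + b),  # list concatenation
}

def call(name, args):
    """The single generic entry point a language binding would wrap."""
    return REGISTRY[name](args)

print(call("add", [("int", 2), ("int", 3)]))           # 5
print(call("concat", [("list", [1]), ("list", [2])]))  # [1, 2]
```

With this shape, adding a new operator means registering a new packed function rather than changing the C ABI, which is what makes the interface attractive for bindings.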
Is there a plan to remove the cudnn_off argument from neural network operators such as Dropout, Convolution, Pooling, etc.? It creates a few usability issues:
(1) Once a model is exported, users must manually change this flag in all the layers if they want to enable/disable cuDNN.
(2) When cudnn_off is set to true in some layers, the global env variable MXNET_CUDNN_AUTOTUNE_DEFAULT becomes a don't-care. It's very confusing for users to see an error message like "Please turn off MXNET_CUDNN_AUTOTUNE_DEFAULT" when that variable in fact does nothing.
(3) Why did we expose such an implementation detail to users in the first place? At worst, we should just provide a global variable to turn cuDNN on/off in all layers instead of at the operator level.
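The global switch suggested in (3) could look something like the following sketch. All names here are hypothetical plain Python, not actual MXNet APIs; the point is that the cuDNN decision is made at dispatch time rather than baked into each exported layer.

```python
# Hypothetical sketch of a process-wide cuDNN switch: operators consult
# one flag at dispatch time, so exported models carry no per-layer
# `cudnn_off` attribute. Names are illustrative, not real MXNet APIs.

_use_cudnn = True

def set_cudnn_enabled(enabled):
    """Flip cuDNN on/off globally, without touching serialized models."""
    global _use_cudnn
    _use_cudnn = enabled

def convolution(x):
    # Because the decision happens here, env vars like the autotune
    # setting stay meaningful regardless of how the model was exported.
    if _use_cudnn:
        return ("cudnn_conv", x)
    return ("fallback_conv", x)

print(convolution([1.0]))      # ('cudnn_conv', [1.0])
set_cudnn_enabled(False)
print(convolution([1.0]))      # ('fallback_conv', [1.0])
```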
Thanks for this awesome work, it has benefited me a great deal.
Here are some possible disadvantages, listed below:
Good day everyone.
@kalcohol please create a new issue about "static linking lib is (very) far away from easy to use", describing your setup in more detail and potentially suggestions how to improve the user experience.
We use the C++ interface for inference on a sorting machine, but we would also like to provide the users of our machines with an easy, integrated user interface for training new sorting recipes. Currently we use Python or Mathematica scripts, which is far from user-friendly for non-programmers. So we want to use C++ (shielded with a C# wrapper) to provide a custom training environment for non-programmers.
Unfortunately, building the MXNet library with C++ support on a Windows machine with MKL / CUDA is an ongoing nightmare. But we really like MXNet.
@szha I checked some docs and projects about distributed training. Horovod is a project from the Uber team; Gloo is a project from the Facebook team. The basic idea is to use a trick from the HPC field that is more efficient than a traditional parameter server: http://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/?from=timeline There is a tool called Open MPI on which Horovod is based, but I found Open MPI too difficult to configure and use. I also checked Gloo, which seems to use redis in place of Open MPI. I strongly suggest not using Horovod directly, since it is based on Open MPI, which is too complex and old.
I also found that ByteDance has a good project solving the same problem without MPI: https://github.com/bytedance/byteps
Maybe we can integrate the ByteDance solution in the 2.0 roadmap, or we can have an MXNet-internal solution similar to it.
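For readers unfamiliar with the trick in the linked post, ring all-reduce can be simulated in a few lines of plain Python. This is a single-process sketch; a real implementation runs the sends concurrently and overlaps them with computation.

```python
# Single-process simulation of ring all-reduce (sum). Each of n workers
# splits its vector into n chunks; n-1 reduce-scatter steps sum the
# chunks around the ring, then n-1 all-gather steps circulate the
# finished sums. Per-worker bandwidth is independent of n, which is
# why this beats a central parameter server at scale.

def ring_allreduce(worker_data):
    n = len(worker_data)
    size = len(worker_data[0])
    assert size % n == 0, "pad vectors so they split into n equal chunks"
    c = size // n
    chunks = [[list(v[i * c:(i + 1) * c]) for i in range(n)]
              for v in worker_data]

    # Reduce-scatter: at step s, worker w sends chunk (w - s) % n to
    # worker (w + 1) % n, which adds it into its own copy.
    for s in range(n - 1):
        for w in range(n):
            i = (w - s) % n
            src, dst = chunks[w][i], chunks[(w + 1) % n][i]
            for k in range(c):
                dst[k] += src[k]

    # All-gather: each worker now owns one fully reduced chunk and
    # forwards it (and the ones it receives) around the ring.
    for s in range(n - 1):
        for w in range(n):
            i = (w + 1 - s) % n
            chunks[(w + 1) % n][i] = list(chunks[w][i])

    return [sum(chunks[w], []) for w in range(n)]

print(ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
# [[11, 22, 33, 44], [11, 22, 33, 44]]
```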
@lilongyue the integration of bytePS to mxnet is in this PR https://github.com/apache/incubator-mxnet/pull/17555
That's great!
A quick comment: DGL contains all sampling implementation and no longer relies on the implementation in MXNet. I think we should deprecate the graph sampling implementation in MXNet.
@szha is there a recent estimate on the timeline for MXNet 2.0? Would you recommend to develop downstream toolkits (e.g. Sockeye) against the master branch now or rather wait a little bit longer? Is there already documentation on how to transition MXNet 1.x projects to 2.x?
@fhieber we are planning to release the first public beta sometime in August. At the moment we are finalizing some API changes and validating them in GluonNLP. We will publish a transition doc as part of the public beta.
@szha Do we need to add moving the AMP package from contrib to core? We will file an RFC for this task.
@szha I found an inconvenient thing: there is no `concat` layer for Gluon. Is it possible to add a `concat` layer for Gluon?
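For context, such a layer would mostly be a thin wrapper around the existing concat operator. Below is a minimal sketch of the semantics only, using nested Python lists in place of ndarrays; the `Concat` class is hypothetical, not an actual Gluon API.

```python
# Hypothetical sketch of a Concat "layer": configured once with an axis,
# then applied to inputs -- mimicking the gluon block calling style.
# Plain nested lists stand in for ndarrays so the example is self-contained.

class Concat:
    def __init__(self, axis=0):
        self.axis = axis

    def __call__(self, *arrays):
        if self.axis == 0:          # stack the rows of all inputs
            return [row[:] for a in arrays for row in a]
        if self.axis == 1:          # join corresponding rows element-wise
            return [sum((a[i] for a in arrays), [])
                    for i in range(len(arrays[0]))]
        raise ValueError("sketch supports 2-D inputs, axis 0 or 1 only")

concat = Concat(axis=1)
print(concat([[1, 2], [3, 4]], [[7], [8]]))  # [[1, 2, 7], [3, 4, 8]]
```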
Making `MXNET_SAFE_ACCUMULATION=1` the default when running on float16 would be very convenient!
+1 for turning it on by default.
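To illustrate why the flag matters, here is a stdlib-only sketch of the difference, using `struct`'s IEEE half-precision format (`'e'`) to simulate a float16 accumulator:

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE half-precision value (struct's 'e'
    format), simulating a float16 register."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Naive accumulation keeps the running sum itself in float16: once the
# sum reaches 2048, adding 1.0 no longer changes it, because float16
# spacing at 2048 is 2 and the tie rounds back to even.
naive = 0.0
for _ in range(100000):
    naive = to_fp16(naive + 1.0)

# Safe accumulation (the behaviour MXNET_SAFE_ACCUMULATION=1 requests
# for fp16 reductions) keeps the accumulator in a wider type; only the
# inputs are float16.
safe = 0.0
for _ in range(100000):
    safe += to_fp16(1.0)

print(naive)  # 2048.0
print(safe)   # 100000.0
```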
I have made good progress with the C# version for the v2 changes. I have implemented most of the numpy operators in v2 to date and am now updating the Gluon interface to match the latest Python version and use the numpy APIs. Can we include/promote this project from the main website to attract more contributors?
@deepakkumar1984 awesome work, thanks for contributing to the ecosystem! I think we can definitely highlight it in the ecosystem page as a community project. Feel free to send a pull request to add it there. If you are interested, once it gets close to completion, we could also publish a blog to attract more attention.
How do you envision the codebase being maintained and hosted going forward?
Thanks @szha, I will start working on the PR to highlight it on the ecosystem page. I have started writing some tutorials, e.g. https://mxnet.tech-quantum.com/docs-2/getting-started/create-a-neural-network/, but would prefer in future to maintain these docs like those of the other bindings, e.g. https://mxnet.apache.org/versions/2.0/api/csharp. MxNet Sharp is more than just a binding of the APIs; I implemented the Gluon package in version 1.5 itself and am now in the process of upgrading it. gluon.probability will also be implemented after the Gluon interface is complete.
I would be happy if the core project MxNet.Sharp could be merged into the main repo, something like: https://github.com/apache/incubator-mxnet/csharp-package
I have other projects making small steps, like GluonCV, GluonNLP, GluonTS, AutoGluon, and a scikit-learn (MxNet version). I can separate them from my branch, keep them with me for now, and start linking them on the ecosystem page as they are completed one by one.
Cpp-package will be added back in #20131. As this language binding will still rely on symbolic programming, some of the module-like APIs removed in #18531 will also be added back. So we may need to support these module APIs for some language bindings, especially for cpp-package. @szha @leezu
Overview
Status: https://github.com/apache/incubator-mxnet/projects/18 The status for each project will be updated by the contributor who's driving it. If you have more projects that you intend to drive please first discuss here.
The purpose of this RFC is to organize and present the roadmap towards 2.0. As 2.0 will be a major release, changes that would break backward compatibility are permissible.
The proposed changes in this RFC are either collected from past roadmap discussions such as #9686, or are based on various common issues from the past. This RFC organizes these changes into self-contained projects to facilitate clear project definitions, and captures the risks and status quo to the best of our knowledge. To help navigate, the projects are further divided into several high-level areas. Some of the listed projects are already in progress and are included to provide a clear overview.
The objectives of Apache MXNet 2.0 include:
In terms of frontend, this roadmap focuses mostly on Python-frontend since MXNet has been taking a Python-first approach. The expectation with respect to other language bindings is that they would evolve along with the backend evolution and make use of the improvements. Given that breaking changes can occur, maintainers of different language bindings are expected to participate in related interface definition discussions.
1. MXNet NP Module
NumPy has long been established as the standard math library in Python, the most prevalent language in the deep learning community. Built on this cornerstone, it now has the largest ecosystem and community for scientific computing. The popularity of NumPy comes from its flexibility and generality.
In #14253, the MXNet community reached consensus on moving towards a NumPy-compatible programming experience and committed to a major endeavor of providing NumPy-compatible operators.
The primary goal of the projects below is to provide the equivalent usability and expressiveness of NumPy in MXNet to facilitate Deep Learning model development, which not only helps existing deep learning practitioners but also provides people in the existing NumPy community with a shortcut for getting started in Deep Learning. The efforts towards this goal would also help a secondary goal, which is to enable the existing NumPy ecosystem to utilize GPUs and accelerators to speed up large scale computation.
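As a concrete illustration of the compatibility target, the idioms below are plain NumPy; the goal described above is that such lines work unchanged against MXNet's np module (boolean indexing being project 1.6 below):

```python
# Everyday NumPy idioms that a NumPy-compatible MXNet front end is
# expected to reproduce: reductions along an axis, boolean indexing,
# and broadcasting. Shown here with NumPy itself.
import numpy as np

x = np.arange(6).reshape(2, 3)          # [[0 1 2], [3 4 5]]
print(x.mean(axis=0))                   # [1.5 2.5 3.5]
print(x[x > 2])                         # [3 4 5]   (boolean indexing)
print((x + np.array([10, 20, 30])).tolist())
# [[10, 21, 32], [13, 24, 35]]          (broadcasting)
```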
1.1. NumPy Operator Testing
Scope:
1.2. NumPy Operator performance profiling
Scope:
1.3. NumPy operator coverage
Scope:
Operator coverage as of 07/03/2019
1.4. NumPy Extension Operator Reorganization and Renaming
Scope:
1.5. NumPy ndarray type extension
Scope:
1.6. NumPy ndarray boolean indexing
Scope:
1.7. Hybridizable basic (and advanced) indexing
Scope:
Note: Preliminary work: https://github.com/apache/incubator-mxnet/pull/15663
2. Graph Enhancement and 3rdparty support
The objective of the following projects is to enable easier development of third-party extensions without requiring changes to be checked in the MXNet project. Examples of such extensions include third-party operator library and accelerators.
2.1. Graph Partitioning for Dynamic Shape Operators
Scope:
2.2. Improved Third-party Operator Support
Scope:
2.3. Improved Third-party Backend Support (subgraph property)
Scope:
2.4. Large tensor support by default
Scope:
Risks:
Notes: in progress (RFC: https://lists.apache.org/thread.html/df53b8c26e9e0433378dd803baba9fec4dd922728a5ce9135dc164b3@%3Cdev.mxnet.apache.org%3E)
3. API Changes
The objective of the following projects is to address the technical debts accumulated during the development of MXNet 0.x and 1.x with respect to the API definition.
3.1. C-API Clean-up
C-API is the foundational API in MXNet that all language bindings depend on.
Scope:
Risks:
3.2. Unify Executor
Scope:
3.3. Gradient of Gradient support
Scope:
Risks:
3.4. Autograd Extension
Scope:
3.5. NNVM-backend Operator Interface Changes
Scope:
Risks:
4. Gluon 2.0
Since the introduction of the Gluon API, it has superseded the other APIs for model development, such as the symbolic API and the model API. Conceptually, Gluon is the first attempt in the deep learning community to unify the flexibility of imperative programming with the performance benefits of symbolic programming, through trace-based just-in-time compilation.
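The trace-based just-in-time compilation idea can be illustrated with a toy sketch (plain Python, not MXNet internals): the first call runs imperatively while recording each primitive op; later calls replay the recorded graph without re-executing Python control flow.

```python
# Toy trace-based hybridization: hypothetical names, not MXNet code.

class Tracer:
    """Stands in for an input; records each op applied to it."""
    def __init__(self, graph, name):
        self.graph, self.name = graph, name
    def _emit(self, op, other):
        out = Tracer(self.graph, f"t{len(self.graph)}")
        rhs = other if isinstance(other, (int, float)) else other.name
        self.graph.append((op, self.name, rhs, out.name))
        return out
    def __add__(self, other):
        return self._emit("add", other)
    def __mul__(self, other):
        return self._emit("mul", other)

def hybridize(fn):
    graph = []
    state = {"traced": False, "out": None}
    def run(x):
        if not state["traced"]:
            # First call: run fn imperatively on a Tracer to record ops.
            state["out"] = fn(Tracer(graph, "x")).name
            state["traced"] = True
        # Every call: replay the recorded graph on the concrete input.
        env = {"x": x}
        for op, a, b, out in graph:
            av = env[a] if isinstance(a, str) else a
            bv = env[b] if isinstance(b, str) else b
            env[out] = av + bv if op == "add" else av * bv
        return env[state["out"]]
    return run

f = hybridize(lambda x: x * 2 + 1)
print(f(3))   # 7
print(f(10))  # 21
```

This is the essence of the Block/HybridBlock split the projects below aim to unify: the same user code should run in both the recording and the replayed mode.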
The objectives of the following projects are:
4.1. Unifying symbolic and imperative mode for tensor library
Scope:
4.2. Unifying Block and HybridBlock
Scope:
4.3. Gluon Block Enhancement
Scope:
4.4. Enable Symbolic Shape (& Dtype) for Array Creation in NNVM-backend
Scope:
4.5. Gluon Distributions Module
Scope:
4.6. Gluon Metrics Module
Scope:
4.7. Gluon Optimizer Module
Scope:
4.8. Gluon Data API Extension and Fixes
Scope:
4.9. Gluon Estimator Extension for Experimenting Utilities
Scope:
4.10. Gluon Estimator Refactoring for Examples and Tutorials
Scope:
4.11. Gluon Distributed Training Usability Enhancement
Scope:
5. Documentation
Documentation is the most important factor for new adoption of a library. The following projects aim to:
5.1. MXNet 2.0 Migration Guide
Scope:
Risks:
5.2. MXNet 2.0 Developer Guide
Scope:
5.3. Adopt beta.mxnet.io as official website
Scope:
Note: https://github.com/ThomasDelteil/mxnet.io-v2
6. Profiling and Debugging
Profiling and debugging are common steps in the development of deep learning models, and proper tools can significantly improve developers' productivity. The objective of these projects is to provide such tools to make it easier to discover issues in the correctness and performance of models.
6.1. Memory Profiler
Scope:
6.2. Enhanced Debugging Tool
Scope:
7. Advanced Operators
The objective of these projects is to extend the tensor library and operators for better performance and for advanced use.
7.1. Strided ndarray support
Scope:
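For context, "strided" means an ndarray is a flat storage buffer plus a (shape, strides) pair, so views such as transpose or step-slicing only rewrite the strides and copy no data. A minimal sketch over a Python list (illustrative names, not MXNet code):

```python
# Minimal strided-view sketch: element (i, j) lives at
# offset + i*strides[0] + j*strides[1] in the flat buffer.

class StridedView:
    def __init__(self, data, shape, strides, offset=0):
        self.data, self.shape = data, shape
        self.strides, self.offset = strides, offset
    def __getitem__(self, idx):
        i, j = idx
        return self.data[self.offset + i * self.strides[0]
                         + j * self.strides[1]]
    def transpose(self):
        # Swap shape and strides; same storage, zero copy.
        return StridedView(self.data, self.shape[::-1],
                           self.strides[::-1], self.offset)

data = list(range(6))                  # 2x3 row-major matrix
a = StridedView(data, (2, 3), (3, 1))
print(a[1, 2])            # 5
t = a.transpose()         # a 3x2 view of the same buffer
print(t[2, 1])            # 5
```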
7.2. Ragged ndarray and operators
Scope:
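One common layout for ragged data (a flat values buffer plus row offsets, as used e.g. by TensorFlow's RaggedTensor) can be sketched as follows; the class name is illustrative, not a proposed MXNet API:

```python
# Ragged rows stored as one flat values buffer plus row offsets:
# row i occupies values[offsets[i]:offsets[i+1]].

class Ragged:
    def __init__(self, rows):
        self.values, self.offsets = [], [0]
        for r in rows:
            self.values.extend(r)
            self.offsets.append(len(self.values))
    def row(self, i):
        return self.values[self.offsets[i]:self.offsets[i + 1]]

r = Ragged([[1, 2, 3], [4], [5, 6]])   # rows of different lengths
print(r.row(0))   # [1, 2, 3]
print(r.row(1))   # [4]
print(r.offsets)  # [0, 3, 4, 6]
```

Operators over such an array (e.g. per-row reductions) iterate over the offsets rather than assuming a rectangular shape.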
7.3. Improved Sparse Support
Scope:
Minimum support:
Next-level support:
8. Building and Configuration
8.1. CMake improvement and Makefile deprecation
Scope:
8.2. MXNet Configurator
Scope:
9. Advanced training and deployment
9.1. Automatic Quantization and Quantized Training for NumPy
Scope:
9.2. Mobile and edge-device deployment
Scope:
10. Performance
10.1. MXNet Execution Overhead
Scope: