Do we have sufficient automated testing to catch accidental lapses?
If not, can we have a volunteer to work on writing these automated test-cases? How do we track this task?
Refactors of the cpp-package and other C++ APIs. I would like that.
@sandeep-krishnamurthy Please tag this - API Change, Call for Contribution, Roadmap.
kvstore should not be a public API
we should merge element-wise ops with broadcast ops and dispatch to the appropriate implementation based only on shape, so that symbol and ndarray +-*/ are consistent (see the sketch after this list)
contrib.ctc_loss should be promoted to a fully supported operator.
fix_gamma should default to False for mxnet.symbol.BatchNorm
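A minimal sketch of the element-wise/broadcast inconsistency mentioned above, as it behaves in 1.x (the exact error type may vary by version):

```python
import mxnet as mx

# NDArray '+' dispatches to a broadcasting add, so broadcastable
# shapes work:
a = mx.nd.ones((2, 3))
b = mx.nd.ones((1, 3))
print((a + b).shape)  # (2, 3)

# Symbol '+' maps to an element-wise add, so the same shapes fail:
x = mx.sym.Variable('x')
y = mx.sym.Variable('y')
z = x + y
z.infer_shape(x=(2, 3), y=(1, 3))  # raises a shape-inference error in 1.x
```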
Gluon RNN layer parameters are currently saved through unfused cells, causing the name to be something like "_unfused.0.l_cell.weight". This caused trouble in #11482 when I removed unfused cells. The workaround is to override the _collect_params_with_prefix function to add the prefix. In 2.0, we should save these parameters directly under their fused names instead of relying on this workaround.
Taking a brief look at the data iterators, it seems they are split between the mx.io module and the mx.image module, and there does not seem to be any rationale (correct me if I am wrong) behind the split.
Is there any specific reason for this kind of design? It might be good to take another look and reorganize this, even if it leads to breaking a few APIs.
There is also similar functionality in the Gluon interface (which I am not including in this discussion).
3. transform argument in the constructor of existing vision dataset API.
What is the proposed change here? Is the plan to remove transform as an argument to the constructor?
@anirudhacharya yes, because the dataset interface has a .transform method that serves the same purpose but is strictly more flexible.
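For example, a minimal sketch of the two styles (constructor argument vs. chained method; the commented-out line shows the 1.x style slated for removal):

```python
from mxnet.gluon.data.vision import MNIST, transforms

# 1.x style, via the constructor argument slated for removal:
# dataset = MNIST(train=True,
#                 transform=lambda data, label: (data.astype('float32') / 255, label))

# The more flexible replacement: chain a transform onto any Dataset.
dataset = MNIST(train=True).transform_first(transforms.ToTensor())
```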
@anirudhacharya
I can see your concern, but the iterators included in mxnet.image are specifically designed for images and serve as complements to the purpose-agnostic iterators in mxnet.io.
The same story applies to the image transformation functions provided in mxnet.image: users can basically use them as an OpenCV alternative to process mx.nd.NDArray instead of numpy.array.
@zhreshold but ImageRecordIter and ImageRecordUInt8Iter, which are image-specific, are defined under mx.io.
With regards to image transforms, I was thinking the symbolic interface should also have something similar to the interface available in the GluonCV transforms (https://gluon-cv.mxnet.io/api/data.transforms.html), which is intuitive and uncluttered, because we have users who have gone to production with MXNet's symbolic interface. It will be better to discuss this in person.
My 2 cents: The logically flawed all-integer indexing should be fixed:
```
>>> import mxnet as mx
>>> arr = mx.nd.ones((2, 3))
>>> arr[0, 0]  # what it is currently
[1.]
<NDArray 1 @cpu(0)>
>>> arr[0, 0]  # what it should be
1.0
<NDArray @cpu(0)>
```
In the same way, an array of shape () should be convertible to a scalar via asscalar(), which it currently isn't.
See the discussion here for context.
@kohr-h yes, having 0th order tensor support is important. Regarding the approach, we may not want to implement it as an asscalar() call, since that's a blocking call. Instead, we should extend the tensor definition so that calculations involving such scalars can still be asynchronous.
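For reference, NumPy already has the target semantics here; a minimal sketch (NumPy, not MXNet):

```python
import numpy as np

arr = np.ones((2, 3))
s = arr[0, 0]    # a shape-() scalar, not a 1-element array
print(s.shape)   # ()
print(s + 1.0)   # 2.0 -- usable in further computation directly
```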
@szha Absolutely, the more natural the support for 0-th order tensors, the better.
My remark was more aimed at the current exception that you get when calling:
```
>>> arr[0, 0].reshape(()).asscalar()
...
ValueError: The current array is not a scalar
```
That could actually be fixed already now, unless there is code that expects this to fail.
In 2.0, the C API should not expose backend-specific types such as mkldnn::memory.
~log_softmax return type https://github.com/apache/incubator-mxnet/pull/14098#pullrequestreview-201770450~ (resolved with optional dtype argument)
I have some suggestions.
Custom operators for deployment
Currently, MXNet doesn't support custom operators for deployment without rebuilding from source.
Although there is the MXCustomOpRegister API, it is inconvenient and does not provide mshadow::Stream, which is important for executing an operator asynchronously.
We need an approach (easy to write and compile) to support custom operators for deployment.
Refactor of the C++ package
I hope that the syntax of the C++ package can be the same as that of the Python package, which would make it friendlier for C++ users.
Edit: I wrote a prototype of a custom operator class: https://github.com/wkcn/DLCustomOp
MXNet-lite based on NNAPI, and quantization tools.
+1 for a better custom OP API
Would it be good to be able to add a tag/name to ndarrays, so that when executing a graph you know which array is which? This would be an optional name passed as a string to the ndarray constructor, and it would also be carried inside the graph. For example, when marking variables for autograd we are just setting nnvm::Symbol::CreateVariable("var" + str_c).outputs[0].node, where str_c is just a number, which is not very useful; auto-generated names like node_17 are similar. It doesn't help debugging.
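A hypothetical sketch of the proposal; note that the name argument below does not exist in the current NDArray API:

```python
import mxnet as mx

w = mx.nd.ones((4, 4))
w.attach_grad()  # today this is recorded under an auto-generated name
                 # such as 'var3' or 'node_17'
# Proposed (hypothetical): w = mx.nd.ones((4, 4), name='proj_weight'),
# so graph dumps and error messages would say 'proj_weight' instead.
```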
@wkcn what do you mean "For deployment" WRT custom op?
https://github.com/apache/incubator-mxnet/issues/12681 Gluon batch_norm parameter name running_xxx should be renamed to moving_xxx for consistency
topk operator: the ret_typ argument should be renamed to return_type. Also, the returned indices should be int64, not float32.
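For illustration, the current 1.x behavior (a sketch; this is what I observe on recent 1.x builds):

```python
import mxnet as mx

data = mx.nd.array([[0.3, 0.2, 0.4],
                    [0.1, 0.3, 0.2]])
idx = mx.nd.topk(data, k=1, ret_typ='indices')
print(idx.asnumpy())  # [[2.], [1.]]
print(idx.dtype)      # numpy.float32 -- indices come back as floats
```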
@larroy If a model that needs to be deployed uses a custom operator, it's necessary to rewrite the custom operator in C/C++. It would be better if we could write C/C++ custom operators without rebuilding the MXNet source. Reference: https://cwiki.apache.org/confluence/display/MXNET/Dynamic+CustomOp+Support
I'd like to share my opinions on the roadmap to build MXNet 2.0.
The organization of source files and namespaces is not friendly right now because of redundancy and chaos. I understand the original design intent of building a simple system with pluggable components (e.g., KV store, NDArray, TVM, etc.). However, embedding the source code of those components can result in chaos and increases the complexity of the system. In my experience, when I build MXNet in debug mode, linking the packed dynamic libraries often fails because the size of the libraries exceeds the system limit (by default, an executable file cannot exceed 2 GB?); this is partially solved by masking out the unused CUDA code-generation targets. I have looked through the source code of Caffe, which is really an art and may be a good template for deciding how to design a better MXNet in the future. For example, can we consume third-party dependencies as pluggable dynamic libraries, instead of embedding their source code and building everything together?
I have used many DL frameworks such as TensorFlow (another big chaos), PyTorch, MXNet, Caffe, etc., and MXNet is the one I finally chose and want to contribute to. MXNet has the best performance in terms of design, throughput, memory cost, and scalability, but it seems that we are losing both users and contributors. I hope we can do our best to make such an extraordinary DL framework great again.
F.Y.I.
https://github.com/apache/incubator-mxnet/issues/10692 might require an API change to fix.
Edit: the reference issue is closed, but we may want to think about tweaking the monitor callback API.
... such as dropping support for python2
Was there a decision to drop Python 2? If so, we should consider adding our logo to https://python3statement.org/.
The CMake setup for all DMLC projects needs a thorough refactor. It is simply a disaster when integrating with dmlc-core, xgboost, mxnet, etc. Some of them do not have an install target. None of them export targets, IIRC. As a result, users of these packages are responsible for maintaining the compiler definitions and link dependencies themselves, which should not be the case.
We should bump cmake_minimum_required to 3.5.1 and make use of the target-based commands, though some quirks exist. If you want to be more aggressive (installing CMake into the user's local workspace), then 3.12 would be the best choice.
Distro | CMake version | Status |
---|---|---|
Ubuntu 16.04 | 3.5.1 | Most popular in DL |
Debian 8 “Jessie” | 3.0.2 | EOL on 2018-06-17, LTS EOL on ~2020-06-30 |
Debian 9 “Stretch” | 3.7.2 | Current stable |
CentOS | latest via EPEL | |
FYI, currently I take dmlc's CMakeLists.txt as the lower bound...
@cloudhan The latest version of dmlc-core exports a CMake target. See https://github.com/dmlc/dmlc-core/blob/master/doc/build.md
Same goes for XGBoost: https://github.com/dmlc/xgboost/blob/master/demo/c-api/README.md
I'd like to see mobile device support: pre-built binaries that developers can use for Android and iOS development.
I would like to refactor the operator graph and have it in the MXNet codebase, tracking this proposal in the wiki: https://cwiki.apache.org/confluence/display/MXNET/MXVM%3A+Operator+graph+2.0
Maybe make backward-propagation computation optional at compile time, specifically for deployment.
The param file saved from a Gluon model and the param file exported from a Gluon HybridBlock have different naming styles. It would be more convenient to have only one of them and have mx.gluon.HybridBlock.export output just the JSON symbol file.
When you have lost the Gluon model file and only have the exported param file, it's very hard to convert it back.
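A minimal sketch of the mismatch, assuming a stock 1.x Gluon setup (exact key names depend on name scopes and version):

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(2))
net.initialize()
net.hybridize()
net(mx.nd.ones((1, 4)))  # run one forward pass so shapes are known

net.save_parameters('net.params')  # Gluon-style keys, e.g. '0.weight'
net.export('net')  # writes net-symbol.json and net-0000.params with
                   # symbol-style keys, e.g. 'dense0_weight'
```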
I'd like to share my opinions on the roadmap to build MXNet 2.0.
There should be only one Gluon Sequential API that can be hybridized, instead of two separate Sequential and HybridSequential APIs.
A fit-like API for Gluon models would be a dream come true. For example:
```python
model = nn.Sequential()
...
...
...
model.initialize(mx.init.Xavier(), mx.gpu())
model.hybridize()
model.build(loss='cross entropy', optimizer='adam')
model.train(train_data, val_data, epochs=10)
```
Should print something like:
```
Epoch(01/10) [=========================>] Training -> Loss: 0.78415 Accuracy: 0.74588 Validation -> Loss: 0.78415 Accuracy: 0.74588
Epoch(02/10) [=========================>] Training -> Loss: 0.64415 Accuracy: 0.79588 Validation -> Loss: 0.70405 Accuracy: 0.75588
Epoch(03/10) [=========>                ] Training -> Loss: 0.58475 Accuracy: 0.82588 Validation -> Loss: 0.68454 Accuracy: 0.79588
```
- Computing higher-order gradients is the only critically lacking feature of MXNet.
- The official website for MXNet lags like hell. A better-looking, well-functioning official website is desperately needed.
There is a [beta version of the new MXNet website](https://beta.mxnet.io/).
But it's been in beta for almost two years now. What's taking so long?! Websites for other frameworks like TensorFlow and PyTorch have been updated more than five times within just a year!
BTW, thanks to everyone for all the contributions so far. MXNet is by far my favourite framework (after using TensorFlow (1.x–2.0) and PyTorch). I'm currently busy with some personal stuff, but I will definitely contribute after some time.
Thanks for your time.
MXNet is by far my favourite framework
Thanks for the encouraging words! It makes all the efforts worthwhile. The items listed are indeed good suggestions (and some of them are already available (: )
There should be only one Gluon Sequential API that can be hybridized, instead of two separate Sequential and HybridSequential APIs.
HybridSequential is a HybridBlock that only allows HybridBlock children, hence the current design.
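For context, a short sketch of what that restriction looks like in practice (error message paraphrased in the comment):

```python
from mxnet.gluon import nn, Block

class PlainBlock(Block):
    def forward(self, x):
        return x

net = nn.HybridSequential()
net.add(nn.Dense(2))    # fine: Dense is a HybridBlock
net.add(PlainBlock())   # raises ValueError: children of a HybridBlock
                        # must themselves be HybridBlocks
```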
A fit-like API for Gluon models would be a dream come true.
There's an estimator class with the fit function in contrib now: https://github.com/apache/incubator-mxnet/pull/15009/files#diff-7f58d4d4cb6c2e6088afa89097fbb7e3R34. It can be extended to support a progress bar (cc @roywei)
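A rough usage sketch based on the contrib API in that PR; argument names such as trainer and context may have shifted in later versions, and train_loader/val_loader stand in for your own gluon.data.DataLoader instances:

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.contrib.estimator import Estimator

net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(10))
net.initialize(mx.init.Xavier())

loss = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam')

est = Estimator(net=net, loss=loss, metrics=mx.metric.Accuracy(),
                trainer=trainer, context=mx.cpu())
est.fit(train_data=train_loader, val_data=val_loader, epochs=10)
```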
Computing higher-order gradients is the only critically lacking feature of MXNet.
Yes, some community members are pooling efforts and working on it (cc @apeforest). Proposal can be found here: https://cwiki.apache.org/confluence/display/MXNET/Higher+Order+Gradient+Calculation
The official website for MXNet lags like hell.
(cc @aaronmarkham) Is it due to the network or is the JavaScript not running smoothly? If the former, we're more than happy to look into CDN options. Just let us know where you're accessing the website from.
There's an estimator class with the fit function in contrib now: https://github.com/apache/incubator-mxnet/pull/15009/files#diff-7f58d4d4cb6c2e6088afa89097fbb7e3R34. It can be extended to support a progress bar (cc @roywei)
Thanks for pointing out that the fit API is already available.
Is it due to the network or is the JavaScript not running smoothly?
It's the JavaScript: the page always says something like "Processing math: 10%". I'd like to see the beta version become the official MXNet website; it's faster and has a better-looking UI.
Thanks for the encouraging words! It makes all the efforts worthwhile. The items listed are indeed good suggestions (and some of them are already available (: )
Anytime 🍻
I'd like to give some suggestions to improve Estimator experience.
The existing "evaluate" method should be renamed to "_evaluate" and a new "evaluate" should be defined for users to quickly evaluate the model on some test data. For example:-
```python
estimator.evaluate(test_data, [mx.metric.Accuracy(), mx.metric.TopKAccuracy()])
```
Currently this does the evaluation internally and updates the provided metrics but prints nothing. It should print something like:
```
accuracy: 0.97852
top_k_accuracy_3: 0.12589
```
If the user has provided no eval_data, then print only the training loss and accuracy (or other metrics if provided):
```python
estimator.fit(train_data=train_data, epochs=10)
```
Currently prints:
```
[Epoch 0] Finished in 13.548s, train accuracy: 0.3320, train softmaxcrossentropyloss: 1.7761, validation accuracy: nan, validation softmaxcrossentropyloss: nan
```
Should print
```
[Epoch 0] Finished in 13.548s, train accuracy: 0.3320, train softmaxcrossentropyloss: 1.7761
```
If the user has provided eval_data, then instead of printing the train and eval metrics on one line (which makes them difficult to read), they should be printed on two separate lines. .fit(...) currently prints:
```
[Epoch 0] Finished in 16.062s, train accuracy: 0.7316, train softmaxcrossentropyloss: 0.7813, validation accuracy: 0.7237, validation softmaxcrossentropyloss: 0.8057
```
Should print
```
Epoch (01/10):
    Training   -> Loss: 0.78134 Accuracy: 0.73165
    Validation -> Loss: 0.78415 Accuracy: 0.80577
```
Printing the time taken per epoch actually hurts performance by 20%, so I don't think there is any need to print the per-epoch training time unless the user asks for it (we could add a "benchmark" argument to the .fit call, False by default; only if True should the time per epoch and the total time be printed).
If the user has not set a learning rate scheduler, then there is no need to print [Epoch 5] Begin, current learning rate: 0.0010 every epoch.
There should be a "history" method that returns all list of losses and metrics values we have encountered while training so that user can visualize them by plotting them on some visualization framework like matplotlib. For example:-
```python
import matplotlib.pyplot as plt

history = estimator.history()
plt.plot(history[0]); plt.title('Training Loss')
plt.plot(history[1]); plt.title('Training Accuracy')
```
Thanks again.
Hi @mouryarishik Thank you so much for such detailed suggestions, really appreciate it! I think you have valid points and I will work on the improvements. I will be working on this in my free time, and also try to add some new features. Tracked in https://issues.apache.org/jira/browse/MXNET-1333. Any contribution is welcome.
A lightweight, highly optimized mxnet-lite package, just like tensorflow-lite, is really needed for turning academic research into products, along with quantization tools and a model file format that supports saving model parameters as int8. For mobile applications, package size and model file size are even more important than computation speed. To the best of my knowledge, TF (along with TF-Lite) is the only framework with these abilities. Therefore, if I want my products deployed on both servers and mobile devices, TF is the only one I can choose, because I really don't want to maintain two codebases.
Let's start a discussion here about the roadmap towards MXNet 2.0. If you have an item that you'd like to propose for the roadmap, please share it in this thread.
Given that this would be a major release, we'd have the opportunity to make backward incompatible changes. This would allow us to visit some topics that require large changes such as dropping support for python2, transitioning fully to cmake, making the tensor library numpy-compatible, or even new programming models.
Now that we decided to follow semantic versioning for releases, it would be a good idea to coordinate features and API changes to make the best use of the next major release. Thus, I propose that we use this issue to track the APIs we'd like to change in the next major version.
The candidates I've collected so far:
- download in #9671

Once there are more such requests, I will try to organize these API-breaking requests better.