szha commented 6 years ago

Let's start a discussion here about the roadmap towards MXNet 2.0. We are looking for:

New features that are useful to your research and development.
Improvements and patches to existing features.
APIs that should be fixed.

If you have any item that you'd like to propose to have in the roadmap, please do:

Create (or locate existing) issue for the item, note the issue number.
Comment in this issue: 1) the above issue number, 2) one sentence of what the item is about and why it's useful to you.
Indicate whether you'd be willing to help out on the item.

Given that this would be a major release, we'd have the opportunity to make backward incompatible changes. This would allow us to visit some topics that require large changes such as dropping support for python2, transitioning fully to cmake, making the tensor library numpy-compatible, or even new programming models.

Now that we decided to follow semantic versioning for releases, it would be a good idea to coordinate features and API changes to make the best use of the next major release. Thus, I propose that we use this issue to track the APIs we'd like to change in the next major version.

The candidates I've collected so far:

remove legacy ops such as batch-norm v1
reorganizing namespace for utility functions such as download in #9671
transform argument in the constructor of existing vision dataset API.

Once there are more of such requests, I will try to organize these API-breaking requests better.

bhavinthaker commented 6 years ago

Do we have sufficient automated testing to catch accidental lapses?

If not, can we have a volunteer to work on writing these automated test-cases? How do we track this task?

larroy commented 6 years ago

Refactors of the cpp-package and other C++ APIs. I would like that.

fhieber commented 6 years ago

consistent use of "axis"/"dim(s)" keyword arguments in all operators, for example:
- swapaxes uses dim1, dim2
- expand_dims uses axis
expose optimizer in Module API

anirudhacharya commented 6 years ago

@sandeep-krishnamurthy Please tag this - API Change, Call for Contribution, Roadmap.

szha commented 6 years ago

9881

eric-haibin-lin commented 6 years ago

kvstore should not be public API

szha commented 6 years ago

we should merge element wise ops with broadcast ops and dispatch the different implementation only based on the shape, so that symbol and ndarray +-*/ are consistent

szha commented 6 years ago

contrib.ctc_loss should make into supported operator.

szha commented 6 years ago

11031

szha commented 6 years ago

10807

RogerChern commented 6 years ago

fix_gamma=False for mxnet.symbol.BatchNorm

szha commented 6 years ago

11141

szha commented 6 years ago

11134

szha commented 6 years ago

Gluon RNN layer parameters are currently saved through unfused cells, causing the name to be something like "_unfused.0.l_cell.weight". This caused trouble in #11482 when I removed unfused cells. The workaround is to override _collect_params_with_prefix function to add the prefix. In 2.0, we should:

[ ] remove the _collect_params_with_prefix in Gluon RNN layer
[ ] write a converter for parameter formats
[ ] start versioning parameter files.

szha commented 6 years ago

11953

szha commented 6 years ago

12197 in using integer types for index instead of float.

anirudhacharya commented 6 years ago

Taking a brief look at Data Iterators, it would seem the iterators are split up between the mx.io module and mx.image module. And there does not seem to be any method/process( correct me if I am wrong) in the split. For instance,

ImageRecordIter and ImageRecordUInt8Iter( along with CSVIter, NDArrayIter, LibSVMIter) are under mx.io
whereas ImageIter and ImageDetIter are under mx.image module

And

There are image processing functions like imdecode, scale_down, resize_short etc.. under mx.image here - http://mxnet.incubator.apache.org/api/python/image/image.html#image-processing-functions
And there are very similar image augmenter functions like - ResizeAug, CenterCropAug, etc.. here - http://mxnet.incubator.apache.org/api/python/image/image.html#image-iterators

Is there any specific reason for this kind of design? It might be good to take a relook at this and reorganize this, even if it leads to breaking a few APIs.

And there is similar functionality in the gluon interface too( and I am not including that in this discussion).

anirudhacharya commented 6 years ago

3. transform argument in the constructor of existing vision dataset API.

What is the proposed change here? is the plan to remove transform as an argument to the constructor?

szha commented 6 years ago

@anirudhacharya yes, because dataset interface has a .transform method that serves the same purpose but strictly more flexible.

zhreshold commented 6 years ago

@anirudhacharya

I can see your concern, but iterators included in mxnet.image are specifically designed for images and serve as compliments to the purpose agnostic iterators in mxnet.io.

Same stories apply to all these image transformation functions provided in mxnet.image, users basically can use them as an opencv alternative in order to process mx.nd.NDArray instead of numpy.array

anirudhacharya commented 6 years ago

@zhreshold but ImageRecordIter and ImageRecordUInt8Iter which are image specific are defined under mx.io.

With regards to image transforms, I was thinking the symbolic interface should also have something similar to the interface available in gluon-CV transforms - https://gluon-cv.mxnet.io/api/data.transforms.html, which is very intuitive and not cluttered. Because we have users who have gone to production using MXNet's symbolic interface. We can discuss this in-person, it will be better.

kohr-h commented 6 years ago

My 2 cents: The logically flawed all-integer indexing should be fixed:

>>> arr = mx.nd.ones((2, 3))
>>> arr[0, 0]  # what it is currently
[1.]
<NDArray 1 @cpu(0)>
>>> arr[0, 0]  # what it should be
1.0
<NDArray  @cpu(0)>

In the same way, an array of shape () should be convertible to scalar via asscalar(), which it currently isn't.

See the discussion here for context.

szha commented 6 years ago

@kohr-h yes, having 0th order tensor support is important. Regarding the approach, we may not want to implement it as an asscalar() call, since that's a blocking call. Instead, we should extend the tensor definition so that calculation involving such scalar can still be asynchronous.

kohr-h commented 6 years ago

@szha Absolutely, the more natural the support for 0-th order tensors, the better.

My remark was more aimed at the current exception that you get when calling

>>> arr[0, 0].reshape(()).asscalar()
...
ValueError: The current array is not a scalar

That could actually be fixed already now, unless there is code that expects this to fail.

szha commented 5 years ago

In 2.0, the CAPI should not expose backend-specific types, such as mkldnn::memory.

szha commented 5 years ago

~log_softmax return type https://github.com/apache/incubator-mxnet/pull/14098#pullrequestreview-201770450~ (resolved with optional dtype argument)

szha commented 5 years ago

https://lists.apache.org/thread.html/a969d92e32f39e9540f3afd3d3a594efb0591083669a79e1accd02d4@%3Cdev.mxnet.apache.org%3E

wkcn commented 5 years ago

I have some suggestions.

Custom Operator for deployment Currently, MXNet doesn't support the custom operator for deployment unless rebuilding from the source. Although there is the MXCustomOpRegister API, it is inconvenient and does not provide mshadow::Stream, which is important for executing an operator asynchronously. We need an approach (easy to write and compile) to support custom operator for deployment.
Refactor of the C++ package I hope that the syntax of the C++ package is the same as that of the Python package, and it will be more friendly for C++ users.

Edit: I wrote a prototype of Custom Operator Class. https://github.com/wkcn/DLCustomOp

hkingtswcbyy commented 5 years ago

Mxnet-lite based on NNAPI and quantization tools.

kohillyang commented 5 years ago

+1 for a better custom OP API

larroy commented 5 years ago

Would it be good to be able to to add tag / name to ndarrays so when executing a graph you know which array is what. This would be an optional name as a string to the ndarray constructor. It will also pass this name inside the graph. For example when marking variables for autograd we are just setting: nnvm::Symbol::CreateVariable("var" + str_c).outputs[0].node, 0, 0};

which is just a number, not very useful. also similar with node_17. It doesn't help debugging.

larroy commented 5 years ago

@wkcn what do you mean "For deployment" WRT custom op?

rondogency commented 5 years ago

https://github.com/apache/incubator-mxnet/issues/12681 Gluon batch_norm parameter name running_xxx should be renamed to moving_xxx for consistency

apeforest commented 5 years ago

topk operator: ret_typ argument should be return_type. Also, return_type should be int64 not float32.

wkcn commented 5 years ago

@larroy If a model which need to be deployed uses custom operator, it's necessary to rewrite the custom operator in C++/C language. It will be better if we can write C++/C custom operator without rebuilding the source of MXNet. Reference: https://cwiki.apache.org/confluence/display/MXNET/Dynamic+CustomOp+Support

NEWPLAN commented 5 years ago

I'd like to share my opinions on the roadmap to build MXNet 2.0.

The organization of source files and namespace are not friendly now for the redundancy and chaos. I have understood the origin design wills that building a simple system with pluggable components (e.g., KV store, NDArray, TVM, etc.). However, this embedding source codes manner may result in chaos and increase the complexity of a system. From my experiences, When I want to build the MXNet in debug mode. the packed dynamic libraries always go to failure, because the size of libraries has exceeded the system limitation (By default, an executable file cannot use more than 2GB storage? it is partially solved by masking the unused CUDA generation codes). I used to look through the source code of Caffe, which is really an art and may be a good template for us to decide how to design a better MXNet in future. For example, can we use the third dependencies as pluggable dynamic libraries, instead of embedding their source code and building together?

I have experienced many DL frameworks such as TensorFlow (another big chaos, too), PyTorch, MXNet, Caffe, etc, and I finally choose the MXNet which I want to contribute to. MXNet has the best performance in terms of design wills, throughput, memory cost, and scalability, but it seems that we are fading away from both users and contributors. I hope we can try our best to make such an extraordinary DL framework great again.

F.Y.I.

KellenSunderland commented 5 years ago

https://github.com/apache/incubator-mxnet/issues/10692 might require an API change to fix.

Edit: reference issue is closed but we may want to think about tweaking the monitor callback api.

hcho3 commented 5 years ago

... such as dropping support for python2

Was there a decision to drop Python 2? If so, we should consider putting our logo to https://python3statement.org/.

cloudhan commented 5 years ago

The CMake for all DMLC projects needs a throughout refactor. It is simply a disaster when integrating with dmlc-core, xgboost, mxnet, etc. Some of them do not have install. None of them export targets, IIRC. This results in the users of these packages being responsible for maintaining the compiler definition and linking dependencies, which should not be the case.

Should bump the cmake_minimum_required to 3.5.1, and make use the target based commands, but some quirk exist. If you guys be more aggressive (install cmake in user local workspace), then 3.12 would be the best choice.

Distro	CMake version	Status
Ubuntu 16.04	3.5.1	Most Popular In DL
Debian 8 “Jessie”	3.0.2	EOL on 2018-06-17, EOL LTS on ~2020-06-30
Debian 9 “Stretch”	3.7.2	Current Stable
CentOS	latest via EPEL

FYI, currently I take dmlc's CMakeLists.txt as the lower bound...

hcho3 commented 5 years ago

@cloudhan The latest version of dmlc-core exports a CMake target. See https://github.com/dmlc/dmlc-core/blob/master/doc/build.md

Same goes for XGBoost: https://github.com/dmlc/xgboost/blob/master/demo/c-api/README.md

aaronmarkham commented 5 years ago

I'd like to see mobile device support. Pre-built binaries that developers can use for Android development. And iOS.

larroy commented 5 years ago

I would like to refactor the operator graph and have it in the MXNet codebase, tracking this proposal in the wiki: https://cwiki.apache.org/confluence/display/MXNET/MXVM%3A+Operator+graph+2.0

cloudhan commented 5 years ago

Maybe make backward propagation computation optional when compile, specifically using for deploying.

arcadiaphy commented 5 years ago

The param file of gluon model and the param file exported from gluon HybridBlock have different naming style. It will be more convenient to have only one of them and just output json symbol file from mx.gluon.HybridBlock.export.

When you have lost gluon model file and only have the exported param file, it's very hard to convert it back.

mouryarishik commented 5 years ago

I'd like to share my opinions on the roadmap to build MXNet 2.0.

There should be the only gluon Sequential api which could be hybridizable instead of having 2 separate Sequential and HybridSequential apis.

A fit method like api for gluon models would be a dream come true. For example:-


model = nn.Sequential()
...
...
...
model.initialize(mx.init.Xavier(), mx.gpu())
model.hybridize()
model.build(loss = 'cross entropy', optimizer = 'adam')
model.train(train_data, val_data, epochs = 10)

Should print something like:-

Epoch(01/10) [=========================>] Training -> Loss: 0.78415 Accuracy: 0.74588 Validation -> Loss: 0.78415 Accuracy: 0.74588

Epoch(02/10) [=========================>] Training -> Loss: 0.64415 Accuracy: 0.79588 Validation -> Loss: 0.70405 Accuracy: 0.75588

Epoch(03/10) [=========> ] Training -> Loss: 0.58475 Accuracy: 0.82588 Validation -> Loss: 0.68454 Accuracy: 0.79588



- Computing higher order gradient is the only critically lacking feature of MXNet.
- The official website for MXNet lags like hell. A better looking and good official website is desperately needed. 
  There is a [beta version of the new MXNet website](https://beta.mxnet.io/).  
  But it's in beta from almost something like two years. What's taking so long!! Websites for other frameworks like Tensorflow, Pytorch are updated for more than like 5 times within just a year!!

BTW Thanks to everyone for all the contributions so far. MXNet is by far my favourite framework(after using Tensorflow(1.x - 2.0) and Pytorch). I'm currently busy in some personal stuff, but will definitely contribute after some time.

Thanks for your time.

szha commented 5 years ago

MXNet is by far my favourite framework

Thanks for the encouraging words! It makes all the efforts worthwhile. The items listed are indeed good suggestions (and some of them are already available (: )

There should be the only gluon Sequential api which could be hybridizable instead of having 2 separate Sequential and HybridSequential apis.

HybridSequential is a HybridBlock which only allows HybridBlock children and hence the current design.

A fit method like api for gluon models would be a dream come true.

There's an estimator class with the fit function in contrib now: https://github.com/apache/incubator-mxnet/pull/15009/files#diff-7f58d4d4cb6c2e6088afa89097fbb7e3R34. It can be extended to support progress bar (cc @roywei)

Computing higher order gradient is the only critically lacking feature of MXNet.

Yes, some community members are pooling efforts and working on it (cc @apeforest). Proposal can be found here: https://cwiki.apache.org/confluence/display/MXNET/Higher+Order+Gradient+Calculation

The official website for MXNet lags like hell.

(cc @aaronmarkham) Is it due to network or is the javascript not running smoothly? If the former we're more than happy to look into the CDN options. Just let us know where you're accessing the website from.

mouryarishik commented 5 years ago

There's an estimator class with the fit function in contrib now: https://github.com/apache/incubator-mxnet/pull/15009/files#diff-7f58d4d4cb6c2e6088afa89097fbb7e3R34. It can be extended to support progress bar (cc @roywei)

Thanks for pointing out that fit api is already available.

Is it due to network or is the javascript not running smoothly?

It's the javascript that always says something like "Processing math: 10%". I'd like to see the beta version as the official website of MXNet, it's more faster and has better looking UI.

Thanks for the encouraging words! It makes all the efforts worthwhile. The items listed are indeed good suggestions (and some of them are already available (: )

Anytime 🍻

mouryarishik commented 5 years ago

I'd like to give some suggestions to improve Estimator experience.

The existing "evaluate" method should be renamed to "_evaluate" and a new "evaluate" should be defined for users to quickly evaluate the model on some test data. For example:-
```
estimator.evaluate(test_data, [mx.metric.Accuracy(), mx.metric.TopKAccuracy()])
```
Currently does the evaluation internally and update the provided metrics but "prints nothing". Should print something like:-
```
accuracy: 0.97852
top_k_accuracy_3: 0.12589
```

If the user has provided no eval_data, then print only training loss and accuracy(or other metrics if provided)

estimator.fit(train_data = train_data, epochs = 10)

Currently prints:

[Epoch 0] Finished in 13.548s, train accuracy: 0.3320, train softmaxcrossentropyloss: 1.7761, validation accuracy: nan, validation softmaxcrossentropyloss: nan

Should print

[Epoch 0] Finished in 13.548s, train accuracy: 0.3320, train softmaxcrossentropyloss: 1.7761

If the user has provided eval_data then instead of printing train and eval metrics in one line(which makes difficult to read), should print in two separate 2 lines. .fit(....) currently prints:

[Epoch 0] Finished in 16.062s, train accuracy: 0.7316, train softmaxcrossentropyloss: 0.7813, validation accuracy: 0.7237, validation softmaxcrossentropyloss: 0.8057

Should print

Epoch (01/10):
Training   ->  Loss: 0.78134  Accuracy: 0.73165 
Validation ->  Loss: 0.78415  Accuracy: 0.80577

Printing time taken per epoch actually kills performance by 20%, so I don't think that there is any need to print the time taken by the model to train per epoch unless provided by the user(we can add an additional argument "benchmark" to .fit call, False by default, if True only then print the time taken per epoch and total time taken.
If the user has used no learning rate scheduler then there is no need to print [Epoch 5] Begin, current learning rate: 0.0010 each iteration.
There should be a "history" method that returns all list of losses and metrics values we have encountered while training so that user can visualize them by plotting them on some visualization framework like matplotlib. For example:-
```
history = estimator.history()
```

plt.plot(history[0]) plt.title('Training Loss')

plt.plot(history[1]) plt.title('Training Accuracy')


Thanks again.

roywei commented 5 years ago

Hi @mouryarishik Thank you so much for such detailed suggestions, really appreciate it! I think you have valid points and I will work on the improvements. I will be working on this in my free time, and also try to add some new features. Tracked in https://issues.apache.org/jira/browse/MXNET-1333. Any contribution is welcome.

hkingtswcbyy commented 5 years ago

A light weight, highly optimized mxnet-lite package just like tensorflow-lite is really desired for converting academic researches to products. As well as quantization tools and model file format that supports save the model parameters as int8. Because for mobile device applications, package size and model file size are even more important than computation speed. For the best of my knowledge, tf (along with tf-lite) is the only one that have such abilities. Therefore, if I want my products implemented on both severs and mobile devices, tf will be the only one I will choose, because I really don't want to maintain two sets of codes.

apache / mxnet

[Discussion] MXNet 2.0 Roadmap (was: APIs that might be a good idea to break in 2.0) #9686

9881

11031

10807

11141

11134

11953

12197 in using integer types for index instead of float.