dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[v0.10.x] Softmax optimization & bertpass refactor #1565

Closed: bgawrych closed this 3 years ago

bgawrych commented 3 years ago

This PR adds a graph pass that optimizes the CPU softmax in BERT. Currently, for BERT-large, the length tensor is created by the operation chain expand_dims -> broadcast_axis -> Reshape, and this tensor creation is repeated 24 times. The pass replaces softmax-with-length with a regular softmax over masked input: the mask is created only once and then passed to elemwise_sum to mask the input. Applying the pass in the scripts is optional.
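For intuition, the two forms are numerically equivalent: adding a large negative value to the padded positions before a plain softmax drives their probabilities to zero, which matches softmax-with-length. A minimal numpy sketch of the equivalence (illustrative only; the exact mask value used by the pass is an implementation detail):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(2, 4)   # batch of 2, max sequence length 4
lengths = np.array([3, 2])       # valid tokens per sequence

# reference: softmax-with-length normalizes only the first `length` entries
ref = np.zeros_like(scores)
for i, L in enumerate(lengths):
    ref[i, :L] = softmax(scores[i, :L])

# masked variant: build the additive mask once, apply it via an elementwise
# sum, then run a plain softmax (-1e18 as the mask value is illustrative)
mask = np.where(np.arange(4)[None, :] < lengths[:, None], 0.0, -1e18)
out = softmax(scores + mask)

assert np.allclose(ref, out)
```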

Original: [image: graph with repeated mask-creation subgraphs]

Masked softmax: [image: graph after the pass]

Throughput (samples/s):

| batches | batch_size | fp32 | quantized | quantized + mha_interleave | quantized + mask_softmax | quantized + mask_softmax + mha_interleave |
|---|---|---|---|---|---|---|
| 1000 | 24 | 19.74 | 25.08 | 25.26 | 34.36 | 34.80 |
| 500 | 1 | 14.06 | 16.97 | 18.02 | 20.24 | 21.81 |

Accuracy:

| metric | fp32 | quantized | quantized + mha_interleave | quantized + mask_softmax | quantized + mask_softmax + mha_interleave |
|---|---|---|---|---|---|
| EM | 80.99 | 80.91 | 80.44 | 80.73 | 80.44 |
| F1 | 88.60 | 88.33 | 88.06 | 88.29 | 88.06 |

There is also a bug fix in the interleaved MHA pass. Accuracy without the mha_interleave bug fix: {'exact_match': 79.62157048249763, 'f1': 87.75497143592598}

bartekkuncer commented 3 years ago

LGTM

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1565/d13d37d19e549bb13984e855b9f3e6cb24a4bbc6/index.html

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1565/ad9185846d06baf328878fa7b37a5356a6439c89/index.html

bgawrych commented 3 years ago

@szha Can you help with CI? I'm not sure why it's failing and I don't know how to rerun it. website-build seems to fail on a notebook I haven't edited.

barry-jin commented 3 years ago

Hi @bgawrych, could you try merging with v0.10.x? We ported the new CI settings from v0.x to v0.10.x. Thanks!

bgawrych commented 3 years ago

@barry-jin, @szha there is still an issue with the notebook, which is a little strange as the bert.md file hasn't been changed for a year - should I fix it, or is it a CI issue?

leezu commented 3 years ago

@barry-jin I see the following error in the log, pointing out that there's a CI issue:

[2021-05-10T05:59:14.916Z] umount: /dev/shm: must be superuser to unmount.
[2021-05-10T05:59:14.917Z] mount: /dev/shm: permission denied.
[2021-05-10T05:59:14.917Z] ./gluon_nlp_job.sh: line 33: sudo: command not found
barry-jin commented 3 years ago

> @barry-jin, @szha there is still an issue with the notebook, which is a little strange as the bert.md file hasn't been changed for a year - should I fix it, or is it a CI issue?

Hi @bgawrych, we have ported the changes in bert.md from the v0.x branch; you could try merging with the current v0.10.x.

barry-jin commented 3 years ago

> @barry-jin I see the following error in the log, pointing out that there's a CI issue:
>
> [2021-05-10T05:59:14.916Z] umount: /dev/shm: must be superuser to unmount.
> [2021-05-10T05:59:14.917Z] mount: /dev/shm: permission denied.
> [2021-05-10T05:59:14.917Z] ./gluon_nlp_job.sh: line 33: sudo: command not found

Thanks, I will fix this issue.

bgawrych commented 3 years ago

@barry-jin

[2021-05-19T09:19:56.267Z] Exception occurred:
[2021-05-19T09:19:56.267Z]   File "/workspace/gluon-nlp/docs/conf.py", line 237, in setup
[2021-05-19T09:19:56.267Z]     app.add_javascript('google_analytics.js')
[2021-05-19T09:19:56.267Z] AttributeError: 'Sphinx' object has no attribute 'add_javascript'
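(For context: Sphinx 4.0 removed the long-deprecated add_javascript alias; add_js_file, available since Sphinx 1.8, is the replacement. A minimal sketch of the fix in docs/conf.py:)

```python
# docs/conf.py (sketch)
def setup(app):
    # add_javascript() was removed in Sphinx 4.0;
    # add_js_file() is the replacement (available since Sphinx 1.8)
    app.add_js_file('google_analytics.js')
```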
barry-jin commented 3 years ago

> @barry-jin
>
> [2021-05-19T09:19:56.267Z] Exception occurred:
> [2021-05-19T09:19:56.267Z]   File "/workspace/gluon-nlp/docs/conf.py", line 237, in setup
> [2021-05-19T09:19:56.267Z]     app.add_javascript('google_analytics.js')
> [2021-05-19T09:19:56.267Z] AttributeError: 'Sphinx' object has no attribute 'add_javascript'

Will be fixed in https://github.com/dmlc/gluon-nlp/pull/1575

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1565/1644e8fe25e66b05300042510ced0603d1dd4098/index.html

szha commented 3 years ago

Merged. Thanks @bgawrych!

ptrendx commented 3 years ago

I'm surprised that the elimination of the 24x mask tensor creation gave you any speedup (as opposed to using masked softmax, which should) - MXNet already has a common expression elimination pass (I wrote it: https://github.com/apache/incubator-mxnet/pull/15657). Does that not work for you?
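(For readers unfamiliar with that pass: common expression elimination merges nodes that apply the same operator to the same inputs, so the 24 identical expand_dims -> broadcast_axis -> Reshape chains should collapse to one. A toy sketch of the idea, not MXNet's actual implementation:)

```python
def eliminate_common_expr(nodes):
    """Toy CSE over a list of (op, input_ids, attrs) tuples in topo order.

    Returns a map from each duplicate node id to its canonical node id.
    """
    seen = {}      # (op, canonical_inputs, attrs) -> canonical node id
    replace = {}   # duplicate node id -> canonical node id
    for nid, (op, inputs, attrs) in enumerate(nodes):
        # route inputs through already-merged nodes first
        inputs = tuple(replace.get(i, i) for i in inputs)
        sig = (op, inputs, attrs)
        if sig in seen:
            replace[nid] = seen[sig]   # same op on same inputs: duplicate
        else:
            seen[sig] = nid
    return replace

# the repeated mask pattern: node 0 is the length input, then two copies
# of the expand_dims -> broadcast_axis -> Reshape chain hanging off it
nodes = [("data", (), ()),
         ("expand_dims", (0,), ("axis=1",)), ("broadcast_axis", (1,), ()), ("Reshape", (2,), ()),
         ("expand_dims", (0,), ("axis=1",)), ("broadcast_axis", (4,), ()), ("Reshape", (5,), ())]
print(eliminate_common_expr(nodes))   # {4: 1, 5: 2, 6: 3}
```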

bgawrych commented 3 years ago

@ptrendx I didn't know about this feature, but I wrote this small graph pass to test it:

```cpp
#include "lib_api.h"  // MXNet extension graph API (Graph, Node, MXReturnValue, ...)
#if MX_LIBRARY_VERSION > 7
using namespace mxnet::ext;  // Graph/Node live in this namespace in newer lib_api.h
#endif

#if MX_LIBRARY_VERSION <= 7
MXReturnValue TEST(const std::string& in_graph, const std::string** out_graph,
                   const std::unordered_map<std::string, std::string>& options,
                   const std::unordered_map<std::string, MXTensor>& args,
                   const std::unordered_map<std::string, MXTensor>& aux,
                   const PassResource& res) {
  Graph* g = Graph::fromString(in_graph);
#else
MXReturnValue TEST(mxnet::ext::Graph* g,
                   const std::unordered_map<std::string, std::string>& options) {
#endif

  // Find the mask input (inputs[1]) of the first softmax in the graph.
  Node* commonnode = nullptr;
#if MX_LIBRARY_VERSION <= 7
  for (Node* n : g->nodes) {
#else
  for (int i = 0; i < g->size(); i++) {
    Node* n = g->getNode(i);
#endif
    if (n->op.compare("softmax") == 0) {
      commonnode = n->inputs[1].node;
      break;
    }
  }

  // Rewire every softmax to share that single mask node, so the duplicated
  // mask-creation subgraphs become dead code.
#if MX_LIBRARY_VERSION <= 7
  for (Node* n : g->nodes) {
#else
  for (int i = 0; i < g->size(); i++) {
    Node* n = g->getNode(i);
#endif
    if (commonnode != nullptr && n->op.compare("softmax") == 0) {
      n->inputs[1].node = commonnode;
    }
  }

#if MX_LIBRARY_VERSION <= 7
  // convert back to a JSON string from Graph/Node
  *out_graph = new std::string(g->toString());
#endif
  return MX_SUCCESS;
}
```
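Assuming the pass is registered via lib_api.h's REGISTER_PASS macro and compiled into a shared library, it can be applied from Python roughly like this (library and checkpoint names are hypothetical):

```python
import mxnet as mx

# load the compiled custom-pass library (name hypothetical)
mx.library.load('./libtest_pass.so')

# apply the pass to a symbolic model by its registered name
sym, arg_params, aux_params = mx.model.load_checkpoint('bert', 0)
optimized_sym = sym.optimize_for('TEST')
```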

The overhead of these operators is negligible, but it seems common expression elimination doesn't kick in for this case.

With the graph pass: [image]

Without the graph pass: [image]