browsermt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Import matrix-multiply from a separate wasm module #56

Closed abhi-agg closed 3 years ago

abhi-agg commented 3 years ago

Description

This PR resolves all but the last task of the issue https://github.com/browsermt/bergamot-translator/issues/204

List of changes for the generic matrix multiply interface:

Please note that:

  1. We currently import only those functions that are required to support the --int8shiftAlphaAll configuration.
  2. The value 0.0 is being passed as the zero point everywhere.
  3. The wasm target doesn't support 16-bit matrix multiplication (i.e. we abort on 16-bit integer matrix multiplication for the wasm target).

Added dependencies: none

How to test

Follow the README instructions to run the wasm test page and observe the translated sentences in the browser console logs. This confirms that the matrix-multiply implementation is imported from a separate wasm module.


abhi-agg commented 3 years ago

@kpu @XapaJIaMnu I have initially made this change for one function (int8PrepareA) of the matrix multiply interface to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

For int8PrepareA, I have made the following changes (based on my understanding):

  1. Both intgemm::Int8Shift::PrepareA and intgemm::Int8::PrepareA calls are routed to int8PrepareA for the wasm target.
  2. The value 0.0 is currently being passed as the zero point.

Please let me know if this needs to change and I will make the required changes 👍

kpu commented 3 years ago

This looks like a bunch of WASM binding code. I'm not sure how to review it.

One comment: whether or not to fall back should be atomic for the entirety of the API. Mixing a native prepareA with a fallback WASM 128-bit multiply won't work, because they can have different opaque representations.

Perhaps use the existing CPU model dispatch mechanism in intgemm.cc for your polyfill? Just sense the added functions then treat it like a CPU model.

Also, the fallback should be easy to implement by calling the existing function rather than an abort.

abhi-agg commented 3 years ago

This looks like a bunch of WASM binding code. I'm not sure how to review it.

That's fine. I guess you can ignore that part. I will add @yurydelendik and @eqrion to review that part of the code.

One comment: whether or not to fall back should be atomic for the entirety of the API. Mixing a native prepareA with a fallback WASM 128-bit multiply won't work, because they can have different opaque representations.

I didn't follow you here. It is a bit unclear to me where I am mixing the native prepareA with a fallback WASM 128-bit multiply. Could you please point it out? That would be helpful.

As I already mentioned, I made this change for just one function (int8PrepareA) of the matrix multiply interface to get early feedback from you. I can make similar changes for the rest of the functions of this interface after your feedback.

So, either the native implementation will be called for the entire wasm matrix multiplication interface or the fallback implementation will be called.

Using a preprocessor directive, I separated the code for the wasm target from the non-wasm targets for matrix multiplication. The non-wasm targets are still unchanged (i.e. they use intgemm directly), while for the wasm target I am now calling the new matrix multiply interface. Specifically, this is the part of the code where I am looking for your feedback 👍

kpu commented 3 years ago

I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?

https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc

That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

PrepareA just quantizes so it's compatible across architectures anyway. But e.g. if you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit you'll get wrong results. The function-level polyfill design you have here carries that risk whereas something that checks for all functions to decide what to use does not carry that risk. That said, I presume Firefox would implement all-or-nothing, so the risk is low.

abhi-agg commented 3 years ago

I think Marian is the wrong layer of abstraction for this. Why not just hide it behind the dispatcher in intgemm, treating the presence of JS intrinsics as a CPU model?

https://github.com/kpu/intgemm/blob/master/intgemm/intgemm.cc

That way you get a matrix multiply library that happens to run faster if there's a JS API. And MXNet will just happen to get a faster matrix multiply too.

I am confused now. If I hide everything behind the dispatcher in intgemm, then that would mean I can just throw away the gemm interface that we recently designed. There are still some subtle differences between the intgemm interface and the one that we recently designed (callbacks are one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Additionally, Marian can link and work with different gemm libraries (intgemm, fbgemm). Wouldn't Marian be the right place to link and use another gemm library that runs efficiently on the wasm target? This also keeps open the possibility of using different backend implementations if one doesn't want to link to intgemm at all.

PrepareA just quantizes so it's compatible across architectures anyway. But e.g. if you had a native PrepareB with CPUID dispatch that picked up 256-bit AVX2 and a WASM Multiply that does 128-bit you'll get wrong results. The function-level polyfill design you have here carries that risk whereas something that checks for all functions to decide what to use does not carry that risk. That said, I presume Firefox would implement all-or-nothing, so the risk is low.

As you also mentioned, Firefox will implement all-or-nothing (as all of us also agreed in the beginning that implementing only a subset of the gemm functions wouldn't give us the performance gain that we want). In this scenario, we will not have the risk that you mentioned.

On a side note, I have updated this PR to refactor marian code for the rest of the functions of gemm interface as well.

kpu commented 3 years ago

If I hide everything behind the dispatcher in intgemm, then that would mean I can just throw away the gemm interface that we recently designed. There are still some subtle differences between the intgemm interface and the one that we recently designed (callbacks are one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface you can have it. If you want a matrix multiply interface that's standards-ready for others to use, implement WebNN. The middle does not serve much purpose.

yurydelendik commented 3 years ago

Sorry for chiming in late. I don't fully understand many of the concerns mentioned above, probably because different terminology is being used.

I think Marian is the wrong layer of abstraction for this.

Maybe, or maybe not. The ideal location for the fallback implementation is a separate wasm module in a completely new project, which you would have to start and maintain as a dependency. Keeping the fallback logic in Marian for the first iteration will also save you the time you would otherwise spend dealing with wasm memory instantiation when multiple wasm modules are used.

Now you understand why I was protesting an unnecessary redesign of the interface.

I don't think we are redesigning an interface. We are building an interface suitable for wasm intrinsics. You will continue to use the existing interface outside the wasm target.

abhi-agg commented 3 years ago

If I hide everything behind the dispatcher in intgemm, then that would mean I can just throw away the gemm interface that we recently designed. There are still some subtle differences between the intgemm interface and the one that we recently designed (callbacks are one of them). With these changes in place, it is difficult to abstract at the intgemm level.

Yes, exactly. Now you understand why I was protesting an unnecessary redesign of the interface. If you want an intgemm interface you can have it. If you want a matrix multiply interface that's standards-ready for others to use, implement WebNN. The middle does not serve much purpose.

As I have previously stated a few times, the current API is not a redesign of intgemm but an interface that is suitable for intrinsics. This mainly involved changing intgemm's callback/function-pointer style to passing plain arguments. We had to do it because there is currently no good way to express a function pointer (@yurydelendik can confirm this).

kpu commented 3 years ago

Additionally, Marian can link and work with different gemm libraries (intgemm, fbgemm). Wouldn't marian be the right place to link and use another gemm library that runs efficiently on wasm target? This also keeps a possibility to use different backend implementations if one doesn't want to link to intgemm at all.

There are currently two layers of dispatch: Marian and CPUID in intgemm.

In Marian, you manually configure which gemm library (intgemm or fbgemm) you want in the Options object, and they have peculiar behaviors, like whether they handle the output matrix. There is currently no automatic dispatch mechanism that switches between the two.

intgemm favors CPUID dispatch, but one can call the routines directly via the class for that CPUID. The templates would admittedly try to force the existence of functions you haven't implemented (and don't need for our particular case).

This appears to be inserting a third layer in between, with dispatch controlled by a compiler predefine. Is there any means of sensing browser support, or will it always be two different builds? If the latter, is WASM the correct define for this switch?

To give a different example, with the current wormhole instruction, I could (but have not gotten around to) run the wormhole instruction, inspect the output, and then use that as a CPUID of sorts to dispatch different intgemm implementations. Is there an analogous way to sense the presence of an optimized implementation here or is it always two different builds?

andrenatal commented 3 years ago

Hi @kpu, we think we should have a meeting to clarify these issues. I'm sending you an email to get it scheduled. Thanks.

abhi-agg commented 3 years ago

@kpu @XapaJIaMnu This PR is up for review. I made all the changes as per our meeting and updated the description of this PR to summarize them 👍

abhi-agg commented 3 years ago

@kpu @XapaJIaMnu Could you folks please review the PR?

kpu commented 3 years ago

@abhi-agg We good? I think the changes should be quick.

abhi-agg commented 3 years ago

There are places where code was copied and it looks like a narrower ifdef would have avoided copying code.

@kpu Actually, I used a narrower ifdef in the beginning, but it caused a compilation failure on the Windows platform. So I had to do it this way.

kpu commented 3 years ago

Can you be more specific about the problem with Windows? Something to do with an unused typedef?

  #if defined(WASM)
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
      // snip
  #else
    typedef typename intgemm_<vtype>::type Integer;
    return {NodeOp(
      quantMult_ = *child(1)->val()->data();
      typedef typename intgemm_<vtype>::type Integer;

abhi-agg commented 3 years ago

Can you be more specific about the problem with Windows? Something to do with an unused typedef?

The error is not related to an unused typedef. It is the following: error C2121: '#': invalid character: possibly the result of a macro expansion

I found the explanation here: https://docs.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2121?view=msvc-160

So, the problem is with using a preprocessor directive inside the NodeOp macro expansion. I believe that is why even the COMPILE_CPU preprocessor conditional has a corresponding else part which is empty.

I had found https://stackoverflow.com/questions/28187619/using-preprocessor-conditionals-in-macro-functions-with-double-parentheses, but that solution doesn't really solve the problem of code duplication.

If you have a suggestion, I am all ears 👍

kpu commented 3 years ago

NodeOp is a very simple macro.

#define NodeOp(op) [=]() { op; }

Just expand it then use ifdefs inside.
https://github.com/marian-nmt/marian-dev/blob/c29cc83dc4d234f0e3a00a46a729053132b408b8/src/graph/chainable.h#L11

(And don't feel bad about breaking the abstraction; there are already places in the code doing that.)

abhi-agg commented 3 years ago

@kpu I have pushed the new commits to narrow the ifdef scopes 👍