browsermt / bergamot-translator

Cross-platform C++ library focusing on optimized machine translation on consumer-grade devices.
http://browser.mt
Mozilla Public License 2.0

QE - Distilled model #364

Closed: felipesantosk closed this 1 year ago

felipesantosk commented 2 years ago

This PR tracks the port of the distilled QE (quality estimation) model from deepQuest to Marian.

PyTorch BiRNN model:

BiRNN(
  (_text_field_embedder_src): BasicTextFieldEmbedder(
    (token_embedder_tokens): Embedding()
  )
  (_text_field_embedder_tgt): BasicTextFieldEmbedder(
    (token_embedder_tokens): Embedding()
  )
  (seq2seq_encoder_src): GruSeq2SeqEncoder(
    (_module): GRU(50, 50, batch_first=True, bidirectional=True)
  )
  (seq2seq_encoder_tgt): GruSeq2SeqEncoder(
    (_module): GRU(50, 50, batch_first=True, bidirectional=True)
  )
  (attention): DotProductAttention()
  (_linear_layer_src): Linear(in_features=100, out_features=100, bias=True)
  (_linear_layer_tgt): Linear(in_features=100, out_features=100, bias=True)
  (_dropout): Dropout(p=0.5, inplace=False)
  (_linear_layer): Linear(in_features=200, out_features=1, bias=True)
  (_loss): MSELoss()
)
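For anyone checking exported tensors during the port, the printed dimensions already fix most parameter shapes. The sketch below lists the shapes PyTorch uses for a single-layer GRU and for Linear layers, and counts parameters. The helper names (`gru_shapes`, `linear_shapes`, `count_params`) are made up for illustration; the embedding sizes are not shown in the dump, so they are left out.

```python
# Hypothetical helpers: expected parameter shapes for the layers printed above.
# Shapes follow PyTorch's layout for nn.GRU (one layer) and nn.Linear.

def gru_shapes(input_size, hidden_size, bidirectional=True):
    """Parameter shapes of a single-layer GRU; the 3 gates are stacked row-wise."""
    shapes = {}
    suffixes = ["", "_reverse"] if bidirectional else [""]
    for suffix in suffixes:
        shapes["weight_ih_l0" + suffix] = (3 * hidden_size, input_size)
        shapes["weight_hh_l0" + suffix] = (3 * hidden_size, hidden_size)
        shapes["bias_ih_l0" + suffix] = (3 * hidden_size,)
        shapes["bias_hh_l0" + suffix] = (3 * hidden_size,)
    return shapes


def linear_shapes(in_features, out_features):
    """nn.Linear stores weight as (out_features, in_features)."""
    return {"weight": (out_features, in_features), "bias": (out_features,)}


def count_params(shapes):
    """Total number of scalars across all tensors in a shape dict."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


encoder = gru_shapes(50, 50, bidirectional=True)   # seq2seq_encoder_src / _tgt
projection = linear_shapes(100, 100)               # _linear_layer_src / _tgt
output = linear_shapes(200, 1)                     # _linear_layer
```

With these dimensions each bidirectional encoder has 30600 parameters, each 100-to-100 projection has 10100, and the final layer has 201.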

Port tasks:

Related Marian PR - https://github.com/browsermt/marian-dev/pull/76

jerinphilip commented 2 years ago

@felipesantosk Thank you for opening the requested PRs. I have the following suggestions:

Remove the hardcoded paths from the C++ and Python code. In C++, use CLI parsing (there should be an includable variant of CLI11). In Python, you should be able to use argparse.
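On the Python side, a minimal argparse sketch could look like the following. The flag names (`--model-path`, `--output-path`) and the default output file name are assumptions for illustration, not the interface of the actual check scripts.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical check script; flag names are illustrative."""
    parser = argparse.ArgumentParser(
        description="Run the Python-side check for the QE model port."
    )
    parser.add_argument(
        "--model-path",
        required=True,
        help="Path to the PyTorch (deepQuest) model; replaces a hardcoded path.",
    )
    parser.add_argument(
        "--output-path",
        default="outputs.txt",
        help="Where to write the outputs to compare against the C++ side.",
    )
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

Calling `parse_args(["--model-path", "model.pt"])` returns a namespace with `model_path` and `output_path` attributes, so the script body never needs a literal path.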

Once both are in place, add a shell script that serves as documentation for running both (the Python check scripts and the C++ converter) and reporting differences. The shell script may fetch the Python (deepQuest) model and use it to write out the .zips requested in https://github.com/felipesantosk/bergamot-translator/issues/2#issuecomment-1055777416, making the process ahead easier.

This should make it easy for reviewers to run things in case more hands-on help is required, and it makes it possible to attach a check via GitHub Actions.

Bear in mind that you can parallelize development: if you're stuck waiting on input from me or @graemenail, you can unit-test the port of the Linear layer etc. using random inputs (PyTorch should let you pick out individual tensors or nn.Modules). If the units work, the whole should work.
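A rough sketch of such a unit check, in plain Python so it runs without any dependencies. Both functions here are stand-ins: in the real setup the reference output would come from the PyTorch module and the ported output from the Marian side. The 100-to-100 size mirrors `_linear_layer_src`; the tolerance is an assumption and would need loosening for float32 outputs.

```python
import random


def linear_reference(x, weight, bias):
    """Stand-in for the reference: y[j] = sum_i x[i] * weight[j][i] + bias[j]."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weight, bias)]


def linear_ported(x, weight_t, bias):
    """Stand-in for the port: consumes the transposed weight, shape (in, out)."""
    out = list(bias)
    for i, xi in enumerate(x):
        for j, w in enumerate(weight_t[i]):
            out[j] += xi * w
    return out


def check_linear(in_features=100, out_features=100, trials=5, tol=1e-9, seed=0):
    """Compare both implementations on random inputs; return (worst diff, passed)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        weight = [[rng.uniform(-1, 1) for _ in range(in_features)]
                  for _ in range(out_features)]
        bias = [rng.uniform(-1, 1) for _ in range(out_features)]
        x = [rng.uniform(-1, 1) for _ in range(in_features)]
        weight_t = [list(col) for col in zip(*weight)]  # (in, out) layout
        expected = linear_reference(x, weight, bias)
        actual = linear_ported(x, weight_t, bias)
        diff = max(abs(e - a) for e, a in zip(expected, actual))
        worst = max(worst, diff)
    return worst, worst <= tol
```

The same pattern extends to the other units: fix a seed, feed identical random tensors to both sides, and report the largest absolute difference rather than a bare pass/fail, which makes layout mistakes such as a missing transpose easy to spot.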

I think the different SentencePiece model might run into trouble with Marian's equivalent SentencePiece vocab (because some parameters are hardcoded in some way), but we should be able to fix that eventually.

XapaJIaMnu commented 1 year ago

Ended up not being used, sorry about that.