Redesign workflow - Githubissues

browsermt / mts

Marian Translation Service

Apache License 2.0

18 stars 4 forks source link

@jerinphilip Based on our discussion on slack and upon some further thought, I'd propose this.

replace members std::vector<std::string> input and std::string translation with std::vector<SentencePieceText> and SentencePieceText, respectively.
Move tokenization from QueuedInput::next() to the PlainTextTranslation constructor, where sentence splitting takes place.
Replace indeed Queued Input and BatchGenerator by a different mechanism.

(Background comment: I'm hesitant to touch the Batch class, as it's fairly baked into Marian, so we need some mechanism of mapping sentence-level translations to the respective requests. Currently the TranslationService has a map that maps from unique sentence ids to the respective Jobs and promises.)

Wrt 3., here's what I'd propose for starters:

for the time being, we keep the mechanism where the TranslationService stores a shared ptr to each translation job in a map that maps from its id (identical to the sentenceId in the respective batch) to the shared ptr and a promise.
for further processing, we pass around weak_ptr. This lets us determine at any point whether a job is still valid.
we add a member function cancelJob(uint64_t job_id) to TranslationService that removes the entry in the map mentioned above.
PlainTextTranslation keeps track of the ids of the jobs it owns. Cancellation of a PlainTextTranslation the cancels all the jobs it owns.
The TranslatoinService (via custom batcher class (to make code maintenance easier)) keeps an array of deques corresponding to the respective sentence lengths, and a heap where the heads of each deques are prioritized so that oldest comes first. When a translation worker requests a batch, we take the oldest sentence and fill up the batch with sentences of the same or a similar length. Each time we pop a sentence/Job from the respective deque, we update the heap. Jobs that aren't valid any more (weak_ptr::lock() returns a shared_ptr to nullptr), can be skipped at batch creation time.

namespace marian { namespace server { class Batcher { // Note: we'll probably also need a mutex for locking. // pending_jobs_ is a vector of deques of pending job, which deque a job is added to depends on the input length in tokens std::vector<std::deque<std::weak_ptr<Job>>> pending_jobs_; // keeps track of the oldest jobs that are still waiting. New batches are build around these. // (For starters, we can revise heuristics later.) Also, we could just sweep over the deques to find the one with the oldest // head if the heap thing is too tedious, at least initially. std::vector<uint16_t> priority_heap_; // map_into_heap_ maps from deque index into the heap,; auxiliary structure for updating the heap. std::vector<uint16_t> map_into_heap_; Ptr<Job> popJob(uint26_t deque_id); // private function to pop a job from deque No. deque_id and update the priority_heap public: void pushJob(Ptr<job> j); // push a job onto the queue Ptr<Batch> popBatch(); // deliver the next batch };

@jerinphilip @kpu

Moved here from slack:

std::future -> callback (for WASM)?

UG - Yes. This will require a bit of thought and work to get it right.

Do we have sentence-boundaries from ssplit? I'm not accounting for this at the moment.

UG - See the ssplit code / header file. SentenceStream can stream into std::string or sentencepiece::min_string_view. So you'll have to replace the std::string snt by the min_string_view and you get the StringView.

I didn't get the part where Uli mentioned having Segment access before translation. Is this a matter of concern?

UG - That was a misunderstanding between Fredrick and Kenneth.

On implementation: I think I have access to SentencePieceText now, I'm doing typedef sentencepiece::SentencePieceText Segment to make life easy. Words (also ranges/string_ref) can be constructed from Segment at some point. This is to solve the following requirement: Your input is some std::string paragraph; Preprocess that into: Array of sentences: And a sentence is some job tracking i.e. a pointer to a request object and an array of

      struct Token {
         Marian vocab id (an integer)
         A string_ref of the original input span.  The string_ref is backed by paragraph.
      };

UG What do we need that for when we have SentencePieceText? Admittedly, we'll need to track SentencePieceText instances to the original input span pre-sentence-splitting.

browsermt / mts

Redesign workflow #6