Open ugermann opened 3 years ago
Sketch of the Batcher class:
namespace marian {
namespace server {
class Batcher {
// Note: we'll probably also need a mutex for locking.
// pending_jobs_ is a vector of deques of pending job, which deque a job is added to depends on the input length in tokens
std::vector<std::deque<std::weak_ptr<Job>>> pending_jobs_;
// keeps track of the oldest jobs that are still waiting. New batches are build around these.
// (For starters, we can revise heuristics later.) Also, we could just sweep over the deques to find the one with the oldest
// head if the heap thing is too tedious, at least initially.
std::vector<uint16_t> priority_heap_;
// map_into_heap_ maps from deque index into the heap,; auxiliary structure for updating the heap.
std::vector<uint16_t> map_into_heap_;
Ptr<Job> popJob(uint26_t deque_id); // private function to pop a job from deque No. deque_id and update the priority_heap
public:
void pushJob(Ptr<job> j); // push a job onto the queue
Ptr<Batch> popBatch(); // deliver the next batch
};
@jerinphilip @kpu
Moved here from slack:
std::future -> callback (for WASM)?
UG - Yes. This will require a bit of thought and work to get it right.
Do we have sentence-boundaries from ssplit? I'm not accounting for this at the moment.
UG - See the ssplit code / header file. SentenceStream can stream into std::string or sentencepiece::min_string_view. So you'll have to replace the std::string snt by the min_string_view and you get the StringView.
I didn't get the part where Uli mentioned having Segment access before translation. Is this a matter of concern?
UG - That was a misunderstanding between Fredrick and Kenneth.
On implementation: I think I have access to SentencePieceText now, I'm doing typedef sentencepiece::SentencePieceText Segment to make life easy. Words (also ranges/string_ref) can be constructed from Segment at some point. This is to solve the following requirement: Your input is some std::string paragraph; Preprocess that into: Array of sentences: And a sentence is some job tracking i.e. a pointer to a request object and an array of
struct Token {
Marian vocab id (an integer)
A string_ref of the original input span. The string_ref is backed by paragraph.
};
UG What do we need that for when we have SentencePieceText? Admittedly, we'll need to track SentencePieceText instances to the original input span pre-sentence-splitting.
@jerinphilip Based on our discussion on slack and upon some further thought, I'd propose this.
std::vector<std::string> input
andstd::string translation
withstd::vector<SentencePieceText>
and SentencePieceText, respectively.(Background comment: I'm hesitant to touch the Batch class, as it's fairly baked into Marian, so we need some mechanism of mapping sentence-level translations to the respective requests. Currently the TranslationService has a map that maps from unique sentence ids to the respective Jobs and promises.)
Wrt 3., here's what I'd propose for starters: