Open natgillin opened 8 months ago
Hey @natgillin, you can find the fairseq2n (fairseq2 Native) source code under https://github.com/facebookresearch/fairseq2/tree/main/native.
Thanks @cbalioglu for the pointer!
Regarding the fairseq2n.bindings, is there a way to avoid them? I see that they are mainly used in https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/data/text/sentencepiece.py
Is there a way to load directly from https://github.com/google/sentencepiece?tab=readme-ov-file#overview instead of
load_basic_sentencepiece_tokenizer = StandardTextTokenizerLoader(
default_asset_store,
default_download_manager,
lambda path, _: BasicSentencePieceTokenizer(path),
)
Hey @natgillin, I might be more helpful if you can tell me what you want to achieve. The sentencepiece implementation in fairseq2 is fully compatible with Google's sentencepiece. In fact, in fairseq2n we use the native API of sentencepiece.
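Since the model files are compatible, one option is to load the same model file with Google's sentencepiece Python package instead of going through fairseq2n. Below is a minimal sketch, not the fairseq2 implementation: the class name PlainSentencePieceTokenizer and its methods are hypothetical, and it assumes `pip install sentencepiece`. It returns plain Python lists and does not reproduce any prefix/suffix token handling or tensor output that fairseq2's tokenizers may provide.

```python
from typing import Iterable, List


class PlainSentencePieceTokenizer:
    """Hypothetical wrapper around a SentencePiece processor (no fairseq2n needed)."""

    def __init__(self, processor) -> None:
        # `processor` is expected to behave like sentencepiece.SentencePieceProcessor.
        self._sp = processor

    @classmethod
    def from_file(cls, path) -> "PlainSentencePieceTokenizer":
        # Imported lazily so the wrapper itself has no hard dependency.
        import sentencepiece as spm

        return cls(spm.SentencePieceProcessor(model_file=str(path)))

    def encode(self, text: str) -> List[int]:
        # out_type=int asks the processor for token ids rather than pieces.
        return list(self._sp.encode(text, out_type=int))

    def decode(self, ids: Iterable[int]) -> str:
        return self._sp.decode(list(ids))
```

Usage would look like `tok = PlainSentencePieceTokenizer.from_file("tokenizer.model")` followed by `tok.encode("Hello world")`, where the model path is a placeholder for whatever file the asset store would otherwise download.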
Thanks for the explanation!
We're trying to merge as much code as possible from fairseq2 into our own fairseq fork, since we're not sure whether that will eventually happen in the public repository. We have gotten some of the decoder-only models working in fairseq by copying code blocks from fairseq2, but we found that some dependencies of fairseq.data.text rely on fairseq2n.
Removing the fairseq2n dependency would essentially allow us to backport some fairseq2 features and model support to fairseq.
There are a few places in the codebase that import from fairseq2n. Is there a pointer to the raw source for those?
E.g.
from fairseq2n.bindings.data.text.sentence import ...
from fairseq2n import DOC_MODE
from fairseq2n.bindings.data.text.text_reader import ...