Raw source for fairseq2n code

facebookresearch / fairseq2

FAIR Sequence Modeling Toolkit 2

https://facebookresearch.github.io/fairseq2/

MIT License


Raw source for fairseq2n code #370

Open natgillin opened 8 months ago

natgillin commented 8 months ago

There are a few places in the codebase that import from fairseq2n. Is there a pointer to the raw source for those?

E.g.

from fairseq2n.bindings.data.text.sentence import ...
from fairseq2n import DOC_MODE
from fairseq2n.bindings.data.text.text_reader import ...

cbalioglu commented 8 months ago

Hey @natgillin, you can find the fairseq2n (fairseq2 Native) source code under https://github.com/facebookresearch/fairseq2/tree/main/native.
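For readers wondering why these imports have no matching .py file, here is a minimal sketch that labels a module as Python source or a compiled extension using only the standard library. The helper name classify_module is hypothetical and is not part of fairseq2 or fairseq2n.

```python
import importlib.machinery
import importlib.util


def classify_module(name: str) -> str:
    """Return a rough label for where an importable module comes from."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError, AttributeError):
        # Raised when a parent package is missing or is not a real package.
        return "not found"
    if spec is None:
        return "not found"

    origin = spec.origin or ""
    # Compiled extensions end in suffixes such as ".so" or ".pyd".
    if origin.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES)):
        return "compiled extension"
    # Plain Python modules end in ".py".
    if origin.endswith(tuple(importlib.machinery.SOURCE_SUFFIXES)):
        return "python source"
    return "other"
```

In an environment where fairseq2n is installed, calling classify_module on the module names from the imports above should indicate whether each one is backed by a compiled binary, in which case the readable source is the C++ code in the native directory.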

natgillin commented 8 months ago

Thanks @cbalioglu for the pointer!

Regarding the fairseq2n.bindings, is there a way to avoid them? I see that they are mainly used in https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/data/text/sentencepiece.py

Is there a way to load directly from https://github.com/google/sentencepiece?tab=readme-ov-file#overview instead of

load_basic_sentencepiece_tokenizer = StandardTextTokenizerLoader(
    default_asset_store,
    default_download_manager,
    lambda path, _: BasicSentencePieceTokenizer(path),
)

from https://github.com/facebookresearch/fairseq2/blob/main/src/fairseq2/data/text/sentencepiece.py#L225C1-L229C2?

cbalioglu commented 8 months ago

Hey @natgillin, I might be more helpful if you can tell me what you want to achieve. The sentencepiece implementation in fairseq2 is fully compatible with Google's sentencepiece. In fact, in fairseq2n we use the native API of sentencepiece.
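Given the compatibility described above, a model file could in principle be loaded with Google's sentencepiece Python package directly. The following is a minimal sketch under that assumption. The names load_processor and PlainSentencePieceTokenizer are hypothetical, and the wrapper does not reproduce the fairseq2 tokenizer interface (tensors, control symbols, asset loading).

```python
from typing import List, Protocol, Sequence


class SentencePieceLike(Protocol):
    """The subset of sentencepiece.SentencePieceProcessor used below."""

    def encode(self, input: str, out_type: type = int) -> list: ...

    def decode(self, input: List[int]) -> str: ...


def load_processor(model_path: str) -> SentencePieceLike:
    """Load a SentencePiece model with Google's Python package."""
    # Imported lazily so this module can be imported without the package.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=model_path)


class PlainSentencePieceTokenizer:
    """Hypothetical wrapper that returns plain Python lists of token ids."""

    def __init__(self, processor: SentencePieceLike) -> None:
        self._sp = processor

    def encode(self, text: str) -> List[int]:
        # out_type=int asks for token ids rather than string pieces.
        return list(self._sp.encode(text, out_type=int))

    def decode(self, ids: Sequence[int]) -> str:
        return self._sp.decode(list(ids))
```

Usage would look like PlainSentencePieceTokenizer(load_processor(path)), where path points to a SentencePiece model file. Whether the resulting ids match those of a specific fairseq2 tokenizer depends on any prefix or suffix symbols that tokenizer adds, which this sketch does not handle.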

natgillin commented 7 months ago

Thanks for the explanation!

We're trying to merge as much code as possible from fairseq2 into our own fairseq fork, since we're not sure whether that merge will eventually happen in the public repository. We have gotten some of the decoder-only models working in fairseq by copying code blocks from fairseq2, but we found that some of the code in fairseq2.data.text depends on fairseq2n.

Removing the fairseq2n dependency would essentially allow us to backport support for some fairseq2 features and models to fairseq.