Data preparation - Githubissues

zwx8981 commented 6 years ago

Hi, thank you for your great work. I have a question about data preparation. Specifically, if I want to use the CNN-based sequence encoder and decoder as standalone modules that can be inserted into other translation models, how should I prepare a source dictionary file that can be loaded successfully by the fairseq.data.Dictionary.load() method? I read the source code and found this comment in the Dictionary.load() method:

    """Loads the dictionary from a text file with the format:

    ```
    <symbol0> <count0>
    <symbol1> <count1>
    ...
    ```
    """

What does count0 mean?

mls1999725 commented 5 years ago

I'd like to know this, too.

jgehring commented 4 years ago

I'm not sure which section of the code you're referring to here, but, generally speaking, the dictionary contains an index-to-symbol mapping as well as the frequencies of the symbols (in the form of raw counts over the respective source corpus). So in the `<symbol0> <count0>` format, count0 is the number of times symbol0 occurs in that corpus.
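To make the format concrete, here is a minimal sketch in plain Python of how such a file could be produced from a tokenized corpus and read back. It does not call fairseq at all; the helper names `build_dictionary_lines` and `parse_dictionary_lines`, the toy corpus, and the sort order (most frequent first, ties alphabetical) are assumptions made for illustration. The real loader may also reserve indices for special symbols, so a line's position in the file is not necessarily the symbol's final index.

```python
from collections import Counter


def build_dictionary_lines(sentences):
    """Count whitespace-separated tokens and emit '<symbol> <count>' lines.

    Hypothetical helper: the ordering below is an illustrative choice,
    not something the quoted docstring requires.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    # Most frequent symbols first; ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{sym} {cnt}" for sym, cnt in ordered]


def parse_dictionary_lines(lines):
    """Read '<symbol> <count>' lines back into parallel lists."""
    symbols, counts = [], []
    for line in lines:
        # Split on the last space so the count is always the final field.
        sym, cnt = line.rstrip("\n").rsplit(" ", 1)
        symbols.append(sym)
        counts.append(int(cnt))
    return symbols, counts


# Toy corpus (assumed data, already tokenized by whitespace).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
dict_lines = build_dictionary_lines(corpus)
symbols, counts = parse_dictionary_lines(dict_lines)
```

Writing `"\n".join(dict_lines)` to a text file would give one `<symbol> <count>` pair per line, matching the layout described in the docstring quoted above.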

facebookresearch / fairseq-lua

Data preparation #130