I have a dataset of English sentences in which every word has been replaced by a token ID (e.g. "hello" -> 3, "there" -> 5, "potato" -> 42). I want to use unsupervised word translation to analyze how feasible it is to reverse engineer such a dataset, since this setup can be treated as unsupervised translation where every word of the unknown "language" has an exact English counterpart.
I use embeddings for two languages:
English embeddings obtained via fastText (the pre-trained embeddings available in the fastText repo)
Embeddings trained with fastText on my custom data (25,000 sentences). Since the tokens are not real words, I disable character-level (subword) features; a minimal training sketch follows below.
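For reference, this is roughly how I train the token-ID embeddings. It is a minimal sketch using the official `fasttext` Python bindings; the corpus file name `token_corpus.txt` (one tokenized sentence per line, e.g. "3 5 42 ...") and the output file name are placeholders:

```python
# Minimal sketch: train skip-gram embeddings on the token-ID corpus.
# minn=0 / maxn=0 disables character n-grams, since token IDs carry no
# subword information; dim is matched to the pre-trained English vectors.
import fasttext

model = fasttext.train_unsupervised(
    "token_corpus.txt",   # placeholder: one tokenized sentence per line
    model="skipgram",
    dim=300,
    minn=0,
    maxn=0,
    epoch=10,
    minCount=1,
)

# Export in .vec format so the vectors can be fed to MUSE's unsupervised.py.
words = model.get_words()
with open("token_embeddings.vec", "w") as f:
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{x:.4f}" for x in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")
```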
I then normalise both embeddings (via unsupervised.py) and visualize some examples using the demo.ipynb notebook. I've tried several variations so far (normalized and non-normalized embeddings), but the results I get are quite bad. Is there anything I can do to improve this, or is this the limit of reconstruction given the small amount of data (25k sentences)? If so, can I make a loose claim that such a scheme (which is essentially a word-level substitution cipher) is very hard to break using unsupervised word translation?
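For clarity, this is the kind of normalization I have in mind (mean-centering followed by unit L2 norm). A minimal NumPy sketch, assuming the embeddings are already loaded into a matrix of shape (vocab_size, dim); the variable names are placeholders:

```python
# Sketch of the normalization applied to both embedding spaces before training:
# center the vectors, then rescale each row to unit L2 norm.
import numpy as np

def normalize_embeddings(emb: np.ndarray) -> np.ndarray:
    emb = emb - emb.mean(axis=0, keepdims=True)          # mean-center
    norms = np.linalg.norm(emb, axis=1, keepdims=True)   # per-row L2 norm
    return emb / np.maximum(norms, 1e-8)                 # unit length

# Example usage (placeholder variable names):
# src = normalize_embeddings(src)   # English fastText vectors
# tgt = normalize_embeddings(tgt)   # token-ID vectors
```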
P.S. Here's the training log. On a side note, the discriminator loss stays roughly constant throughout training.