Extension to other languages

youraveragesciencepal commented 4 years ago

Hello, I am interested in training the model on other languages such as Spanish and Turkish, but I am not sure how to generate the stroke sequences. Any help will be highly appreciated. Thanks

Grzego commented 4 years ago

Hi @YourAverageSciencePal,

For other languages you would need a dataset that is somewhat similar to the original one. It should include the pen strokes and the corresponding text information.

The pen strokes are stored as consecutive points.

So the letter above would be represented this way:

[[27, 18, 0],
 [24, 16, 0],
 [21, 16, 0],
 [16, 19, 0],
 [14, 25, 0],
 [16, 31, 0],
 [21, 32, 0],
 [26, 28, 0],
 [27, 23, 0],
 [28, 18, 0],
 [27, 28, 0],
 [29, 31, 1]]  # 1 means this is the end of a stroke

The first two numbers are the coordinates, and the third says whether the point is the end of a stroke.

Along with the above sequence there should also be a label. In the case of a single letter it would be just: "a". When translated to a number it would end up as [54].

For example "quick brown fox ..." would be encoded as [70, 74, 62, 56, 64, 1, 55, 71, 68, 76, 67, 1, 59, 68, 77, 1, 13, 13, 13]. The numbers associated with each letter depend on how many letters we have in the dataset (but are otherwise arbitrary).
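For a new alphabet, a mapping like this can be built from the label texts themselves. The following is only a minimal sketch: the helper names `build_translation`, `encode` and `decode` are hypothetical, and the way ids are assigned here (sorted characters, starting at 1) is an assumption of the sketch, not necessarily how the ids in this repository were assigned.

```python
def build_translation(texts):
    """Map every character that occurs in `texts` to an integer id.

    Ids start at 1 in this sketch; that is an arbitrary choice.
    """
    chars = sorted(set(''.join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, translation):
    """Turn a string into a list of integer ids."""
    return [translation[ch] for ch in text]


def decode(ids, translation):
    """Turn a list of integer ids back into a string."""
    reverse = {v: k for k, v in translation.items()}
    return ''.join(reverse[i] for i in ids)


# toy label texts with non-ASCII characters
texts = ["el niño", "café"]
translation = build_translation(texts)
ids = encode("niño", translation)
print(ids, decode(ids, translation))
```

Characters such as ñ or é simply become additional entries in the dictionary, so the label side needs no special handling for them; the hard part remains collecting matching stroke data.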

youraveragesciencepal commented 4 years ago

Thank you so much for your reply

youraveragesciencepal commented 4 years ago

Hello, is there any software with which we can generate this sequence on our own? I cannot seem to find any. Also, can you elaborate a bit more on the structure of your .npy file?

Grzego commented 4 years ago

@YourAverageSciencePal, unfortunately I don't know of any software that could be used here.

As for the elaboration on the .npy files, I hope the little script below will be of help to you. In case of any more questions, feel free to ask. :)

import numpy as np
import matplotlib.pyplot as plt
import pickle

# load the preprocessed data from the file
data = np.load('data/dataset.npy', allow_pickle=True)
# let's look at its shape
print(data.shape)
# >>> (10867,)
# so it's more of a list of examples
# now if we look into some example
print(data[0].shape)
# >>> (855, 3)
# this array stores consecutive points that represent a pen stroke
# if you print the first few points
print(data[0][:5])
# >>> [[ 0.82798475 -4.2939095   0.        ]
#      [ 0.7848605  -4.3370337   0.        ]
#      [ 0.8193599  -4.2852845   0.        ]
#      [ 0.81073505 -4.2939095   0.        ]
#      [ 0.8021102  -4.2939095   0.        ]]
# you will see that we store (x, y, e) in each row
# x and y represent coordinates, and e holds special information on
# whether or not after that point we will "lift" the pen (and because it
# is lifted after that point, we wouldn't see the line between those points)

# let's plot the first example, ignoring the `e` part for now
example = data[0]
plt.plot(example[:, 0], -example[:, 1])  # y coordinate is inverted,
                                         # but that's not really important
plt.show()
# this should display a single example from the dataset, as you can see
# it looks like someone didn't lift the pen during writing
# now let's include information stored in `e`
lifts = np.where(example[:, 2] == 1.)[0] + 1  # we do +1 here because we want to
                                              # split after the lifted point
strokes = np.split(example, lifts)
for s in strokes:
    plt.plot(s[:, 0], -s[:, 1])
plt.show()
# this should display a single example, but without the edges
# where the pen is "lifted"

# now let's move to labels and translation files
labels = np.load('data/labels.npy', allow_pickle=True)
with open('data/translation.pkl', 'rb') as f:
    translation = pickle.load(f)
# look at labels shape
print(labels.shape)
# >>> (10867,)
# it should be the same as `data` because we need a label for each
# example that's present in the dataset
print(labels[0])
# >>> [29, 78, 1, 47, 71, 58, 75, 68, 71, 1, 50, 62, 65, 65, 
#      62, 54, 66, 72, 13, 1, 28, 1, 66, 68, 75, 58]
# each number in this array represents a letter which we can decode using
# the reversed translation dictionary (we need to reverse it, because it was
# created to be used during the generation, where we need to convert text into
# the numerical labels, but in this example we want to do the reverse)
reversed_translation = {v: k for k, v in translation.items()}
print(''.join(reversed_translation[x] for x in labels[0]))
# >>> "By Trevor Williams. A move"
# which should show the same text we could previously read on plots
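When building a dataset for another language, it may help to check that it follows the same structure before training. The sketch below is not part of the repository; `check_dataset` is a hypothetical helper, and the checks only cover what the script above shows (one label per example, rows of (x, y, e), `e` being 0 or 1, label ids present in the translation dictionary).

```python
import numpy as np


def check_dataset(data, labels, translation):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(data) != len(labels):
        problems.append(f"{len(data)} examples but {len(labels)} labels")
    valid_ids = set(translation.values())
    for i, (points, label) in enumerate(zip(data, labels)):
        points = np.asarray(points)
        if points.ndim != 2 or points.shape[1] != 3:
            problems.append(f"example {i}: expected shape (n, 3), got {points.shape}")
            continue
        if not np.isin(points[:, 2], (0, 1)).all():
            problems.append(f"example {i}: `e` column must contain only 0 or 1")
        unknown = set(label) - valid_ids
        if unknown:
            problems.append(f"example {i}: label ids not in translation: {sorted(unknown)}")
    return problems


# toy data: the second example is deliberately broken in two ways
translation = {'a': 1, 'b': 2}
data = np.empty(2, dtype=object)
data[0] = np.array([[0., 0., 0.], [1., 1., 1.]])
data[1] = np.array([[0., 0., 0.], [1., 2., 2.]])  # invalid `e` value (2)
labels = [[1], [2, 3]]                             # id 3 is not in `translation`

problems = check_dataset(data, labels, translation)
for p in problems:
    print(p)
```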

youraveragesciencepal commented 4 years ago

Hello, after several months, I have been able to bring the data into the array format. In the example above you showed that a 1 indicates the pen has been lifted. All 37 characters, with 77 variations each, now have their coordinates as (x, y, z) with z being 0 or 1. So I do not need to do any further data preprocessing, right? Also, any tips on how to convert it into the numpy format you have used? I have all the coordinates in CSV format. Thanking you again.

espetro commented 4 years ago

Hi @YourAverageSciencePal, I am also researching how to create new datasets with different alphabets to train the model. How did you finally collect the variations? Did you create a tool or follow any guides? 🤔

espetro commented 4 years ago

@Grzego I see that your model generates the output in a matplotlib figure. Let's say I generate this word from Spanish:

[image: Spanish word]

Actually, in Spanish it has an accent mark on the first e, so it is an é. As far as I know, there is a French dataset which includes these variations, but would it be possible to know when the e is being written in matplotlib and add a - (accent mark) on top of it? I know it sounds horrible, but it'd be pretty useful for generating characters like ñ.

Grzego commented 4 years ago

@YourAverageSciencePal if your data is normalized then it should be fine. Otherwise, normalizing it similarly to what is done on lines L98-L115 of the preprocess.py file could be beneficial. For conversion from CSV to numpy you can use the pandas package.
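A minimal sketch of such a conversion, assuming the CSV has one row per point with the columns `sample_id`, `x`, `y` and `e`. These column names, the in-memory CSV and the output file name are all hypothetical, so adapt them to your file. Normalization is not covered here.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV layout: one row per point, rows of one sample share a sample_id.
csv_text = """sample_id,x,y,e
0,27,18,0
0,24,16,0
0,29,31,1
1,5,5,0
1,6,7,1
"""
df = pd.read_csv(io.StringIO(csv_text))  # for a real file: pd.read_csv('your_file.csv')

# One (n_points, 3) float32 array per sample, in order of first appearance.
examples = [
    group[['x', 'y', 'e']].to_numpy(dtype=np.float32)
    for _, group in df.groupby('sample_id', sort=False)
]

# Samples have different lengths, so they are stored in an object array.
# This gives the same (n_examples,) shape that dataset.npy has.
data = np.empty(len(examples), dtype=object)
for i, ex in enumerate(examples):
    data[i] = ex

print(data.shape, data[0].shape)
# np.save('my_dataset.npy', data, allow_pickle=True)  # hypothetical output path
```

The labels can be stored the same way: one list of integer ids per sample, in the same order as the examples.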

Just one side note. The original dataset on which the model was trained contained whole sentences in sequences. This means it could learn how to move smoothly from writing one letter to the next; for example, handwriting "at" is slightly different from handwriting "am". If I understand correctly, your current dataset contains only single letters, and in that case this model might not perform as expected.

Grzego commented 4 years ago

Hi @espetro. About your question on adding marks to the generated sequence: this is theoretically possible, but I would not recommend it. The way to achieve it would be to inspect the phi tensor while generating (it holds information about the attention of the network related to the letter that it currently writes) and check it for the given letter. While that letter is being generated you can take some action to "correct" it. Here is a short code snippet to illustrate what I mean:

# somewhere in the for loop in lines 87-108 in generate.py

# record indices of all points related to a letter we want to deal with
# `special_letter_idx` is the index of our special letter in the input sequence
# `special` is assumed to be an empty list created before the loop
if np.argmax(phi_data[-1]) == special_letter_idx:
    special.append(len(coords))

# ...

# somewhere after the previous for loop at line 118 in generate.py

# at this point `coords` actually holds deltas between consecutive points
# so we must "inject" differences to draw something in the correct place
special_coords = coords[special]  # select coords related to our letter
cs_special = cumsum(special_coords)  # compute actual letter shape

min_x = np.min(cs_special[:, 0])
max_x = np.max(cs_special[:, 0])
max_y = np.max(cs_special[:, 1])

rightdiff = max_x - min_x
updiff = max_y - cs_special[-1, 1] + 0.05  # how high to move a pen

# create array of differences that would be injected into generated sequence
injection = np.array([
    [0, 0, 1],  # just to lift a pen
    [-rightdiff, -updiff, 0],
    [rightdiff, 0, 1],
    [0, updiff, 1],  # move pen back to starting position
])

coords = np.concatenate((
    coords[:special[-1]],  # before injection
    injection,
    coords[special[-1]:],  # put back everything after
    ), axis=0)

# ...
# proceed to plotting

This will add a horizontal line above the letter at special_letter_idx in the input sequence. Keep in mind that it may not be precisely above it. It all depends on the model's internal representation (which may not exactly correspond to actual letters).

As I said previously, I do not recommend doing it that way. A better and probably more reliable way of implementing this would be to create a dataset in the language you want to generate (although I understand this can be rather hard).
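The injection idea can also be tried on toy data, without the model. The sketch below is not the code from generate.py: the helper names `to_absolute` and `inject_bar` and the toy values are hypothetical. It assumes that the sequence holds (dx, dy, e) offsets, that the indices in `special` are contiguous, and that smaller y means higher on the page (the plotting script earlier in this thread negates y). Unlike the snippet above, it measures the bar position from the letter's actual last point and keeps the original pen state when returning.

```python
import numpy as np


def to_absolute(offsets):
    """Turn (dx, dy, e) offsets into absolute (x, y, e) points."""
    xy = np.cumsum(offsets[:, :2], axis=0)
    return np.concatenate([xy, offsets[:, 2:]], axis=1)


def inject_bar(offsets, special, margin=0.05):
    """Insert a horizontal bar above the letter made of the points in `special`."""
    pts = to_absolute(offsets[special])      # letter shape, relative to its start
    last_x, last_y = pts[-1, 0], pts[-1, 1]
    min_x, max_x = pts[:, 0].min(), pts[:, 0].max()
    top_y = pts[:, 1].min() - margin         # assumes smaller y is higher on the page
    cut = special[-1] + 1                    # inject right after the letter's last point
    injection = np.array([
        [0.0, 0.0, 1.0],                         # duplicate point that lifts the pen
        [min_x - last_x, top_y - last_y, 0.0],   # pen-up move to the bar's left end
        [max_x - min_x, 0.0, 1.0],               # draw the bar, then lift
        [last_x - max_x, last_y - top_y,
         offsets[cut - 1, 2]],                   # return; keep the original pen state
    ])
    return np.concatenate((offsets[:cut], injection, offsets[cut:]), axis=0)


# toy sequence: points 1..3 play the role of the special letter
offsets = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, -2.0, 0.0],
    [-0.5, 1.0, 0.0],
    [2.0, 0.0, 1.0],
])
special = [1, 2, 3]
result = inject_bar(offsets, special)
pos = to_absolute(result)
print(pos[5], pos[6])  # left and right end of the bar
```

The injected rows sum to a zero offset, so everything drawn after the letter stays where it was. The caveat from above still applies: with a real model, the points picked via phi may not line up exactly with the letter.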

techno-yogi commented 6 months ago

Did you ever extend the dataset to full sentences and get reasonable output for special characters? Digits and consecutive capitals generally perform poorly. I thought about adding this as well, but I am not sure about the scope of generating the x, y, z XML format.

Has anyone done this? Any tooling to make it easier?

Grzego / handwriting-generation

Extension to other languages #17