SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

Unicode support #94

Open slbinilkumar opened 7 years ago

slbinilkumar commented 7 years ago

Hi, what modifications have to be done for Unicode support? I need to do it for Indian languages.

stephenvxx commented 7 years ago

Change the dictionary to the Indian language and modify DeepSpeechModel.lua:

fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size))

Change dict_size to the length of the dictionary; for example, the length of dictionary_english is 29.
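
For reference, a minimal sketch of that change, assuming a hypothetical 64-symbol dictionary for the target language (the exact value has to match the number of entries in your dictionary file):

```lua
-- Sketch of the change inside DeepSpeechModel.lua; rnnHiddenSize and
-- fullyConnected come from the surrounding model definition.
-- dict_size must equal the number of lines in your dictionary file
-- (29 for dictionary_english; 64 below is only a placeholder for an Indic script).
local dict_size = 64
fullyConnected:add(nn.Linear(rnnHiddenSize, dict_size))
```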

slbinilkumar commented 7 years ago

Thank you.


slbinilkumar commented 7 years ago

In mapper.lua it is reading byte-wise, so that will break Unicode characters. I have changed mapper.lua to read characters with Unicode support.

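To illustrate why byte-wise reading breaks Indic scripts, here is a rough sketch (not from the repo) using the luautf8 module suggested below; each Devanagari character is three bytes in UTF-8, so a byte-wise loop never produces strings that match the dictionary entries:

```lua
local utf8 = require 'lua-utf8'  -- https://github.com/starwing/luautf8

local line = 'नमस्ते'

-- Byte-wise iteration: '.' matches single bytes, so a three-byte Devanagari
-- codepoint comes out as three meaningless fragments.
local bytes = 0
for _ in line:gmatch('.') do bytes = bytes + 1 end
print('bytes: ' .. bytes)        -- 18 for this six-codepoint string

-- Codepoint-wise iteration: one step per Unicode codepoint, which is what
-- the dictionary lookup in mapper.lua needs.
for _, c in utf8.codes(line) do
    io.write(utf8.char(c) .. ' ')
end
print()
```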

SeanNaren commented 7 years ago

If you could open a PR with those changes, that would be awesome :)

stephenvxx commented 7 years ago

@slbinilkumar Don't worry, use the Lua UTF-8 library instead of the string library: use utf8.lower instead of string.lower, and change the for loop at line 29 to:

```lua
for _, c in utf8.codes(line) do
    local character = utf8.char(c)
    table.insert(label, self.alphabet2token[character])
end
```

Please install the UTF-8 library: https://github.com/starwing/luautf8

*Make sure your dictionary is not copied from the Internet or elsewhere; you should write it yourself for the Indian language.
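
Not from the repo, but a quick sanity check one could run on a hand-written dictionary, assuming the same one-symbol-per-line format that mapper.lua reads; dict.txt and transcripts.txt are placeholder file names:

```lua
local utf8 = require 'lua-utf8'

-- Load the hand-written dictionary: one symbol per line, as mapper.lua expects.
local symbols = {}
for line in io.lines('dict.txt') do
    symbols[line] = true
end

-- Report any codepoint in the transcripts that the dictionary does not cover;
-- such characters would otherwise map to nil tokens during encoding.
for line in io.lines('transcripts.txt') do
    for _, c in utf8.codes(utf8.lower(line)) do
        local character = utf8.char(c)
        if not symbols[character] then
            print('missing from dictionary: ' .. character)
        end
    end
end
```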

slbinilkumar commented 7 years ago

Thank you. On Aug 28, 2017 12:23 PM, "Dat Thanh Vu" notifications@github.com wrote:

@slbinilkumar Don't worry, use the Lua UTF-8 library instead of the string library. My script:

```lua
require 'torch'
local utf8 = require 'lua-utf8'

-- construct an object to deal with the mapping
local mapper = torch.class('Mapper')

function mapper:__init(dictPath)
    assert(paths.filep(dictPath), dictPath .. ' not found')

    self.alphabet2token = {}
    self.token2alphabet = {}

    -- make maps
    local cnt = 0
    for line in io.lines(dictPath) do
        self.alphabet2token[line] = cnt
        self.token2alphabet[cnt] = line
        cnt = cnt + 1
    end
end

function mapper:encodeString(line)
    line = utf8.lower(line)
    local label = {}
    for _, c in utf8.codes(line) do
        local character = utf8.char(c)
        table.insert(label, self.alphabet2token[character])
    end
    return label
end

function mapper:decodeOutput(predictions)
    --[[
        Turns the predictions tensor into a list of the most likely tokens.

        NOTE:
            to compute WER we strip the beginning and ending spaces
    --]]
    local tokens = {}
    local blankToken = self.alphabet2token['$']
    local preToken = blankToken
    -- The prediction is a sequence of likelihood vectors
    local _, maxIndices = torch.max(predictions, 2)
    maxIndices = maxIndices:float():squeeze()
    for i = 1, maxIndices:size(1) do
        local token = maxIndices[i] - 1 -- CTC indexes start from 1, while tokens start from 0
        -- add the token if it is not blank and not the same as the previous token
        if token ~= blankToken and token ~= preToken then
            table.insert(tokens, token)
        end
        preToken = token
    end
    return tokens
end

function mapper:tokensToText(tokens)
    local text = ""
    for i, t in ipairs(tokens) do
        text = text .. self.token2alphabet[tokens[i]]
    end
    return text
end
```
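
To round this out, a hedged usage sketch of the Mapper above, assuming the script is saved as Mapper.lua on the Lua path and that dict.txt is a dictionary file in the one-symbol-per-line format (both names are placeholders):

```lua
require 'torch'    -- also makes the paths package used by Mapper available
require 'Mapper'   -- the class defined above, assumed saved as Mapper.lua

local mapper = Mapper('dict.txt')            -- builds alphabet2token / token2alphabet
local label = mapper:encodeString('नमस्ते')    -- one integer token per codepoint found in dict.txt
print(mapper:tokensToText(label))            -- maps the tokens straight back to text
```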