eabdullin / Word2Vec.Net

implementation Word2Vec for .Net framework
126 stars 41 forks

Training Issue #10

Open visualizeMath opened 8 years ago

visualizeMath commented 8 years ago

Hi. First of all, thank you very much for your help. You have saved my life at least several times :) My question is about some problems I have experienced while training word2vec on a large data corpus. The data I'd like to use for training is almost 4 GB, and I wonder whether that is possible at all. I tried training word2vec on 2 GB of data and that didn't work either. Should I increase the heap size or something like that?

eabdullin commented 8 years ago

Can you share your training data? I'll try to train vectors :)

CaCTuCaTu4ECKuu commented 8 years ago

I found the cause of this issue and of #1. I used about 100 MB of data from the internet and was surprised to get an exception, but then I realized that when I call StreamReader.ReadLine() I read the whole file at once, because the text is stored with only spaces between words and no line breaks, and that is what causes the exception. I'm actually not sure how to fix this while keeping the same performance, because the training uses threads and seeking, and you can't just seek through a single line.

CaCTuCaTu4ECKuu commented 8 years ago

I solved this by preprocessing the training file and putting a limited number of words on each line. A single solid line causes problems even when opening the file in Notepad++, whereas the processed files open instantly.
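
For anyone hitting the same problem, here is a minimal sketch of that kind of preprocessing step. It is a standalone Python script, not part of Word2Vec.Net and not the exact code used above. The function name `rewrap_words`, the default of 1000 words per line, and the chunk size are assumptions chosen for illustration. It reads the input in fixed-size chunks, so a corpus stored as one giant line never has to fit in memory, and writes the same words back out with a bounded number of words per line.

```python
def rewrap_words(src_path, dst_path, words_per_line=1000, chunk_chars=1 << 20):
    """Copy src_path to dst_path with at most words_per_line words per line.

    Hypothetical helper: the name and default values are illustrative only.
    """
    pending = []    # complete words not yet written out
    leftover = ""   # possibly partial word cut off at the end of the last chunk
    with open(src_path, "r", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_chars)  # bounded read, never a whole line
            if not chunk:
                break
            text = leftover + chunk
            words = text.split()  # split on any whitespace
            # If the chunk ends mid-word, hold the last piece back so it can
            # be joined with the start of the next chunk.
            if words and not text[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            pending.extend(words)
            # Flush every full group of words_per_line words as one line.
            while len(pending) >= words_per_line:
                dst.write(" ".join(pending[:words_per_line]) + "\n")
                del pending[:words_per_line]
        # Write whatever remains, including a final held-back word.
        if leftover:
            pending.append(leftover)
        if pending:
            dst.write(" ".join(pending) + "\n")
```

Calling `rewrap_words("corpus.txt", "corpus_wrapped.txt")` (both file names are placeholders) would produce a file that line-based readers can consume one line at a time. Whether 1000 words per line is a good value for training is not something this thread establishes, so treat it as a starting point.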