Setup word2vec experiment IPython notebook

transfluxus commented 7 years ago

Added some Jupyter notebooks to the test folder. See the latest commits.

todo:

Add a cell or module call that combines all .txt files from one folder into a single .txt file

schwittlick commented 7 years ago

Add a cell or module call that combines all .txt files from one folder into a single .txt file:

cd /folder/containing/texts/
cat * > text_collection.txt

/folder/containing/texts/text_collection.txt then contains all the combined texts.
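For the notebook cell, the same step could look roughly like the sketch below. The function name `combine_texts` and the choice to read only `*.txt` files are assumptions for illustration, not something already in the repo. Unlike `cat *`, it skips the output file itself, so a rerun in the same folder does not pick up the previous result, and it inserts a newline between files that do not end with one.

```python
from pathlib import Path


def combine_texts(folder, output_name="text_collection.txt"):
    """Concatenate every .txt file in `folder` into one file inside that folder."""
    folder = Path(folder)
    output_path = folder / output_name
    # List the sources before opening the output, and leave the output file out
    # so a second run does not append the earlier collection to itself.
    sources = sorted(p for p in folder.glob("*.txt") if p.name != output_name)
    with output_path.open("w", encoding="utf-8") as out:
        for path in sources:
            # errors="replace" keeps going on files with broken encoding.
            text = path.read_text(encoding="utf-8", errors="replace")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep the last line of one file apart from the next
    return output_path
```

Calling `combine_texts("/folder/containing/texts/")` would write the collection to the same path as the shell version.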

schwittlick commented 7 years ago

I'll try to make a Jupyter notebook combining our experiments

https://github.com/mrzl/ECO/blob/master/src/python/irc/word2vec.py

schwittlick commented 7 years ago

add some important things from here https://github.com/kylemcdonald/ml-examples/blob/master/workshop/word2vec/word2vec.ipynb

schwittlick commented 7 years ago

https://github.com/kylemcdonald/ml-examples/blob/master/workshop/lda_tsne/LDA%20and%20t-SNE.ipynb

transfluxus commented 7 years ago

Trained it again (locally :)) and got a different size...

Word2Vec(vocab=167131, size=100, alpha=0.025)
model memory size, mb: 207.205104828
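The reported size can be reproduced with a bit of arithmetic, assuming the figure is an estimate of the kind gensim's `estimate_memory` produces under default settings (negative sampling on, hierarchical softmax off): about 500 bytes of bookkeeping per vocabulary entry plus two float32 matrices of shape vocab x size. That breakdown is an assumption on my side; the thread does not say how the number was computed.

```python
VOCAB = 167131            # from the Word2Vec(...) summary above
SIZE = 100                # vector dimensionality
FLOAT32_BYTES = 4
VOCAB_ENTRY_BYTES = 500   # assumed per-word overhead (no hierarchical softmax)


def estimate_model_mb(vocab, size):
    """Rough memory estimate in MiB: vocabulary overhead plus two weight matrices."""
    vocab_bytes = vocab * VOCAB_ENTRY_BYTES
    input_vectors = vocab * size * FLOAT32_BYTES    # the word vectors themselves
    output_weights = vocab * size * FLOAT32_BYTES   # negative-sampling output layer
    return (vocab_bytes + input_vectors + output_weights) / 1024 / 1024


print(estimate_model_mb(VOCAB, SIZE))
```

This comes out at roughly 207.2, which matches the value above, so the size scales linearly with both the vocabulary and the `size` parameter.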

Selected only the NN-tagged words (100108 of them) and applied k-means to that list with k = 10, 50 and 200.

k = 50 and k = 200 brought out some nice groups:

...
['Bristol', 'Norfolk', 'Surrey', 'Vancouver', 'Harbour', 'Southampton', 'Bridge', 'Miami', 'Riverside', 'Beach', 'Newark', 'Greenwich', 'Port', 'Ottawa', 'Hamburg', 'Loch', 'Rapids', 'Cleveland', 'Bremen', 'Anchorage', 'Cornwall', 'Zoo', 'Liverpool', 'Charleston', 'Baltimore', 'Dresden', 'Parkway', 'Pittsburgh', 'Kingston', 'Istanbul']
['left-wing', 'leftist', 'rightwing', 'far-right', 'right-wing', 'populist', 'single-issue', 'rightist', 'Zionist', 'nationalist', 'Reaganite', 'anti-state', 'Nationalist', 'dissident', 'anti-nuclear', 'antistatist', 'antinuclear', 'antidemocratic', 'antigovernment', 'extremist', 'antiestablishment', 'Populist', 'leftwing', 'anticolonial', 'Gandhian', 'isolationist', 'Right-wing', 'anticapitalist', 'pro-Soviet', 'anti-slavery']
['Claxton', 'Hurley', 'Eileen', 'Rucker', 'Haney', 'Woodward', 'Walters', 'Willard', 'Desmond', 'Shelly', 'Elgar', 'Beecher', 'Chatto', 'Franc', 'Wiseman', 'Conley', 'Belinda', 'Sutton', 'Ira', 'Kristin', 'Finlayson', 'J.C.', 'Benson', 'Debra', 'Caitlin', 'Kirsten', 'Joselito', 'Farrell', 'Heather', 'Tiffin']
['retaliation', 'perjury', 'deportation', 'reprisals', 'wrongdoing', 'torture', 'fraud', 'prosecution', 'retaliatory', 'libel', 'bribery', 'arson', 'misconduct', 'kidnapping', 'retribution', 'negligence', 'wrongful', 'treason', 'murder', 'prosecutions', 'intimidation', 'vandalism', 'defamation', 'robbery', 'theft', 'harassment', 'mistreatment', 'inaction', 'entrapment', 'adultery']
['POINT', 'THIRD', 'PROVE', 'UPON', 'GIVEN', 'NIGHT', 'HIGHER', 'AFFECTING', 'FTER', 'TOUCH', 'REQUIRED', 'SEEM', 'RATHER', 'UNDERSTANDING', 'HIM', 'ACTUALLY', 'ARMY', 'PUBLISHED', 'COURT', 'POLLUTION', 'WHOSE', 'NUMBERS', 'SENSE', 'SERVICING', 'SEEMS', 'USES', 'ISSUE', 'HARM', 'SETTING', 'ALMOST']
['overtaken', 'assailed', 'repulsed', 'interrogated', 'seduced', 'shunned', 'devoured', 'gripped', 'captivated', 'bewitched', 'smitten', 'shaken', 'hobbled', 'surmounted', 'eclipsed', 'repelled', 'scrutinised', 'overrun', 'overruled', 'menaced', 'harmed', 'annulled', 'terrorized', 'attacked', 'decreed', 'con\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdrmed', 'unimpressed', 'stunned', 'baffled', 'reciprocated']
['xuz', 'NFZWIODI', 'LZNCIB', 'hgfnn', 'qzb', 'gfpi', 'habe', 'AWCFW', 'bem', 'kcbg', 'Acti', 'bch', 'Medi', 'cosas', 'kinn', '\xef\xbf\xbd\xef\xbf\xbd191', 'Pe', 'rke', 'multis', 'Sed', 'kgfb', 'oa', '\xef\xbf\xbd\xef\xbf\xbd32', 'licet', 'DRAINED', 'tct', 'tiv', 'podemos', 'ENLIG', 'reve']
['ratio', 'coefficient', 'NPV', 'NPP', 'amplitude', 'viscosity', 'conductance', 'OLR', 'thickness', 'modulus', 'density', 'signal-to-noise', 'increment', 'GPP', 'velocity', 'emissivity', 'Gini', 'loudness', 'elasticity', 'voltage', 'salinity', 'chromospheric', 'coefficients', 'transformity', 'probability', 'width', 'LAI', 'coef\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdcient', 'output', 'ET']
['Fot', 'kcbg', 'bgi', 'ita', 'atio', 'bgfb', 'oub', 'gch', 'bgx', 'WUMIU', 'etiam', 'buu', 'Sed', 'gcm', 'kinn', 'kgfb', 'cette', 'ua', 'mx', 'auw', 'tct', 'erit', 'Cl', 'Et', 'stru', 'AWCFW', 'sses', 'gfpi', 'rivacy', 'conta']
...

schwittlick commented 7 years ago

check the meaning of these parameters:

model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
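For reference: `size` is the dimensionality of the word vectors (renamed `vector_size` in gensim 4.0), `window` is the maximum distance between the current word and a context word, `min_count` drops every word that occurs fewer times than that in the whole corpus, and `workers` is the number of training threads. The sketch below is not gensim code, just a plain-Python illustration of how `min_count` and `window` decide which word pairs the model gets to see. The function name `training_pairs` is made up for this example, and real training does more (for instance, it varies the effective window per position).

```python
from collections import Counter


def training_pairs(sentences, window=5, min_count=5):
    """Yield (center, context) pairs after min_count filtering and windowing."""
    counts = Counter(word for sentence in sentences for word in sentence)
    for sentence in sentences:
        # min_count: rare words are removed before any windowing happens,
        # so their neighbours end up next to each other.
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            # window: only words at most `window` positions away count as context.
            for j in range(lo, hi):
                if j != i:
                    yield center, kept[j]


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
pairs = list(training_pairs(sentences, window=1, min_count=2))
```

In this toy corpus "cat" and "dog" each occur once, so with `min_count=2` they disappear and only pairs between "the" and "sat" remain. On a noisy corpus a higher `min_count` is one way to keep OCR fragments out of the vocabulary.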

schwittlick commented 7 years ago

adjust logging to see what's going on when training:

import logging
import os
import sys

# name the logger after the running script
program = os.path.basename(sys.argv[0])
logger = logging.getLogger(program)

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)
logger.info("running %s", ' '.join(sys.argv))

taken from here http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim

schwittlick commented 7 years ago

BTW, for training our own model it should be alright to use ALL the parsed text, not only the valid sentences. That would result in a corpus/dataset of 2 GB of plain text :~)

@transfluxus: How can we find out whether a model is good or bad? How do you select only the NN words? How did you run the k-means clustering on the model vocab?

schwittlick commented 7 years ago

python train_word2vec_model.py --input_path /home/mar/code/marcel/ECO/data/text/nail/ --verbose
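A command line like the one above could be wired up roughly as follows. Only the two flags `--input_path` and `--verbose` come from the command itself; the function names and the mapping of `--verbose` to the INFO log level are assumptions for illustration and may not match what `train_word2vec_model.py` actually does.

```python
import argparse
import logging


def parse_args(argv=None):
    """Parse the two flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Train a word2vec model on a folder of text files.")
    parser.add_argument("--input_path", required=True,
                        help="folder containing the training texts")
    parser.add_argument("--verbose", action="store_true",
                        help="log training progress at INFO level")
    return parser.parse_args(argv)


def configure_logging(verbose):
    """Assumed behaviour: --verbose switches the root logger to INFO."""
    level = logging.INFO if verbose else logging.WARNING
    logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                        level=level)
    return level
```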

transfluxus commented 7 years ago

See my only commit from 3 days ago. One of the notebooks goes through the whole vocabulary of the model and keeps only the NN-tagged words.

transfluxus commented 7 years ago

K-means is from SciPy.
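To show what the clustering step does with the filtered vocabulary, here is a dependency-free sketch. It swaps in a plain-Python version of Lloyd's algorithm instead of the SciPy routine the notebook uses, and it runs on made-up 2-D toy vectors rather than real model vectors. The names `kmeans` and `group_words` are hypothetical.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def group_words(words, vectors, k):
    """Return the words grouped by their k-means cluster."""
    groups = {}
    for word, label in zip(words, kmeans(vectors, k)):
        groups.setdefault(label, []).append(word)
    return list(groups.values())


words = ["Bristol", "Norfolk", "fraud", "theft"]
vectors = [[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]  # toy stand-ins
groups = group_words(words, vectors, k=2)
```

With the real model, `words` would be the NN-tagged vocabulary and `vectors` the corresponding rows of the trained model; each resulting group is one of the lists printed earlier in the thread.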

schwittlick commented 7 years ago

Is the k-means code in there too?

Schwittleymani / ECO

Setup word2vec experiment IPython notebook #162