I have started a new branch to try out some ideas on this issue.
One technique for finding pages is to look up all pages with a certain word in their title, but I was not able to get this working with the wikipedia package. I am now looking at the more complete pywikibot package as an alternative...
I have decided to switch from wikipedia to pywikibot since it is more flexible and provides more options for discovering pages related to a word. I am running some tests now and will open a PR.
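As a rough illustration of the kind of title search pywikibot makes possible, the sketch below uses an intitle: search query; the query form and the result cap are assumptions about how such an index could be built, not the exact code in the branch.

```python
import pywikibot

def find_titles(word, limit=10000):
    """Collect up to `limit` article titles containing `word` (illustrative sketch)."""
    site = pywikibot.Site('en', 'wikipedia')
    titles = []
    # 'intitle:' restricts the MediaWiki search to page titles; namespace 0 = main articles.
    for page in site.search('intitle:{}'.format(word), namespaces=[0], total=limit):
        titles.append(page.title())
    return titles

print(len(find_titles('Africa', limit=100)))
```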
I have now built an index of up to 10K articles for each word, for a total of ~4M articles. This is a much larger corpus than I used before, so it should improve the embedding significantly. The article counts are below. Note that a few words (ketchup, leprechaun, loch_ness) have far fewer articles available, but that is probably unavoidable. The next step is to download and preprocess all of these articles. I am trying out multiprocessing.Pool to speed this up (see the sketch after the counts)...
10000 corpus/Africa.index
9709 corpus/Agent.index
9724 corpus/Air.index
9600 corpus/Alien.index
10000 corpus/Alps.index
9561 corpus/Amazon.index
10000 corpus/Ambulance.index
10000 corpus/America.index
10000 corpus/Angel.index
10000 corpus/Antarctica.index
10000 corpus/Apple.index
9751 corpus/Arm.index
10000 corpus/Atlantis.index
10000 corpus/Australia.index
9702 corpus/Aztec.index
9586 corpus/Back.index
9895 corpus/Ball.index
9261 corpus/Band.index
10000 corpus/Bank.index
10000 corpus/Bar.index
9904 corpus/Bark.index
10000 corpus/Bat.index
9722 corpus/Battery.index
10000 corpus/Beach.index
10000 corpus/Bear.index
9507 corpus/Beat.index
10000 corpus/Bed.index
10000 corpus/Beijing.index
10000 corpus/Bell.index
9079 corpus/Belt.index
10000 corpus/Berlin.index
10000 corpus/Bermuda.index
10000 corpus/Berry.index
9826 corpus/Bill.index
9747 corpus/Block.index
9741 corpus/Board.index
9646 corpus/Bolt.index
10000 corpus/Bomb.index
9644 corpus/Bond.index
9609 corpus/Boom.index
10000 corpus/Boot.index
10000 corpus/Bottle.index
9239 corpus/Bow.index
9372 corpus/Box.index
10000 corpus/Bridge.index
9496 corpus/Brush.index
9734 corpus/Buck.index
9699 corpus/Buffalo.index
9660 corpus/Bug.index
4080 corpus/Bugle.index
10000 corpus/Button.index
6745 corpus/Calf.index
10000 corpus/Canada.index
10000 corpus/Cap.index
9460 corpus/Capital.index
10000 corpus/Car.index
9600 corpus/Card.index
6847 corpus/Carrot.index
10000 corpus/Casino.index
8844 corpus/Cast.index
10000 corpus/Cat.index
9336 corpus/Cell.index
5147 corpus/Centaur.index
9881 corpus/Center.index
9391 corpus/Chair.index
9680 corpus/Change.index
9530 corpus/Charge.index
9579 corpus/Check.index
9825 corpus/Chest.index
9716 corpus/Chick.index
10000 corpus/China.index
10000 corpus/Chocolate.index
9508 corpus/Church.index
10000 corpus/Circle.index
10000 corpus/Cliff.index
7816 corpus/Cloak.index
9776 corpus/Club.index
10000 corpus/Code.index
9724 corpus/Cold.index
9923 corpus/Comic.index
9482 corpus/Compound.index
10000 corpus/Concert.index
9310 corpus/Conductor.index
10000 corpus/Contract.index
9665 corpus/Cook.index
10000 corpus/Copper.index
10000 corpus/Cotton.index
10000 corpus/Court.index
9525 corpus/Cover.index
9425 corpus/Crane.index
9685 corpus/Crash.index
10000 corpus/Cricket.index
10000 corpus/Cross.index
9329 corpus/Crown.index
9698 corpus/Cycle.index
9716 corpus/Czech.index
10000 corpus/Dance.index
9469 corpus/Date.index
10000 corpus/Day.index
10000 corpus/Death.index
9018 corpus/Deck.index
9307 corpus/Degree.index
10000 corpus/Diamond.index
9434 corpus/Dice.index
10000 corpus/Dinosaur.index
10000 corpus/Disease.index
9694 corpus/Doctor.index
10000 corpus/Dog.index
9356 corpus/Draft.index
10000 corpus/Dragon.index
9890 corpus/Dress.index
10000 corpus/Drill.index
9472 corpus/Drop.index
10000 corpus/Duck.index
9624 corpus/Dwarf.index
10000 corpus/Eagle.index
10000 corpus/Egypt.index
10000 corpus/Embassy.index
10000 corpus/Engine.index
10000 corpus/England.index
10000 corpus/Europe.index
10000 corpus/Eye.index
10000 corpus/Face.index
10000 corpus/Fair.index
9729 corpus/Fall.index
9552 corpus/Fan.index
9558 corpus/Fence.index
9785 corpus/Field.index
9632 corpus/Fighter.index
9565 corpus/Figure.index
9575 corpus/File.index
10000 corpus/Film.index
10000 corpus/Fire.index
10000 corpus/Fish.index
10000 corpus/Flute.index
10000 corpus/Fly.index
9830 corpus/Foot.index
10000 corpus/Force.index
10000 corpus/Forest.index
9661 corpus/Fork.index
10000 corpus/France.index
10000 corpus/Game.index
10000 corpus/Gas.index
9548 corpus/Genius.index
10000 corpus/Germany.index
10000 corpus/Ghost.index
9732 corpus/Giant.index
10000 corpus/Glass.index
10000 corpus/Glove.index
10000 corpus/Gold.index
9511 corpus/Grace.index
10000 corpus/Grass.index
10000 corpus/Greece.index
10000 corpus/Green.index
9471 corpus/Ground.index
10000 corpus/Ham.index
10000 corpus/Hand.index
10000 corpus/Hawk.index
10000 corpus/Head.index
10000 corpus/Heart.index
10000 corpus/Helicopter.index
8105 corpus/Himalayas.index
9574 corpus/Hole.index
10000 corpus/Hollywood.index
10000 corpus/Honey.index
9543 corpus/Hood.index
9576 corpus/Hook.index
9497 corpus/Horn.index
10000 corpus/Horse.index
9982 corpus/Horseshoe.index
10000 corpus/Hospital.index
10000 corpus/Hotel.index
10000 corpus/Ice cream.index
10000 corpus/Ice.index
10000 corpus/India.index
10000 corpus/Iron.index
10000 corpus/Ivory.index
9723 corpus/Jack.index
9797 corpus/Jam.index
9629 corpus/Jet.index
10000 corpus/Jupiter.index
10000 corpus/Kangaroo.index
2116 corpus/Ketchup.index
9570 corpus/Key.index
9671 corpus/Kid.index
10000 corpus/King.index
6460 corpus/Kiwi.index
10000 corpus/Knife.index
10000 corpus/Knight.index
9249 corpus/Lab.index
9656 corpus/Lap.index
10000 corpus/Laser.index
10000 corpus/Lawyer.index
10000 corpus/Lead.index
10000 corpus/Lemon.index
1669 corpus/Leprechaun.index
10000 corpus/Life.index
10000 corpus/Light.index
4123 corpus/Limousine.index
9327 corpus/Line.index
9672 corpus/Link.index
10000 corpus/Lion.index
10000 corpus/Litter.index
1571 corpus/Loch ness.index
9704 corpus/Lock.index
9307 corpus/Log.index
10000 corpus/London.index
9791 corpus/Luck.index
10000 corpus/Mail.index
7335 corpus/Mammoth.index
10000 corpus/Maple.index
10000 corpus/Marble.index
10000 corpus/March.index
10000 corpus/Mass.index
9401 corpus/Match.index
9387 corpus/Mercury.index
10000 corpus/Mexico.index
6278 corpus/Microscope.index
9885 corpus/Millionaire.index
9572 corpus/Mine.index
9444 corpus/Mint.index
10000 corpus/Missile.index
9687 corpus/Model.index
9696 corpus/Mole.index
10000 corpus/Moon.index
10000 corpus/Moscow.index
9798 corpus/Mount.index
10000 corpus/Mouse.index
9857 corpus/Mouth.index
5494 corpus/Mug.index
9618 corpus/Nail.index
9595 corpus/Needle.index
9513 corpus/Net.index
9804 corpus/New york.index
10000 corpus/Night.index
10000 corpus/Ninja.index
9503 corpus/Note.index
10000 corpus/Novel.index
10000 corpus/Nurse.index
9787 corpus/Nut.index
6879 corpus/Octopus.index
10000 corpus/Oil.index
10000 corpus/Olive.index
5000 corpus/Olympus.index
10000 corpus/Opera.index
9660 corpus/Orange.index
9468 corpus/Organ.index
9685 corpus/Palm.index
9790 corpus/Pan.index
9723 corpus/Pants.index
10000 corpus/Paper.index
9964 corpus/Parachute.index
10000 corpus/Park.index
9594 corpus/Part.index
9679 corpus/Pass.index
9659 corpus/Paste.index
10000 corpus/Penguin.index
9592 corpus/Phoenix.index
10000 corpus/Piano.index
9849 corpus/Pie.index
9461 corpus/Pilot.index
9721 corpus/Pin.index
9528 corpus/Pipe.index
10000 corpus/Pirate.index
10000 corpus/Pistol.index
9619 corpus/Pit.index
9699 corpus/Pitch.index
9589 corpus/Plane.index
10000 corpus/Plastic.index
9449 corpus/Plate.index
1489 corpus/Platypus.index
9739 corpus/Play.index
9416 corpus/Plot.index
9859 corpus/Point.index
10000 corpus/Poison.index
9548 corpus/Pole.index
10000 corpus/Police.index
9637 corpus/Pool.index
10000 corpus/Port.index
9635 corpus/Post.index
9277 corpus/Pound.index
9693 corpus/Press.index
10000 corpus/Princess.index
8082 corpus/Pumpkin.index
9698 corpus/Pupil.index
10000 corpus/Pyramid.index
9744 corpus/Queen.index
10000 corpus/Rabbit.index
3835 corpus/Racket.index
9721 corpus/Ray.index
10000 corpus/Revolution.index
9566 corpus/Ring.index
9574 corpus/Robin.index
10000 corpus/Robot.index
9699 corpus/Rock.index
10000 corpus/Rome.index
10000 corpus/Root.index
10000 corpus/Rose.index
3678 corpus/Roulette.index
9388 corpus/Round.index
9508 corpus/Row.index
9788 corpus/Ruler.index
10000 corpus/Satellite.index
10000 corpus/Saturn.index
9420 corpus/Scale.index
10000 corpus/School.index
10000 corpus/Scientist.index
10000 corpus/Scorpion.index
9193 corpus/Screen.index
1122 corpus/Scuba diver.index
9512 corpus/Seal.index
8680 corpus/Server.index
9813 corpus/Shadow.index
10000 corpus/Shakespeare.index
10000 corpus/Shark.index
10000 corpus/Ship.index
10000 corpus/Shoe.index
9658 corpus/Shop.index
8904 corpus/Shot.index
9591 corpus/Sink.index
10000 corpus/Skyscraper.index
9417 corpus/Slip.index
10000 corpus/Slug.index
9669 corpus/Smuggler.index
10000 corpus/Snow.index
2612 corpus/Snowman.index
9959 corpus/Sock.index
10000 corpus/Soldier.index
10000 corpus/Soul.index
10000 corpus/Sound.index
10000 corpus/Space.index
9518 corpus/Spell.index
10000 corpus/Spider.index
9675 corpus/Spike.index
9570 corpus/Spine.index
9298 corpus/Spot.index
9767 corpus/Spring.index
10000 corpus/Spy.index
10000 corpus/Square.index
10000 corpus/Stadium.index
9574 corpus/Staff.index
10000 corpus/Star.index
9601 corpus/State.index
9612 corpus/Stick.index
10000 corpus/Stock.index
9907 corpus/Straw.index
10000 corpus/Stream.index
9646 corpus/Strike.index
9358 corpus/String.index
9617 corpus/Sub.index
9572 corpus/Suit.index
10000 corpus/Superhero.index
9522 corpus/Swing.index
9906 corpus/Switch.index
9664 corpus/Table.index
9601 corpus/Tablet.index
9586 corpus/Tag.index
10000 corpus/Tail.index
9533 corpus/Tap.index
10000 corpus/Teacher.index
9989 corpus/Telescope.index
10000 corpus/Temple.index
10000 corpus/Theater.index
9807 corpus/Thief.index
9830 corpus/Thumb.index
7516 corpus/Tick.index
9505 corpus/Tie.index
10000 corpus/Time.index
10000 corpus/Tokyo.index
10000 corpus/Tooth.index
9871 corpus/Torch.index
10000 corpus/Tower.index
9645 corpus/Track.index
10000 corpus/Train.index
10000 corpus/Triangle.index
9450 corpus/Trip.index
9304 corpus/Trunk.index
9533 corpus/Tube.index
10000 corpus/Turkey.index
3014 corpus/Undertaker.index
8251 corpus/Unicorn.index
9905 corpus/Vacuum.index
10000 corpus/Van.index
9699 corpus/Vet.index
9836 corpus/Wake.index
9945 corpus/Wall.index
10000 corpus/War.index
2231 corpus/Washer.index
9834 corpus/Washington.index
10000 corpus/Watch.index
10000 corpus/Water.index
10000 corpus/Wave.index
9751 corpus/Web.index
9708 corpus/Well.index
10000 corpus/Whale.index
9803 corpus/Whip.index
10000 corpus/Wind.index
10000 corpus/Witch.index
10000 corpus/Worm.index
10000 corpus/Yard.index
3794722 total
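For the download step mentioned above, spreading the per-word work over a multiprocessing.Pool might look roughly like the sketch below; fetch_word is a hypothetical placeholder for whatever the script actually does for a single word.

```python
import glob
import multiprocessing

def fetch_word(index_path):
    """Hypothetical per-word worker: download the articles listed in one .index file."""
    # ... network-bound work for a single word would go here ...
    return index_path

if __name__ == '__main__':
    index_files = sorted(glob.glob('corpus/*.index'))
    # The work is dominated by network IO, so more workers than CPU cores can still help.
    pool = multiprocessing.Pool(processes=8)
    for done in pool.imap_unordered(fetch_word, index_files):
        print('finished', done)
    pool.close()
    pool.join()
```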
For reference, the original corpus had:
The practical limit is probably the memory requirement for training. In order to increase the corpus size by a factor of ~10, we should aim for ~5M characters per word in the new corpus.
The following shows, for the first 10 words, the number of articles required to reach 1M characters and the total characters reached:
Africa 95 1088327
Agent 213 1002219
Air 154 1015527
Alien 146 1000752
Alps 179 1001990
Amazon 366 1080288
Ambulance 105 1025889
America 132 1027595
Angel 126 1045920
Antarctica 227 1203372
This indicates that 500-2,000 articles per word will be required to build the new corpus. Ideally, we would randomly sub-sample all the available articles for a word, but this is not really practical since downloading an article is an all-or-nothing proposition. Instead, we can iterate through the articles in a random order until we reach 5M characters.
I ran the new fetch_corpus_text.py script to convert each corpus/Word.index file into a corresponding corpus/Word.txt.gz file that contains the plain (unicode) text of a random subset of the indexed articles, in order to reach ~5M characters of text.
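A minimal sketch of that strategy, iterating over the indexed titles in random order until a character budget is reached, is below. The one-title-per-line index format and the use of page.text (raw wikitext rather than cleaned plain text) are assumptions, not the actual fetch_corpus_text.py logic.

```python
import gzip
import random
import pywikibot

def fetch_text(index_path, out_path, budget=5000000, seed=123):
    """Download a random subset of the indexed articles until `budget` characters are collected."""
    site = pywikibot.Site('en', 'wikipedia')
    with open(index_path) as f:
        titles = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(titles)
    total = 0
    with gzip.open(out_path, 'wt', encoding='utf-8') as out:
        for title in titles:
            if total >= budget:
                break
            text = pywikibot.Page(site, title).text  # raw wikitext; cleanup omitted here
            out.write(text + '\n')
            total += len(text)
    return total
```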
The next step is to preprocess each text file so it can be fed directly into word2vec. Most of this is already done by merge_corpus.py, but it will need some minor updates.
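The exact steps live in merge_corpus.py, but the general shape of this preprocessing is sketched below: lowercase the text, join multi-word target words (ice cream, new york, scuba diver) with underscores so they become single tokens, and split into one tokenized sentence per line. The regexes and the PHRASES list are illustrative assumptions.

```python
import re

# Multi-word target words are joined with "_" so they appear as single tokens
# (the preprocessed stats below show forms like ice_cream and scuba_diver).
PHRASES = ['ice cream', 'loch ness', 'new york', 'scuba diver']

def preprocess(text):
    """Return a list of tokenized sentences suitable for word2vec (illustrative sketch)."""
    text = text.lower()
    for phrase in PHRASES:
        text = text.replace(phrase, phrase.replace(' ', '_'))
    sentences = []
    # Crude sentence split on ., ! or ? followed by whitespace.
    for sent in re.split(r'(?<=[.!?])\s+', text):
        tokens = re.findall(r"[a-z_']+", sent)
        if tokens:
            sentences.append(tokens)
    return sentences

print(preprocess('I had ice cream in New York. It was great!'))
```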
Preprocessing is finished now and saves some statistics on each word:
WORD TOTFREQ XFREQ NSENT NWORD
time 375916 371703 31976 734285
state 235078 222082 33116 775979
school 233885 215214 32939 758983
game 227487 216503 36847 808653
film 224783 214068 33813 807208
part 224584 218982 33322 780953
well 203564 200311 35424 780812
war 196840 191488 36427 785049
air 159591 145649 34186 769514
back 152262 147923 37028 808733
...
racket 800 434 36053 792250
loch_ness 781 308 35429 792689
platypus 763 364 31681 766810
sock 750 612 36137 815270
washer 707 431 36867 798022
vet 628 478 36824 789117
leprechaun 595 282 35303 812171
mug 552 389 35017 810713
smuggler 494 353 36358 799093
scuba_diver 278 102 34637 796666
The columns are:
Most words appear often in the articles selected for other words, which is good since this is how the embedding learns the relationship between these words. Unfortunately, words at the bottom of the list are quite rare even with this 10x expanded corpus.
The next step is to update the learning script to work with these new preprocessed files (Word.pre.gz) instead of the earlier single randomized corpus.txt.gz. The changes are to:
The learning parameters can probably stay the same. In particular, embedding into a 300-dimensional space still seems like a good choice.
The number of passes through the corpus can be reduced ~10x to account for the ~10x increase in corpus size.
The new corpus turns out to be too big to shuffle in memory, so I am using a partial shuffle instead. This still takes ~8 mins but does not need much memory and is fast enough.
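One common way to do a partial shuffle is a bounded-buffer shuffle over a line stream, sketched below; this illustrates the idea rather than reproducing the code actually used.

```python
import random

def partial_shuffle(lines, buffer_size=1000000, seed=1):
    """Yield lines in approximately random order using O(buffer_size) memory."""
    rng = random.Random(seed)
    buffer = []
    for line in lines:
        if len(buffer) < buffer_size:
            buffer.append(line)
        else:
            # Swap the incoming line with a random buffered line and emit the old one.
            i = rng.randrange(buffer_size)
            buffer[i], line = line, buffer[i]
            yield line
    rng.shuffle(buffer)
    for line in buffer:
        yield line
```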
Here are the word2vec stats for training with the new corpus:
2017-01-13 11:22:51,997 : INFO : collected 3515152 word types from a corpus of 313247904 raw words and 14208361 sentences
2017-01-13 11:22:51,997 : INFO : Loading a fresh vocabulary
2017-01-13 11:22:56,080 : INFO : min_count=100 retains 81515 unique words (2% of original 3515152, drops 3433637)
2017-01-13 11:22:56,080 : INFO : min_count=100 leaves 298768277 word corpus (95% of original 313247904, drops 14479627)
2017-01-13 11:22:56,309 : INFO : deleting the raw counts dictionary of 3515152 items
2017-01-13 11:22:57,145 : INFO : sample=0.001 downsamples 27 most-common words
2017-01-13 11:22:57,145 : INFO : downsampling leaves estimated 235025153 word corpus (78.7% of prior 298768277)
2017-01-13 11:22:57,145 : INFO : estimated required memory for 81515 words and 300 dimensions: 350514500 bytes
2017-01-13 11:22:57,298 : INFO : constructing a huffman tree from 81515 words
2017-01-13 11:23:00,498 : INFO : built huffman tree with maximum node depth 22
2017-01-13 11:23:00,708 : INFO : resetting layer weights
2017-01-13 11:23:01,939 : INFO : training model with 4 workers on 81515 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10
The vocab size here (~82K) might be too large, so try again with min_count=150:
2017-01-13 11:41:55,210 : INFO : collected 3515152 word types from a corpus of 313247904 raw words and 14208361 sentences
2017-01-13 11:41:55,211 : INFO : Loading a fresh vocabulary
2017-01-13 11:41:59,127 : INFO : min_count=150 retains 62796 unique words (1% of original 3515152, drops 3452356)
2017-01-13 11:41:59,127 : INFO : min_count=150 leaves 296490234 word corpus (94% of original 313247904, drops 16757670)
2017-01-13 11:41:59,297 : INFO : deleting the raw counts dictionary of 3515152 items
2017-01-13 11:42:00,126 : INFO : sample=0.001 downsamples 27 most-common words
2017-01-13 11:42:00,126 : INFO : downsampling leaves estimated 232590765 word corpus (78.4% of prior 296490234)
2017-01-13 11:42:00,126 : INFO : estimated required memory for 62796 words and 300 dimensions: 270022800 bytes
2017-01-13 11:42:00,252 : INFO : constructing a huffman tree from 62796 words
2017-01-13 11:42:02,702 : INFO : built huffman tree with maximum node depth 21
2017-01-13 11:42:02,870 : INFO : resetting layer weights
2017-01-13 11:42:03,871 : INFO : training model with 4 workers on 62796 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10
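For context, log lines like these correspond to a gensim Word2Vec call along the following lines; the corpus path, the restartable sentence iterator, and the epoch count are assumptions, and the parameter names shown (size, iter) are the pre-4.0 gensim spellings (vector_size and epochs in gensim 4).

```python
import gzip
import logging
from gensim.models import Word2Vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class Sentences(object):
    """Hypothetical restartable iterator over the preprocessed corpus (one sentence per line)."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with gzip.open(self.path, 'rt', encoding='utf-8') as f:
            for line in f:
                yield line.split()

model = Word2Vec(Sentences('corpus/shuffled.pre.gz'),
                 size=300, window=10, min_count=150, workers=4,
                 sg=1, hs=1, negative=5, sample=0.001, iter=10)
model.save('word2vec.model')
```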
For comparison, the training on the original small corpus used:
2017-01-01 18:59:28,025 : INFO : collected 552467 word types from a corpus of 29323417 raw words and 1309803 sentences
2017-01-01 18:59:28,025 : INFO : Loading a fresh vocabulary
2017-01-01 18:59:28,468 : INFO : min_count=45 retains 28701 unique words (5% of original 552467, drops 523766)
2017-01-01 18:59:28,468 : INFO : min_count=45 leaves 27486765 word corpus (93% of original 29323417, drops 1836652)
2017-01-01 18:59:28,684 : INFO : deleting the raw counts dictionary of 552467 items
2017-01-01 18:59:28,829 : INFO : sample=0.001 downsamples 30 most-common words
2017-01-01 18:59:28,829 : INFO : downsampling leaves estimated 21256403 word corpus (77.3% of prior 27486765)
2017-01-01 18:59:28,830 : INFO : estimated required memory for 28701 words and 300 dimensions: 123414300 bytes
2017-01-01 18:59:28,901 : INFO : constructing a huffman tree from 28701 words
2017-01-01 18:59:30,174 : INFO : built huffman tree with maximum node depth 19
2017-01-01 18:59:30,236 : INFO : resetting layer weights
2017-01-01 18:59:30,830 : INFO : training model with 12 workers on 28701 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10
To summarize:
For comparison, this article claims that:
At face value, this says the new corpus is ~7x EB!
The new corpus is drawn from an index of ~3.8M articles, compared with a total of ~5.4M in all of English Wikipedia! Only a subset of the indexed articles is actually used in the final corpus, but this indicates that some articles must appear more than once in the index files, which I didn't account for.
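A quick way to check that overlap would be to count unique titles across the index files; this sketch assumes one title per line in each .index file.

```python
import glob

titles = []
for path in glob.glob('corpus/*.index'):
    with open(path) as f:
        titles.extend(line.strip() for line in f if line.strip())

unique = set(titles)
print('total entries:', len(titles))
print('unique titles:', len(unique))
print('duplicated entries:', len(titles) - len(unique))
```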
Results from evaluate.py after 10 epochs (comparable to 100 epochs on the old corpus):
0.972 MARCH = april
0.931 FLUTE = clarinet
0.910 PIANO = violin
0.891 GOLD = silver
0.867 PISTOL = semi-automatic
0.852 CHOCOLATE = caramel
0.848 PANTS = trousers
0.845 MISSILE = surface-to-air
0.843 BERLIN = munich
0.840 TOKYO = osaka
0.837 DEGREE = bachelor
0.830 WHALE = humpback
0.828 CHURCH = episcopal
0.828 KETCHUP = mayonnaise
0.824 COURT = supreme
0.823 SERVER = client
0.823 THUMB = finger
0.821 JUPITER = neptune
0.814 GERMANY = austria
0.814 DISEASE = infection
0.849 PIANO + FLUTE = cello
0.805 PANTS + DRESS = trousers
0.755 LEMON + CHOCOLATE = vanilla
0.751 GERMANY + FRANCE = belgium
0.750 HORSESHOE + BAT = rhinolophus
0.729 STRING + PIANO = quartet
0.723 ICE_CREAM + CHOCOLATE = candy
0.719 PASTE + KETCHUP = garlic
0.718 WEB + SERVER = browser
0.709 TURKEY + GREECE = cyprus
0.707 HOTEL + CASINO = resort
0.703 ORGAN + FLUTE = harpsichord
0.703 PIANO + ORGAN = harpsichord
0.699 RABBIT + DOG = cat
0.696 PIANO + HORN = flute
0.690 STRING + FLUTE = violin
0.686 SCHOOL + DEGREE = graduate
0.679 HORN + FLUTE = trumpet
0.678 MOON + JUPITER = venus
0.672 GERMANY + CZECH = poland
Compare with the results for the old corpus in #9.
The new embedding looks good overall. Less obsessed with wrestling, but I wonder how many people would get the rhinolophus clue.
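Scores of this form (a clue's cosine similarity to one or two target words) come straight out of gensim; the sketch below shows the kind of query involved, assuming the saved model path and the lowercased, underscored vocabulary. (model.wv is the newer gensim spelling; older versions called most_similar on the model directly.)

```python
from gensim.models import Word2Vec

model = Word2Vec.load('word2vec.model')  # hypothetical path

# Best single-word clue for one target word, e.g. "0.972 MARCH = april".
print(model.wv.most_similar(positive=['march'], topn=1))

# Best clue covering two target words at once, e.g. "0.849 PIANO + FLUTE = cello".
print(model.wv.most_similar(positive=['piano', 'flute'], topn=1))
```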
The corpus is currently scraped from a list of Wikipedia articles that is automatically derived from the word list, but the resulting coverage (number of times each word appears) is quite uneven. This issue is to split the work now done by build_corpus.py into two tasks: first, build an index of articles to download for each word; second, download and preprocess those articles into the corpus/ directory. The motivation for this split is to allow the first step to be improved without needing to run the second step (which takes most of the time). An added benefit is that the second step could easily be parallelized since it is limited by network IO, not CPU.
The goal of the improvements to the first step is to: