dkirkby / CodeNames

AI for CodeNames
MIT License

Expand corpus #5

Closed: dkirkby closed this 7 years ago

dkirkby commented 7 years ago

The corpus is currently scraped from a list of wikipedia articles that is automatically derived from the word list, but the resulting coverage (number of times each word appears) is quite uneven. This issue is to split the work now done by build_corpus.py into two tasks:

1. Build an index of relevant wikipedia articles for each word in the word list.
2. Download and preprocess the text of the indexed articles.

The motivation for this split is to allow the first step to be improved without needing to run the second step (which takes most of the time). An added benefit is that the second step could be easily parallelized since it is limited by network IO, not CPU.

The goal of the improvements to the first step is to make the coverage much more uniform across words, by finding many more candidate articles for each word.

dkirkby commented 7 years ago

I have started a new branch to try out some ideas on this issue.

One technique for finding pages is to list all pages with a certain word in their title, but I was not able to get this working with the wikipedia package. I am looking at the more complete pywikibot package as an alternative now...

dkirkby commented 7 years ago

I have decided to switch from wikipedia to pywikibot since it is more flexible and provides more options for discovering pages related to a word. I am running some tests now and will open a PR.
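
For reference, a minimal sketch of the kind of title search pywikibot enables. The word 'Apple', the function name, and the 10K cap are illustrative, and the exact site.search signature varies a little between pywikibot versions:

import pywikibot

site = pywikibot.Site('en', 'wikipedia')

def titles_containing(word, limit=10000):
    # 'intitle:' restricts the MediaWiki full-text search to page titles.
    return [page.title() for page in
            site.search('intitle:' + word, namespaces=[0], total=limit)]

print(len(titles_containing('Apple')))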

dkirkby commented 7 years ago

I have now built an index of up to 10K articles for each word, for a total of ~4M articles. This is a much larger corpus than I used before, so it should improve the embedding significantly. The article counts are below. Note that a few words (ketchup, leprechaun, loch_ness) have far fewer articles available, but that is probably unavoidable. The next step is to download and preprocess all of these articles. I am trying out multiprocessing.Pool to speed this up (see the sketch after the counts below)...

   10000 corpus/Africa.index
    9709 corpus/Agent.index
    9724 corpus/Air.index
    9600 corpus/Alien.index
   10000 corpus/Alps.index
    9561 corpus/Amazon.index
   10000 corpus/Ambulance.index
   10000 corpus/America.index
   10000 corpus/Angel.index
   10000 corpus/Antarctica.index
   10000 corpus/Apple.index
    9751 corpus/Arm.index
   10000 corpus/Atlantis.index
   10000 corpus/Australia.index
    9702 corpus/Aztec.index
    9586 corpus/Back.index
    9895 corpus/Ball.index
    9261 corpus/Band.index
   10000 corpus/Bank.index
   10000 corpus/Bar.index
    9904 corpus/Bark.index
   10000 corpus/Bat.index
    9722 corpus/Battery.index
   10000 corpus/Beach.index
   10000 corpus/Bear.index
    9507 corpus/Beat.index
   10000 corpus/Bed.index
   10000 corpus/Beijing.index
   10000 corpus/Bell.index
    9079 corpus/Belt.index
   10000 corpus/Berlin.index
   10000 corpus/Bermuda.index
   10000 corpus/Berry.index
    9826 corpus/Bill.index
    9747 corpus/Block.index
    9741 corpus/Board.index
    9646 corpus/Bolt.index
   10000 corpus/Bomb.index
    9644 corpus/Bond.index
    9609 corpus/Boom.index
   10000 corpus/Boot.index
   10000 corpus/Bottle.index
    9239 corpus/Bow.index
    9372 corpus/Box.index
   10000 corpus/Bridge.index
    9496 corpus/Brush.index
    9734 corpus/Buck.index
    9699 corpus/Buffalo.index
    9660 corpus/Bug.index
    4080 corpus/Bugle.index
   10000 corpus/Button.index
    6745 corpus/Calf.index
   10000 corpus/Canada.index
   10000 corpus/Cap.index
    9460 corpus/Capital.index
   10000 corpus/Car.index
    9600 corpus/Card.index
    6847 corpus/Carrot.index
   10000 corpus/Casino.index
    8844 corpus/Cast.index
   10000 corpus/Cat.index
    9336 corpus/Cell.index
    5147 corpus/Centaur.index
    9881 corpus/Center.index
    9391 corpus/Chair.index
    9680 corpus/Change.index
    9530 corpus/Charge.index
    9579 corpus/Check.index
    9825 corpus/Chest.index
    9716 corpus/Chick.index
   10000 corpus/China.index
   10000 corpus/Chocolate.index
    9508 corpus/Church.index
   10000 corpus/Circle.index
   10000 corpus/Cliff.index
    7816 corpus/Cloak.index
    9776 corpus/Club.index
   10000 corpus/Code.index
    9724 corpus/Cold.index
    9923 corpus/Comic.index
    9482 corpus/Compound.index
   10000 corpus/Concert.index
    9310 corpus/Conductor.index
   10000 corpus/Contract.index
    9665 corpus/Cook.index
   10000 corpus/Copper.index
   10000 corpus/Cotton.index
   10000 corpus/Court.index
    9525 corpus/Cover.index
    9425 corpus/Crane.index
    9685 corpus/Crash.index
   10000 corpus/Cricket.index
   10000 corpus/Cross.index
    9329 corpus/Crown.index
    9698 corpus/Cycle.index
    9716 corpus/Czech.index
   10000 corpus/Dance.index
    9469 corpus/Date.index
   10000 corpus/Day.index
   10000 corpus/Death.index
    9018 corpus/Deck.index
    9307 corpus/Degree.index
   10000 corpus/Diamond.index
    9434 corpus/Dice.index
   10000 corpus/Dinosaur.index
   10000 corpus/Disease.index
    9694 corpus/Doctor.index
   10000 corpus/Dog.index
    9356 corpus/Draft.index
   10000 corpus/Dragon.index
    9890 corpus/Dress.index
   10000 corpus/Drill.index
    9472 corpus/Drop.index
   10000 corpus/Duck.index
    9624 corpus/Dwarf.index
   10000 corpus/Eagle.index
   10000 corpus/Egypt.index
   10000 corpus/Embassy.index
   10000 corpus/Engine.index
   10000 corpus/England.index
   10000 corpus/Europe.index
   10000 corpus/Eye.index
   10000 corpus/Face.index
   10000 corpus/Fair.index
    9729 corpus/Fall.index
    9552 corpus/Fan.index
    9558 corpus/Fence.index
    9785 corpus/Field.index
    9632 corpus/Fighter.index
    9565 corpus/Figure.index
    9575 corpus/File.index
   10000 corpus/Film.index
   10000 corpus/Fire.index
   10000 corpus/Fish.index
   10000 corpus/Flute.index
   10000 corpus/Fly.index
    9830 corpus/Foot.index
   10000 corpus/Force.index
   10000 corpus/Forest.index
    9661 corpus/Fork.index
   10000 corpus/France.index
   10000 corpus/Game.index
   10000 corpus/Gas.index
    9548 corpus/Genius.index
   10000 corpus/Germany.index
   10000 corpus/Ghost.index
    9732 corpus/Giant.index
   10000 corpus/Glass.index
   10000 corpus/Glove.index
   10000 corpus/Gold.index
    9511 corpus/Grace.index
   10000 corpus/Grass.index
   10000 corpus/Greece.index
   10000 corpus/Green.index
    9471 corpus/Ground.index
   10000 corpus/Ham.index
   10000 corpus/Hand.index
   10000 corpus/Hawk.index
   10000 corpus/Head.index
   10000 corpus/Heart.index
   10000 corpus/Helicopter.index
    8105 corpus/Himalayas.index
    9574 corpus/Hole.index
   10000 corpus/Hollywood.index
   10000 corpus/Honey.index
    9543 corpus/Hood.index
    9576 corpus/Hook.index
    9497 corpus/Horn.index
   10000 corpus/Horse.index
    9982 corpus/Horseshoe.index
   10000 corpus/Hospital.index
   10000 corpus/Hotel.index
   10000 corpus/Ice cream.index
   10000 corpus/Ice.index
   10000 corpus/India.index
   10000 corpus/Iron.index
   10000 corpus/Ivory.index
    9723 corpus/Jack.index
    9797 corpus/Jam.index
    9629 corpus/Jet.index
   10000 corpus/Jupiter.index
   10000 corpus/Kangaroo.index
    2116 corpus/Ketchup.index
    9570 corpus/Key.index
    9671 corpus/Kid.index
   10000 corpus/King.index
    6460 corpus/Kiwi.index
   10000 corpus/Knife.index
   10000 corpus/Knight.index
    9249 corpus/Lab.index
    9656 corpus/Lap.index
   10000 corpus/Laser.index
   10000 corpus/Lawyer.index
   10000 corpus/Lead.index
   10000 corpus/Lemon.index
    1669 corpus/Leprechaun.index
   10000 corpus/Life.index
   10000 corpus/Light.index
    4123 corpus/Limousine.index
    9327 corpus/Line.index
    9672 corpus/Link.index
   10000 corpus/Lion.index
   10000 corpus/Litter.index
    1571 corpus/Loch ness.index
    9704 corpus/Lock.index
    9307 corpus/Log.index
   10000 corpus/London.index
    9791 corpus/Luck.index
   10000 corpus/Mail.index
    7335 corpus/Mammoth.index
   10000 corpus/Maple.index
   10000 corpus/Marble.index
   10000 corpus/March.index
   10000 corpus/Mass.index
    9401 corpus/Match.index
    9387 corpus/Mercury.index
   10000 corpus/Mexico.index
    6278 corpus/Microscope.index
    9885 corpus/Millionaire.index
    9572 corpus/Mine.index
    9444 corpus/Mint.index
   10000 corpus/Missile.index
    9687 corpus/Model.index
    9696 corpus/Mole.index
   10000 corpus/Moon.index
   10000 corpus/Moscow.index
    9798 corpus/Mount.index
   10000 corpus/Mouse.index
    9857 corpus/Mouth.index
    5494 corpus/Mug.index
    9618 corpus/Nail.index
    9595 corpus/Needle.index
    9513 corpus/Net.index
    9804 corpus/New york.index
   10000 corpus/Night.index
   10000 corpus/Ninja.index
    9503 corpus/Note.index
   10000 corpus/Novel.index
   10000 corpus/Nurse.index
    9787 corpus/Nut.index
    6879 corpus/Octopus.index
   10000 corpus/Oil.index
   10000 corpus/Olive.index
    5000 corpus/Olympus.index
   10000 corpus/Opera.index
    9660 corpus/Orange.index
    9468 corpus/Organ.index
    9685 corpus/Palm.index
    9790 corpus/Pan.index
    9723 corpus/Pants.index
   10000 corpus/Paper.index
    9964 corpus/Parachute.index
   10000 corpus/Park.index
    9594 corpus/Part.index
    9679 corpus/Pass.index
    9659 corpus/Paste.index
   10000 corpus/Penguin.index
    9592 corpus/Phoenix.index
   10000 corpus/Piano.index
    9849 corpus/Pie.index
    9461 corpus/Pilot.index
    9721 corpus/Pin.index
    9528 corpus/Pipe.index
   10000 corpus/Pirate.index
   10000 corpus/Pistol.index
    9619 corpus/Pit.index
    9699 corpus/Pitch.index
    9589 corpus/Plane.index
   10000 corpus/Plastic.index
    9449 corpus/Plate.index
    1489 corpus/Platypus.index
    9739 corpus/Play.index
    9416 corpus/Plot.index
    9859 corpus/Point.index
   10000 corpus/Poison.index
    9548 corpus/Pole.index
   10000 corpus/Police.index
    9637 corpus/Pool.index
   10000 corpus/Port.index
    9635 corpus/Post.index
    9277 corpus/Pound.index
    9693 corpus/Press.index
   10000 corpus/Princess.index
    8082 corpus/Pumpkin.index
    9698 corpus/Pupil.index
   10000 corpus/Pyramid.index
    9744 corpus/Queen.index
   10000 corpus/Rabbit.index
    3835 corpus/Racket.index
    9721 corpus/Ray.index
   10000 corpus/Revolution.index
    9566 corpus/Ring.index
    9574 corpus/Robin.index
   10000 corpus/Robot.index
    9699 corpus/Rock.index
   10000 corpus/Rome.index
   10000 corpus/Root.index
   10000 corpus/Rose.index
    3678 corpus/Roulette.index
    9388 corpus/Round.index
    9508 corpus/Row.index
    9788 corpus/Ruler.index
   10000 corpus/Satellite.index
   10000 corpus/Saturn.index
    9420 corpus/Scale.index
   10000 corpus/School.index
   10000 corpus/Scientist.index
   10000 corpus/Scorpion.index
    9193 corpus/Screen.index
    1122 corpus/Scuba diver.index
    9512 corpus/Seal.index
    8680 corpus/Server.index
    9813 corpus/Shadow.index
   10000 corpus/Shakespeare.index
   10000 corpus/Shark.index
   10000 corpus/Ship.index
   10000 corpus/Shoe.index
    9658 corpus/Shop.index
    8904 corpus/Shot.index
    9591 corpus/Sink.index
   10000 corpus/Skyscraper.index
    9417 corpus/Slip.index
   10000 corpus/Slug.index
    9669 corpus/Smuggler.index
   10000 corpus/Snow.index
    2612 corpus/Snowman.index
    9959 corpus/Sock.index
   10000 corpus/Soldier.index
   10000 corpus/Soul.index
   10000 corpus/Sound.index
   10000 corpus/Space.index
    9518 corpus/Spell.index
   10000 corpus/Spider.index
    9675 corpus/Spike.index
    9570 corpus/Spine.index
    9298 corpus/Spot.index
    9767 corpus/Spring.index
   10000 corpus/Spy.index
   10000 corpus/Square.index
   10000 corpus/Stadium.index
    9574 corpus/Staff.index
   10000 corpus/Star.index
    9601 corpus/State.index
    9612 corpus/Stick.index
   10000 corpus/Stock.index
    9907 corpus/Straw.index
   10000 corpus/Stream.index
    9646 corpus/Strike.index
    9358 corpus/String.index
    9617 corpus/Sub.index
    9572 corpus/Suit.index
   10000 corpus/Superhero.index
    9522 corpus/Swing.index
    9906 corpus/Switch.index
    9664 corpus/Table.index
    9601 corpus/Tablet.index
    9586 corpus/Tag.index
   10000 corpus/Tail.index
    9533 corpus/Tap.index
   10000 corpus/Teacher.index
    9989 corpus/Telescope.index
   10000 corpus/Temple.index
   10000 corpus/Theater.index
    9807 corpus/Thief.index
    9830 corpus/Thumb.index
    7516 corpus/Tick.index
    9505 corpus/Tie.index
   10000 corpus/Time.index
   10000 corpus/Tokyo.index
   10000 corpus/Tooth.index
    9871 corpus/Torch.index
   10000 corpus/Tower.index
    9645 corpus/Track.index
   10000 corpus/Train.index
   10000 corpus/Triangle.index
    9450 corpus/Trip.index
    9304 corpus/Trunk.index
    9533 corpus/Tube.index
   10000 corpus/Turkey.index
    3014 corpus/Undertaker.index
    8251 corpus/Unicorn.index
    9905 corpus/Vacuum.index
   10000 corpus/Van.index
    9699 corpus/Vet.index
    9836 corpus/Wake.index
    9945 corpus/Wall.index
   10000 corpus/War.index
    2231 corpus/Washer.index
    9834 corpus/Washington.index
   10000 corpus/Watch.index
   10000 corpus/Water.index
   10000 corpus/Wave.index
    9751 corpus/Web.index
    9708 corpus/Well.index
   10000 corpus/Whale.index
    9803 corpus/Whip.index
   10000 corpus/Wind.index
   10000 corpus/Witch.index
   10000 corpus/Worm.index
   10000 corpus/Yard.index
 3794722 total
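
Since the download step is limited by network IO, a process pool parallelizes it well. A minimal sketch, in which fetch_article is a hypothetical helper and the file layout is assumed:

import gzip
import multiprocessing

def fetch_article(title):
    # Hypothetical helper: download one article and return its plain text.
    raise NotImplementedError

def fetch_all(index_file, out_file, nworkers=8):
    with open(index_file) as f:
        titles = [line.strip() for line in f]
    # The work is network-bound, so more workers than CPU cores is reasonable.
    pool = multiprocessing.Pool(nworkers)
    with gzip.open(out_file, 'wt') as out:
        for text in pool.imap_unordered(fetch_article, titles):
            out.write(text + '\n')
    pool.close()
    pool.join()

Because the bottleneck is IO rather than CPU, the thread-based multiprocessing.dummy.Pool would be a drop-in alternative that avoids pickling issues.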
dkirkby commented 7 years ago

For reference, the original corpus had ~29M raw words in ~1.3M sentences (see the training log further down), or roughly 0.5M characters per word of the word list.

The practical limit is probably the memory requirement for training. In order to increase the corpus size by a factor of ~10, we should aim for ~5M characters per word in the new corpus.

The following shows the number of articles required to reach 1M characters for the first 10 words:

WORD        NARTICLES     NCHARS
Africa             95    1088327
Agent             213    1002219
Air               154    1015527
Alien             146    1000752
Alps              179    1001990
Amazon            366    1080288
Ambulance         105    1025889
America           132    1027595
Angel             126    1045920
Antarctica        227    1203372

This indicates that 500-2,000 articles per word will be required to build the new corpus. Ideally, we would draw a fixed-size random sub-sample from all the available articles, but this is not really practical since we cannot know an article's length until it has been downloaded. Instead, we can iterate through the articles in a random order, accumulating text until we reach 5M characters.

dkirkby commented 7 years ago

I ran the new fetch_corpus_text.py script to convert each corpus/Word.index file into a corresponding corpus/Word.txt.gz file that contains the plain (unicode) text of a random subset of the indexed articles in order to reach ~5M characters of text.
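
A minimal sketch of the random-subset logic described above (names are illustrative, not the actual fetch_corpus_text.py; fetch_article is the same hypothetical helper as in the earlier sketch):

import gzip
import random

def fetch_article(title):
    # Hypothetical helper: download one article and return its plain text.
    raise NotImplementedError

def sample_text(index_file, out_file, max_chars=5000000, seed=123):
    with open(index_file) as f:
        titles = [line.strip() for line in f]
    # A random iteration order avoids needing article lengths up front.
    random.Random(seed).shuffle(titles)
    nchars = 0
    with gzip.open(out_file, 'wt') as out:
        for title in titles:
            text = fetch_article(title)
            out.write(text + '\n')
            nchars += len(text)
            if nchars >= max_chars:
                break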

The next step is to preprocess each text file so it can be fed directly into word2vec:

- convert everything to lowercase,
- split the text into one sentence per line,
- strip punctuation and other non-word characters,
- replace multi-word entries from the word list with single tokens (e.g. loch ness -> loch_ness, scuba diver -> scuba_diver).

Most of this is already done by merge_corpus.py, but it will need some minor updates (a rough sketch of this kind of preprocessing follows).
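
A rough sketch of the preprocessing steps listed above; the exact rules in merge_corpus.py may differ, and the MULTIWORD table shows only the four multi-word entries visible in the index listing:

import gzip
import re

# Multi-word entries are joined with underscores so word2vec sees one token.
MULTIWORD = {'ice cream': 'ice_cream', 'loch ness': 'loch_ness',
             'new york': 'new_york', 'scuba diver': 'scuba_diver'}

def preprocess(in_file, out_file):
    with gzip.open(in_file, 'rt') as f:
        text = f.read().lower()
    for phrase, token in MULTIWORD.items():
        text = text.replace(phrase, token)
    with gzip.open(out_file, 'wt') as out:
        # Crude sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r'(?<=[.!?])\s+', text):
            words = re.findall(r'[a-z_]+', sentence)
            if words:
                out.write(' '.join(words) + '\n')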

dkirkby commented 7 years ago

Preprocessing is finished now and saves some statistics on each word:

WORD         TOTFREQ    XFREQ    NSENT    NWORD
time          375916   371703    31976   734285
state         235078   222082    33116   775979
school        233885   215214    32939   758983
game          227487   216503    36847   808653
film          224783   214068    33813   807208
part          224584   218982    33322   780953
well          203564   200311    35424   780812
war           196840   191488    36427   785049
air           159591   145649    34186   769514
back          152262   147923    37028   808733
...
racket           800      434    36053   792250
loch_ness        781      308    35429   792689
platypus         763      364    31681   766810
sock             750      612    36137   815270
washer           707      431    36867   798022
vet              628      478    36824   789117
leprechaun       595      282    35303   812171
mug              552      389    35017   810713
smuggler         494      353    36358   799093
scuba_diver      278      102    34637   796666

The columns are:

- WORD: the (preprocessed) word from the word list,
- TOTFREQ: total occurrences of the word in the full corpus,
- XFREQ: occurrences in articles selected for other words (cross frequency),
- NSENT: number of sentences in this word's preprocessed file,
- NWORD: number of words in this word's preprocessed file.

Most words appear often in the articles selected for other words, which is good since this is how the embedding learns the relationship between these words. Unfortunately, words at the bottom of the list are quite rare even with this 10x expanded corpus.

dkirkby commented 7 years ago

The next step is to update the learning script to work with these new preprocessed files (Word.pre.gz) instead of the earlier single randomized corpus.txt.gz. The changes are to read and concatenate the per-word files, then randomize the order of the combined sentences before training.

The learning parameters can probably stay the same. In particular, embedding into a 300-dimensional space still seems like a good choice.

The number of passes through the corpus can be reduced ~10x to account for the ~10x increase in corpus size.
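
A minimal sketch of a gensim call consistent with these choices and with the parameters visible in the training logs below. The corpus path is a placeholder, and the argument names size/iter are the 2017-era gensim spelling (renamed vector_size/epochs in gensim 4):

import gensim

# Stream sentences (one per line, whitespace-delimited) from the merged corpus.
sentences = gensim.models.word2vec.LineSentence('corpus/merged.txt')
model = gensim.models.Word2Vec(
    sentences,
    size=300,       # embedding dimension
    sg=1,           # skip-gram
    hs=1,           # hierarchical softmax (the huffman tree in the logs)
    negative=5,
    sample=0.001,
    window=10,
    min_count=150,
    workers=4,
    iter=10)        # ~10x fewer passes than on the old corpus
model.save('word2vec.model')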

dkirkby commented 7 years ago

The new corpus turns out to be too big to shuffle in memory, so I am using a partial shuffle instead, which randomizes the line order within a limited window rather than globally.

This still takes ~8 mins but does not need much memory and is fast enough.
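
One way to implement such a partial shuffle is a fixed-size buffer that is filled sequentially and drained at random; a minimal sketch (the buffer size is arbitrary, and this may differ from the script's actual method):

import random

def partial_shuffle(lines, buffer_size=1000000, seed=123):
    # Each output line is chosen at random from the buffer, so lines can move
    # far from their original position while memory stays bounded by buffer_size.
    rng = random.Random(seed)
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) >= buffer_size:
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            yield buffer.pop()
    rng.shuffle(buffer)
    for line in buffer:
        yield line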

dkirkby commented 7 years ago

Here are the word2vec stats for training with the new corpus:

2017-01-13 11:22:51,997 : INFO : collected 3515152 word types from a corpus of 313247904 raw words and 14208361 sentences
2017-01-13 11:22:51,997 : INFO : Loading a fresh vocabulary
2017-01-13 11:22:56,080 : INFO : min_count=100 retains 81515 unique words (2% of original 3515152, drops 3433637)
2017-01-13 11:22:56,080 : INFO : min_count=100 leaves 298768277 word corpus (95% of original 313247904, drops 14479627)
2017-01-13 11:22:56,309 : INFO : deleting the raw counts dictionary of 3515152 items
2017-01-13 11:22:57,145 : INFO : sample=0.001 downsamples 27 most-common words
2017-01-13 11:22:57,145 : INFO : downsampling leaves estimated 235025153 word corpus (78.7% of prior 298768277)
2017-01-13 11:22:57,145 : INFO : estimated required memory for 81515 words and 300 dimensions: 350514500 bytes
2017-01-13 11:22:57,298 : INFO : constructing a huffman tree from 81515 words
2017-01-13 11:23:00,498 : INFO : built huffman tree with maximum node depth 22
2017-01-13 11:23:00,708 : INFO : resetting layer weights
2017-01-13 11:23:01,939 : INFO : training model with 4 workers on 81515 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10

The vocab size here (~82K) might be too large, so try again with min_count=150:

2017-01-13 11:41:55,210 : INFO : collected 3515152 word types from a corpus of 313247904 raw words and 14208361 sentences
2017-01-13 11:41:55,211 : INFO : Loading a fresh vocabulary
2017-01-13 11:41:59,127 : INFO : min_count=150 retains 62796 unique words (1% of original 3515152, drops 3452356)
2017-01-13 11:41:59,127 : INFO : min_count=150 leaves 296490234 word corpus (94% of original 313247904, drops 16757670)
2017-01-13 11:41:59,297 : INFO : deleting the raw counts dictionary of 3515152 items
2017-01-13 11:42:00,126 : INFO : sample=0.001 downsamples 27 most-common words
2017-01-13 11:42:00,126 : INFO : downsampling leaves estimated 232590765 word corpus (78.4% of prior 296490234)
2017-01-13 11:42:00,126 : INFO : estimated required memory for 62796 words and 300 dimensions: 270022800 bytes
2017-01-13 11:42:00,252 : INFO : constructing a huffman tree from 62796 words
2017-01-13 11:42:02,702 : INFO : built huffman tree with maximum node depth 21
2017-01-13 11:42:02,870 : INFO : resetting layer weights
2017-01-13 11:42:03,871 : INFO : training model with 4 workers on 62796 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10

For comparison, the training on the original small corpus used:

2017-01-01 18:59:28,025 : INFO : collected 552467 word types from a corpus of 29323417 raw words and 1309803 sentences
2017-01-01 18:59:28,025 : INFO : Loading a fresh vocabulary
2017-01-01 18:59:28,468 : INFO : min_count=45 retains 28701 unique words (5% of original 552467, drops 523766)
2017-01-01 18:59:28,468 : INFO : min_count=45 leaves 27486765 word corpus (93% of original 29323417, drops 1836652)
2017-01-01 18:59:28,684 : INFO : deleting the raw counts dictionary of 552467 items
2017-01-01 18:59:28,829 : INFO : sample=0.001 downsamples 30 most-common words
2017-01-01 18:59:28,829 : INFO : downsampling leaves estimated 21256403 word corpus (77.3% of prior 27486765)
2017-01-01 18:59:28,830 : INFO : estimated required memory for 28701 words and 300 dimensions: 123414300 bytes
2017-01-01 18:59:28,901 : INFO : constructing a huffman tree from 28701 words
2017-01-01 18:59:30,174 : INFO : built huffman tree with maximum node depth 19
2017-01-01 18:59:30,236 : INFO : resetting layer weights
2017-01-01 18:59:30,830 : INFO : training model with 12 workers on 28701 vocabulary and 300 features, using sg=1 hs=1 sample=0.001 negative=5 window=10

To summarize: the new corpus has ~10.7x more raw words (313M vs 29M) and ~10.8x more sentences than the old one; raising min_count from 45 to 150 still leaves a vocabulary of 62,796 unique words (vs 28,701 before); and the estimated training memory grows from ~123MB to ~270MB.

dkirkby commented 7 years ago

For comparison, this article claims that the complete Encyclopedia Britannica contains about 44 million words.

At face value, this says the new corpus (~313M words) is ~7x the size of the Encyclopedia Britannica!

dkirkby commented 7 years ago

The new corpus is drawn from an index of ~3.8M articles, compared with a total of ~5.4M articles in the entire English wikipedia! Only a subset of the indexed articles is actually used in the final corpus, but this indicates that some articles must appear more than once across the index files, which I didn't account for.
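
A quick way to measure the overlap, assuming the index files hold one article title per line:

import glob

titles = []
for name in glob.glob('corpus/*.index'):
    with open(name) as f:
        titles.extend(line.strip() for line in f)
print('total entries:', len(titles))
print('unique titles:', len(set(titles)))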

dkirkby commented 7 years ago

Results from evaluate.py after 10 epochs (comparable to 100 epochs on the old corpus):

0.972 MARCH = april
0.931 FLUTE = clarinet
0.910 PIANO = violin
0.891 GOLD = silver
0.867 PISTOL = semi-automatic
0.852 CHOCOLATE = caramel
0.848 PANTS = trousers
0.845 MISSILE = surface-to-air
0.843 BERLIN = munich
0.840 TOKYO = osaka
0.837 DEGREE = bachelor
0.830 WHALE = humpback
0.828 CHURCH = episcopal
0.828 KETCHUP = mayonnaise
0.824 COURT = supreme
0.823 SERVER = client
0.823 THUMB = finger
0.821 JUPITER = neptune
0.814 GERMANY = austria
0.814 DISEASE = infection

0.849 PIANO + FLUTE = cello
0.805 PANTS + DRESS = trousers
0.755 LEMON + CHOCOLATE = vanilla
0.751 GERMANY + FRANCE = belgium
0.750 HORSESHOE + BAT = rhinolophus
0.729 STRING + PIANO = quartet
0.723 ICE_CREAM + CHOCOLATE = candy
0.719 PASTE + KETCHUP = garlic
0.718 WEB + SERVER = browser
0.709 TURKEY + GREECE = cyprus
0.707 HOTEL + CASINO = resort
0.703 ORGAN + FLUTE = harpsichord
0.703 PIANO + ORGAN = harpsichord
0.699 RABBIT + DOG = cat
0.696 PIANO + HORN = flute
0.690 STRING + FLUTE = violin
0.686 SCHOOL + DEGREE = graduate
0.679 HORN + FLUTE = trumpet
0.678 MOON + JUPITER = venus
0.672 GERMANY + CZECH = poland

Compare with the results for the old corpus in #9.

The new embedding looks good overall, and it is less obsessed with wrestling than the old one, but I wonder how many people would get the rhinolophus clue.