Ch. 6. Word Vector Embeddings - Python 2.7

arturovivas commented 7 years ago

Hi Guys,

I've been working through the book over the last few days and have now reached Ch. 6. I am struggling to get the code from the Word Vector Embeddings section to work with Python 2.7.

So far I have changed a few things:

All occurrences of bz2.open() replaced with bz2.BZ2File()
from urllib.request import urlopen replaced with from urllib import urlopen
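
For readers attempting the same port, the two substitutions above can be sketched as a small compatibility shim. This is only an illustration, not code from the book: the helper name open_bz2 is hypothetical, and it assumes that handling the archive as raw bytes is acceptable.

```python
import bz2

try:
    # Python 3 location of urlopen
    from urllib.request import urlopen
except ImportError:
    # Python 2.7 location of urlopen
    from urllib import urlopen


def open_bz2(path, mode='r'):
    """Hypothetical helper: open a .bz2 file on Python 2.7 or 3.

    bz2.open() only exists on Python 3, while bz2.BZ2File exists on both.
    The file object reads and writes bytes, so any decoding to text is
    left to the caller.
    """
    return bz2.BZ2File(path, mode)
```

Note that bz2.open() on Python 3 can return decoded text in 'rt' mode, whereas bz2.BZ2File always yields bytes, so code written against the text mode may need an explicit .decode() step after this change.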

Now I am stuck on the _read_pages(self, url) method. This is the error that I get:

Read pages

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-45449cf53ecf> in <module>()
    146     'enwiki-20161120-pages-meta-current1.xml-p000000010p000030303.bz2',
    147     './',
--> 148     params.vocabulary_size)
    149 
    150 # def skipgrams(pages, max_context)

<ipython-input-6-45449cf53ecf> in __init__(self, url, cache_dir, vocabulary_size)
      9         if not os.path.isfile(self._pages_path):
     10             print('Read pages')
---> 11             self._read_pages(url)
     12         if not os.path.isfile(self._vocabulary_path):
     13             print('Build vocabulaty')

<ipython-input-6-45449cf53ecf> in _read_pages(self, url)
     46                     continue
     47                 page = element.findtext('./{*}revision/{*}test')
---> 48                 words = self._tokenize(page)
     49                 pages.write(''.join(words) + '\n')
     50                 element.clear()

<ipython-input-6-45449cf53ecf> in _tokenize(cls, page)
     55         # *ERROR expected string or buffer
     56 
---> 57         words = cls.TOKEN_REGEX.findall(page)
     58 
     59         words = [x.lower() for x in words]

TypeError: expected string or buffer

I copied the code to my repository. Thanks for helping out!

samjabrahams commented 7 years ago

There's a typo in your code: the word "text" is misspelled as "test":

page = element.findtext('./{*}revision/{*}test')

Should be

page = element.findtext('./{*}revision/{*}text')

Let me know if that change works.
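
A minimal sketch of why that one-letter typo surfaces as a TypeError in _tokenize: findtext() returns None when no element matches the path, and a compiled regex cannot search None. The sample XML and the TOKEN_REGEX pattern below are assumptions made for illustration, and the sketch uses the standard library's ElementTree without the {*} namespace wildcard from the thread.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tokenizer pattern, for illustration only.
TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

# Made-up, namespace-free stand-in for one page of the dump.
page_xml = '<page><revision><text>Hello, world.</text></revision></page>'
element = ET.fromstring(page_xml)

# The misspelled path matches nothing, so findtext() returns None.
missing = element.findtext('./revision/test')

# The corrected path returns the text content of the element.
found = element.findtext('./revision/text')

# Passing None to findall() raises TypeError
# ("expected string or buffer" on Python 2.7).
try:
    TOKEN_REGEX.findall(missing)
    error = None
except TypeError as exc:
    error = exc

# With the correct path, tokenizing works as intended.
words = [w.lower() for w in TOKEN_REGEX.findall(found)]
```

Here missing is None, error holds the TypeError, and words contains the lowercased tokens from the page text.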

samjabrahams commented 7 years ago

You'll also need to import random at the top for the skipgrams function.
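
To show where the random import comes in, here is a rough sketch of a skip-gram pair generator that draws a random context width for each word. It assumes the skipgrams(pages, max_context) signature visible in the traceback comment; the body is an illustration and may differ from the book's implementation.

```python
import random


def skipgrams(pages, max_context):
    """Yield (current, target) pairs from an iterable of word lists.

    For every word a context width between 1 and max_context is drawn
    at random, which is why the random module must be imported.
    """
    for words in pages:
        for index, current in enumerate(words):
            context = random.randint(1, max_context)
            # Words to the left of the current word.
            for target in words[max(0, index - context):index]:
                yield current, target
            # Words to the right of the current word.
            for target in words[index + 1:index + context + 1]:
                yield current, target
```

Without the import, the generator would only fail with a NameError once it is first iterated, not when the function is defined, which can make the missing import easy to overlook.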

samjabrahams commented 7 years ago

Hi @arturovivas - could you confirm whether this worked or not? If so, I can close the issue.

arturovivas commented 7 years ago

Hi @samjabrahams, thanks for the reply! I checked it today and it works. I can confirm that this version now runs on Python 2.7. I also made some minor changes. I will upload the code to my repository tomorrow in case you want to copy it and create a folder for the Python 2.7 version in the official book repository. My goal is to finish your book using Python 2.7, so I will keep uploading the code for the other chapters as I finish them.

samjabrahams commented 7 years ago

Great, glad that this worked out for you. Thanks for keeping us updated with your code!

backstopmedia / tensorflowbook

Ch. 6. Word Vector Embeddings - Python 2.7 #17