REMitchell / python-scraping

Code samples from the book Web Scraping with Python http://shop.oreilly.com/product/0636920034391.do

Question in ch2 #76

Open shufanzhang opened 5 years ago

shufanzhang commented 5 years ago

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html, "html.parser")
nameList = bs.find_all(text='the prince')
print(len(nameList))

I ran the code above and the result is 7. However, when I use Ctrl+F to search for 'the prince' in the browser, I get 11 matches. Why are the two results inconsistent?

Proteusiq commented 5 years ago

That is because of casing. You captured only 'the prince' and left out 'The prince' :) I got 11 with a similar approach using requests. You can also just pass find_prince as the text argument in your original code and it will work too.

import re

import requests
from bs4 import BeautifulSoup

URL = "http://www.pythonscraping.com/pages/warandpeace.html"

# Case-insensitive pattern, so both 'the prince' and 'The prince' match
find_prince = re.compile(r'the prince', re.IGNORECASE)

s = requests.Session()
r = s.get(URL)

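# html5lib is a separate install (pip install html5lib)
soup = BeautifulSoup(r.content, 'html5lib')

# With a compiled regex, find_all returns every text node the pattern is found in
prince_found = soup.find_all(text=find_prince)

print(len(prince_found))  # 11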
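
To make the difference between the two calls easier to see without hitting the network, here is a minimal sketch on a small inline document. The `SAMPLE_HTML` markup is a made-up stand-in, not the real warandpeace.html page, so the counts below are for this sample only. It uses `string=`, which is the newer name for the `text=` argument in Beautiful Soup 4.4.0 and later. A plain string is compared against the whole text node, while a compiled regex is applied with `search()`, so it also matches nodes that merely contain the phrase.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature page (an assumption for illustration only).
SAMPLE_HTML = (
    '<p><span class="green">the prince</span> spoke first. '
    '<span class="green">The prince</span> then left, '
    'and the prince said that the prince was tired.</p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A plain string matches only text nodes that are exactly equal to it.
exact_nodes = soup.find_all(string="the prince")

# A compiled regex matches any text node in which the pattern is found.
pattern = re.compile(r"the prince", re.IGNORECASE)
regex_nodes = soup.find_all(string=pattern)

# Counting occurrences in the flattened text is closer to what Ctrl+F does.
occurrences = pattern.findall(soup.get_text())

print(len(exact_nodes))   # text nodes equal to 'the prince'
print(len(regex_nodes))   # text nodes containing the phrase, any casing
print(len(occurrences))   # individual occurrences in the page text
```

Note that `find_all` counts text nodes, not occurrences: the last text node in this sample contains the phrase twice but is counted once. If the goal is to reproduce a browser search, counting regex matches over `get_text()` is probably the closer equivalent.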