UChicago-Computational-Content-Analysis / Frequently-Asked-Questions


Meaning of .findAll in hw examples #7

Open chentian418 opened 2 years ago

chentian418 commented 2 years ago

Hi, in the second example there is this line of code: tagLinks = pTag.findAll('a', href=re.compile('/wiki/'), class_=False)

I want to confirm that this line finds the elements that start with <a and whose href contains '/wiki/'; for example: <a href="/wiki/Mass_communication" title="Mass communication">mass communication</a>

However, when I use pTag.findAll('a', href=re.compile('https://www.nytimes.com'), class_=False) with no base url to extract <a class="css-1g7m0tk" href="https://www.nytimes.com/2021/07/23/technology/silicon-valleys-pandemic-profits.html" title="">, it doesn't return anything.

Would you mind explaining what this code means and what is causing my problem? Thank you!

JunsolKim commented 2 years ago

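You are correct. pTag.findAll('a', href=re.compile('/wiki/'), class_=False) returns a list of tags (Tag objects rather than plain strings) that (1) are <a> elements, (2) have an href attribute containing the substring /wiki/, and (3) have no class attribute, which is what class_=False asks for.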

If you use pTag.findAll('a', href=re.compile('https://www.nytimes.com'), class_=False), this returns the <a> tags whose href contains the substring https://www.nytimes.com (e.g., href="https://www.nytimes.com/2021/07/23/technology/silicon-valleys-pandemic-profits.html") and that have no class attribute. Note that the link in your example has class="css-1g7m0tk", so class_=False filters it out; dropping class_=False should let it match.
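To see how the filters in these calls interact, here is a minimal sketch on a hypothetical two-link snippet assembled from the links quoted in this thread (the surrounding <p> and the link text "story" are assumptions for illustration). It uses find_all, the current name for findAll. In Beautiful Soup, class_=False keeps only tags that have no class attribute, and href=re.compile(...) matches when the pattern is found anywhere in the href.

```python
import re
import bs4

# Hypothetical snippet built from the two links quoted in this thread.
html = (
    '<p>'
    '<a href="/wiki/Mass_communication" title="Mass communication">mass communication</a>'
    '<a class="css-1g7m0tk" '
    'href="https://www.nytimes.com/2021/07/23/technology/silicon-valleys-pandemic-profits.html" '
    'title="">story</a>'
    '</p>'
)
pTag = bs4.BeautifulSoup(html, 'html.parser').p

# <a> tags whose href contains "/wiki/" and that carry no class attribute.
wiki_links = pTag.find_all('a', href=re.compile('/wiki/'), class_=False)
wiki_hrefs = [a['href'] for a in wiki_links]

# Same search for the NYT link, with and without the class filter.
nyt_with_filter = pTag.find_all('a', href=re.compile('https://www.nytimes.com'), class_=False)
nyt_without_filter = pTag.find_all('a', href=re.compile('https://www.nytimes.com'))

print(wiki_hrefs)
print(len(nyt_with_filter), len(nyt_without_filter))
```

The results are Tag objects, so attributes such as href are read with a['href'] or a.get('href').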

If your code does not return any results, make sure that pTag actually contains the links you are looking for. Could you try this code:

import re  # needed for re.compile below
import bs4
import requests
url = "address of webpage that includes <a class..."
req = requests.get(url)
soup = bs4.BeautifulSoup(req.text, 'html.parser')
# Search the whole page instead of a single pTag
print(soup.findAll('a', href=re.compile('https://www.nytimes.com'), class_=False))
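When a search like this comes back empty, it can help to list every link that matches the href pattern together with its class attribute, so you can see whether a class filter is what excludes it. The sketch below is one way to do that; links_with_classes is a hypothetical helper name, and the sample string is the tag quoted in the question, closed with </a> so it parses on its own.

```python
import re
import bs4

def links_with_classes(html, pattern):
    """Return (href, class list or None) for each <a> whose href matches pattern."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    return [
        (a['href'], a.get('class'))
        for a in soup.find_all('a', href=re.compile(pattern))
    ]

# The tag quoted in the question.
sample = (
    '<a class="css-1g7m0tk" '
    'href="https://www.nytimes.com/2021/07/23/technology/silicon-valleys-pandemic-profits.html" '
    'title=""></a>'
)
report = links_with_classes(sample, 'https://www.nytimes.com')
for href, classes in report:
    print(href, classes)
```

A link reported with a non-empty class list would be skipped by a search that also passes class_=False.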