barrust / mediawiki

MediaWiki API wrapper in python http://pymediawiki.readthedocs.io/en/latest/
MIT License

Support references scraping link title with URL? #33

Closed: vesche closed this issue 7 years ago

vesche commented 7 years ago

I'm working on a tool where I want to scrape the common "Official Website" link from the "External links" section that appears in almost every company and organization article (for example). According to the docs, references returns the links that appear in the "External links" section. My problem is that I cannot easily tell which link is the "Official Website", since it returns one giant list of URLs.

Perhaps references could return a dictionary containing both the URL and the name of the link?

Something like this:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.references
{ ... "Official Website": "https://www.mcdonalds.com" ... }
>>> page.references["Official Website"]
"https://www.mcdonalds.com"

I'm submitting this issue here as it seems to be the most up-to-date and active Python MediaWiki wrapper. Thanks for your work on this. I'll look into whether I can add this feature myself; however, I'm sure you are more familiar with the source code and may have a better solution to this problem. Thanks.

Edit: Hmm, after some digging it looks like the MediaWiki API just doesn't support returning the link title. I hope I can solve this problem without having to run regex or BeautifulSoup over page.html or something.
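
(For illustration, here is a minimal check against the raw API, assuming the standard extlinks query module; it returns bare URLs with no anchor text, which is the root of the problem:)

import requests

# Query the raw MediaWiki API's extlinks module directly
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extlinks",
        "titles": "McDonald's",
        "ellimit": "max",
        "format": "json",
    },
)
for page_data in resp.json()["query"]["pages"].values():
    for link in page_data.get("extlinks", []):
        print(link["*"])  # each entry is just {'*': '<url>'}, no link title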

barrust commented 7 years ago

@vesche You are correct that this is not currently supported by the MediaWiki API. There have been times when I, too, would have liked to know the title of a link as shown on the page.

I think this is feasible, but it will probably require using BeautifulSoup on the html property. I think this would be a great addition to the project. This is a hobby project, so it may take some time to come up with a solution, but I will definitely put it on the backlog! If you come up with a solution, I would love to incorporate it!
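
(For the record, a rough sketch of what that BeautifulSoup approach might look like, assuming Wikipedia's usual markup where section headings carry an id such as External_links; untested, just the general shape:)

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

page = MediaWiki().page("McDonald's")
soup = BeautifulSoup(page.html, 'html.parser')

links = []
# Wikipedia section headings carry an id, e.g. <span id="External_links">
anchor = soup.find(id='External_links')
if anchor is not None:
    # the id may sit on the <h2> itself or on a <span> inside it
    heading = anchor if anchor.name in ('h2', 'h3') else anchor.find_parent(['h2', 'h3'])
    # collect every link between this heading and the next section heading
    for sibling in heading.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break
        for a in sibling.find_all('a', href=True):
            links.append((a.get_text(), a['href']))
print(links)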

Thanks!

vesche commented 7 years ago

Thanks for the reply. I ended up solving my problem with the snippet below, which isn't horrible, I suppose. I'm going to close this as it seems to be a limitation of the MediaWiki API and not your wrapper.

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

wikipedia = MediaWiki()
page = wikipedia.page("McDonald's")

soup = BeautifulSoup(page.html, 'html.parser')

# common link texts Wikipedia articles use for the official site
url_titles = ['Official website', 'Official site', 'Corporate website']
for link in soup.find_all('a', href=True):
    for t in url_titles:
        # match the link text as-is or in title case ('Official Website')
        if link.string in (t, t.title()):
            print(link['href'])
barrust commented 7 years ago

@vesche I decided that, even though this is not part of the API, I am already parsing the html and content for other functions, so this seemed like another good addition to the library. As of PR #34 you can use parse_section_links(section) to get all the links from the external links section. It returns a list of (link text, URL) tuples, in the order the links appear in the html markup of the desired section.

So to pull the external links you can use:

>>> from mediawiki import MediaWiki
>>> wikipedia = MediaWiki()
>>> page = wikipedia.page("McDonald's")
>>> page.parse_section_links('External Links')
[('Official Website', 'https://www.mcdonalds.com'), ...]
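
From there, getting the official site is just a dict lookup over the returned tuples, which is more or less the interface proposed at the top of this issue (the exact link text varies by article):

>>> links = dict(page.parse_section_links('External Links'))
>>> links.get('Official Website')
'https://www.mcdonalds.com'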
vesche commented 7 years ago

Awesome! Thanks so much for this addition; I'll be using it for an OSINT tool I'm writing. Your style is very clean, you're a beast, man.

barrust commented 7 years ago

You are welcome! If you run into other things that you think would be good additions, let me know! I am just happy to hear that my little side project is useful!

barrust commented 7 years ago

Also, I pushed this code to PyPI, so you can upgrade from 0.3.14 to 0.3.15.
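
(The usual upgrade command, assuming the PyPI package name pymediawiki, per the docs URL above:)

pip install --upgrade pymediawiki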