Jlebours / WikipediaExtractor_Python


Improvement of the extractor methods. #5

Open Jlebours opened 3 years ago

Jlebours commented 3 years ago

At the start we were only able to extract tables from this page: https://en.wikipedia.org/wiki/Comparison_(grammar). The tables are simple, with a first row of <th> cells and the other rows of <td> cells. To do it, we used these two methods to collect the headers and the rows of the tables:

def get_Table_Headers(table):
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_Table_Rows(table):
    """ Get all row for table content of `url` """
    rows = []
    for tr in table.find_all("tr")[1:]:
        print(tr.find_all)
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags
            # can be found especially in wikipedia tables below the table
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
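
For context, here is roughly how these two methods are fed (a sketch assuming the same fetching code that appears later in this thread: urllib + BeautifulSoup, selecting the tables with the wikitable class):

import urllib.request as u_req

from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Comparison_(grammar)"
html = u_req.urlopen(url).read().decode("utf-8")
bs = BeautifulSoup(html, "lxml")

# apply the two methods to every wikitable on the page
for table in bs.find_all("table", {"class": "wikitable"}):
    headers = get_Table_Headers(table)
    rows = get_Table_Rows(table)
    print(headers)
    print(rows[:2])  # first two data rows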

The new goal was to extract tables from: https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido. The extractor must then handle tables with both <td> and <th> cells in each <tr> for the first table of the page, and it must also handle tables where <th> cells appear not only in the first <tr>, for the second table.

Jlebours commented 3 years ago

I wrote two new, shorter methods to replace the ones in the issue description; they work for the tables of both the new page and the first page:

def get_headers(table):
    headers = []
    for column in table.find("tr").find_all(['td', 'th']):
        headers.append(column.text.strip())
    return headers

def get_rows(table):
    rows = []
    for row in table.find_all('tr')[1:]:
        rows.append([val.text.strip() for val in row.find_all(['td', 'th'])])
    return rows
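
To turn those two lists into a CSV, a minimal sketch could look like this (write_table_csv is a hypothetical helper for illustration, not the project's actual CSV code):

import csv

def write_table_csv(table, filename):
    # hypothetical helper: dump one wikitable to a CSV file
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(get_headers(table))
        writer.writerows(get_rows(table))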

Here is the CSV that I extract with these methods, for the first table: (screenshot). For the second table: (screenshot).

Jlebours commented 3 years ago

I'm now trying to extract tables from this page: https://en.wikipedia.org/wiki/Comparison_between_Ido_and_Interlingua. There are 8 tables; the extractor works for the first and the second, but it fails on the third, I think because of colspan attributes: (screenshot). We must find a solution.

Jlebours commented 3 years ago

When we check how the CSV of this table is modeled in the Java project, we see a table with an empty column for the colspan: (screenshot).
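
For reference, one manual way to reproduce that behaviour with BeautifulSoup would be to pad each cell with empty strings according to its colspan (just a sketch of the idea, with a hypothetical helper; not the solution adopted below):

def get_row_with_colspan(tr):
    # hypothetical variant of a row getter for a single <tr>: a cell with colspan=n
    # becomes the cell text followed by n-1 empty columns, like the Java project's CSV
    cells = []
    for cell in tr.find_all(["td", "th"]):
        cells.append(cell.text.strip())
        cells.extend([""] * (int(cell.get("colspan", 1)) - 1))
    return cells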

Jlebours commented 3 years ago

I found a solution which extracts any table; it uses the function pandas.read_html(). I wrote a test script, shown below with the results. I tried it on the page: https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido

import os
import urllib.request as u_req

import pandas as pd
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido'
html = u_req.urlopen(url).read().decode("utf-8")
bs = BeautifulSoup(html, 'lxml')
# collect every wikitable on the page and let pandas parse them all at once
tables = str(bs.find_all('table', {'class': 'wikitable'}))
dfs = pd.read_html(tables)

outdir = './output'
if not os.path.exists(outdir):
    os.mkdir(outdir)

# write one CSV per extracted table
for i, df in enumerate(dfs, start=1):
    fullname = os.path.join(outdir, f"Comparison_between_Esperanto_and_Ido_{i}.csv")
    df.to_csv(fullname, index=False)

(screenshot of the results)

Jlebours commented 3 years ago

Now I have to integrate this function into our code and adapt it so that it runs for all the URLs.
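
A rough sketch of what that integration could look like (the module names, input file path, and base URL are taken from the code and tracebacks later in this thread, so this is not the exact main.py):

import ExtractHTML
import HTMLtoCSV

BASE_WIKIPEDIA_URL = "https://en.wikipedia.org/wiki/"

# loop over every page name listed in the input file, extract its wikitables
# and write them out as CSV files
with open("inputdata/wikiurls.txt") as f:
    for name in f.read().split():
        url = BASE_WIKIPEDIA_URL + name
        tables = ExtractHTML.get_tables(url)   # DataFrames parsed from the page's wikitables
        HTMLtoCSV.convert_csv(tables, name)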

Jlebours commented 3 years ago

I completely redid the extractor to implement the new methods. Here is the project at this commit: https://github.com/Jlebours/WikipediaExtractor_Python/tree/2280e65449e5aba7484778941dbc81e74e27e488. I saved the old version in the OLD_version directory.

Jlebours commented 3 years ago

When I run the main for one extraction, it works well, as we can see: (screenshot). I read the test.txt file: (screenshot).

Jlebours commented 3 years ago

Now, when I run it on the test.txt with 2 URLs, it doesn't work, but I found out why:

Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 13, in <module>
    HTMLtoCSV.convert_csv(tables, name)
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\HTMLtoCSV.py", line 13, in convert_csv
    table.to_csv(fullname, index=False)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\core\generic.py", line 3170, in to_csv
    formatter.save()
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\formats\csvs.py", line 185, in save
    f, handles = get_handle(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
OSError: [Errno 22] Invalid argument: './output\\Comparison_between_U.S._states_and_countries_by_GDP_(PPP)\n_1.csv'

Process finished with exit code 1

When I display the URLs from the file, I notice something: there are \n characters at the end of the URLs, because of the newlines in the file, and that's why the extractor can't read the URLs: (screenshot)

I must now delete these \n from the URLs.

Jlebours commented 3 years ago

I found the solution with the rstrip() method:

def read_urls():
    BASE_WIKIPEDIA_URL = "https://en.wikipedia.org/wiki/"
    allUrls = []
    with open("inputdata/wikiurls.txt", "r") as urls:
        for url in urls:
            finalUrl = BASE_WIKIPEDIA_URL + url
            allUrls.append([finalUrl.rstrip("\n"), url.rstrip("\n")])
    return allUrls
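
Each entry pairs the full URL with the bare page name, which is used later for the output file names. A quick check of what read_urls() produces, assuming wikiurls.txt starts with the line Comparison_(grammar) (illustrative content only):

urls = read_urls()
print(urls[0])
# ['https://en.wikipedia.org/wiki/Comparison_(grammar)', 'Comparison_(grammar)']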

Now when I run the extractor, it fails on Wikipedia pages that don't exist, like: https://en.wikipedia.org/wiki/Comparison_of_Axis_&_Allies_games

Jlebours commented 3 years ago

On Wikipedia pages that don't exist or that have no tables, I get this error:

Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 13, in <module>
    tables = ExtractHTML.get_tables(url)
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\ExtractHTML.py", line 21, in get_tables
    dfs = pandas.read_html(tables)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 1086, in read_html
    return _parse(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

Process finished with exit code 1

Jlebours commented 3 years ago

I added a method which checks if the Wikipedia page is valid:

import requests

def is_url_valid(url):
    # HEAD request: a page that doesn't exist returns 404, a valid one returns 200
    r = requests.head(url)
    return r.status_code == 200

It still doesn't work. I must find a way to continue my extraction and do nothing when there is no table on the page.

Jlebours commented 3 years ago

I found a solution to check whether there are tables on the Wikipedia pages and only then run the extractor. To test my code I used my test.txt with a page that doesn't exist and a page without tables: Comparison_of_Axis_&_Allies_games and Comparison_of_C_Sharp_and_Visual_Basic_.NET. I wanted to check that my function which gets the tables from the wiki pages works, so I added some prints:

def get_tables(url):
    dfs = []
    html = u_req.urlopen(url).read().decode("utf-8")
    bs = BeautifulSoup(html, 'lxml')
    tables = str(bs.find_all('table', {'class': 'wikitable'}))
    print(tables)
    if not tables:
        print("array is empty")
        return dfs
    else:
        print("array is not empty")
        dfs = pandas.read_html(str(tables))
    print(tables)
    return dfs

In this case, even though the page has no wikitable, tables is the string "[]" (the str() of an empty ResultSet), which is not empty, so the code enters the else branch and executes read_html, which raises the error.
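
The pitfall can be reproduced in isolation: str() of an empty ResultSet is the two-character string "[]", which is truthy:

from bs4 import BeautifulSoup

bs = BeautifulSoup("<p>a page without any wikitable</p>", "lxml")
tables = str(bs.find_all("table", {"class": "wikitable"}))
print(repr(tables))   # '[]'  -> a non-empty string
print(not tables)     # False -> the else branch always runs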

Jlebours commented 3 years ago

I then separated the functions; in the main, I first check whether I can get tables:

def get_tables(url):
    html = u_req.urlopen(url).read().decode("utf-8")
    bs = BeautifulSoup(html, 'lxml')
    tables = str(bs.find_all('table', {'class': 'wikitable'}))
    return tables

and then in the main, I check whether the result of the function is empty:

        if ExtractHTML.is_url_valid(url):
            tables = ExtractHTML.get_tables(url)
            # Wikipedia pages with no tables == "[]"
            if tables != "[]":
                dfs = pandas.read_html(str(tables))
                HTMLtoCSV.convert_csv(dfs, name)
                nbTables += len(dfs)  # count the DataFrames actually extracted

And it works: (screenshot)
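
As a side note, an equivalent check would be to test the ResultSet itself before converting it to a string, instead of comparing against the literal "[]" (just a sketch of an alternative, not what the project does):

wikitables = bs.find_all('table', {'class': 'wikitable'})
if wikitables:  # an empty ResultSet is falsy, so pages without wikitables are skipped
    dfs = pandas.read_html(str(wikitables))
    HTMLtoCSV.convert_csv(dfs, name)
    nbTables += len(dfs)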

Jlebours commented 3 years ago

Now I can extract tables from 326 of the 336 Wikipedia pages! It fails on the 327th because of this error:

Url 327 on 336
Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 21, in <module>
    dfs = pandas.read_html(str(tables))
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 1086, in read_html
    return _parse(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 920, in _parse
    for table in tables:
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 218, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 417, in _parse_thead_tbody_tfoot
    body = self._expand_colspan_rowspan(body_rows)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 462, in _expand_colspan_rowspan
    rowspan = int(self._attr_getter(td, "rowspan") or 1)
ValueError: invalid literal for int() with base 10: '31"'

Jlebours commented 3 years ago

I found the problem on the Wikipedia page: one cell has a malformed rowspan attribute whose value ends up as '31"', with a stray quote: (screenshot). The solution is to catch the ValueError, as I did:

            if tables != "[]":
                try:
                    dfs = pandas.read_html(str(tables))
                except Exception as exc:
                    print("Exception type: ", exc.__class__)
                    print(f"A wikitable in the url {i} : {name} have a syntax problem, so it can't extract it")
                else:
                    HTMLtoCSV.convert_csv(dfs, name)

(screenshot)
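
For context, the failure can be reproduced with a tiny fragment where the rowspan value contains a stray quote (an illustrative example, not the actual markup of the page):

import pandas as pd

# rowspan='31"' makes the attribute value the string '31"', which pandas
# tries to convert with int() and fails, exactly like in the traceback above
bad_html = """<table class="wikitable">
<tr><th>a</th><th>b</th></tr>
<tr><td rowspan='31"'>x</td><td>y</td></tr>
</table>"""
pd.read_html(bad_html)   # raises ValueError: invalid literal for int() with base 10: '31"'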

Jlebours commented 3 years ago

I can now run the extractor on all 336 URLs, here is the result! :D (screenshot)