At the start, we were only able to extract tables from this page: https://en.wikipedia.org/wiki/Comparison_(grammar). Those tables are simple, with a first line of headers. To do it, we used these two methods to collect the headers and the rows of the tables:
def get_Table_Headers(table):
    """Get the header cells of the first row of `table`."""
    headers = []
    for th in table.find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers
def get_Table_Rows(table):
    """Get all content rows of `table`."""
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, search for th tags instead
            # (Wikipedia tables sometimes use th cells inside body rows)
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
The new goal was to extract tables from this page: https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido. The extractor must then handle tables whose body rows can also contain th header cells.
I wrote two new, shorter methods to replace the ones in the issue description; they work for the tables of both the new page and the first page:
def get_headers(table):
    headers = []
    for column in table.find("tr").find_all(['td', 'th']):
        headers.append(column.text.strip())
    return headers

def get_rows(table):
    rows = []
    for row in table.find_all('tr')[1:]:
        rows.append([val.text.strip() for val in row.find_all(['td', 'th'])])
    return rows
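A small usage sketch of these two helpers (not taken from the project; the output file name is made up for illustration, and it assumes the page has at least one wikitable):

import csv
import urllib.request as u_req
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido"
html = u_req.urlopen(url).read().decode("utf-8")
table = BeautifulSoup(html, "lxml").find("table", {"class": "wikitable"})

with open("first_table.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(get_headers(table))   # first row of the wikitable
    writer.writerows(get_rows(table))     # remaining rows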
Here is the CSV I extract with these methods for the first table:
And for the second table:
I'm now trying to extract tables from this page: https://en.wikipedia.org/wiki/Comparison_between_Ido_and_Interlingua
There are 8 tables. The extractor works for the first and the second, but it fails on the third, I think because of colspan attributes:
We must find a solution. When we check how the CSV of this table is modeled in the Java project, we see a table with an empty column for the colspan:
I found a solution which can extract any table: it uses the pandas.read_html() function. I wrote a small test script and tried it on this page: https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido
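The original test script and its results are not reproduced here; a minimal sketch of that kind of test, using the same wikitable filter and read_html call that appear later in this log:

import pandas
import urllib.request as u_req
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Comparison_between_Esperanto_and_Ido"
html = u_req.urlopen(url).read().decode("utf-8")
tables = BeautifulSoup(html, "lxml").find_all("table", {"class": "wikitable"})

# read_html expands colspan/rowspan itself and returns one DataFrame per table
dfs = pandas.read_html(str(tables))
for i, df in enumerate(dfs):
    print(f"table {i}: {df.shape[0]} rows x {df.shape[1]} columns")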
Now I have to integrate this function into our code and adapt it so that it runs for all URLs.
I completely redid the extractor to implement the new methods. The project is at this commit: https://github.com/Jlebours/WikipediaExtractor_Python/tree/2280e65449e5aba7484778941dbc81e74e27e488 (I saved the old version in the OLD_version directory).
When I run the main for a single extraction, it works well, as we can see. I then read the test.txt file:
Now, when I run it on the test.txt with 2 URLs, it doesn't work, but I found why:
Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 13, in <module>
    HTMLtoCSV.convert_csv(tables, name)
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\HTMLtoCSV.py", line 13, in convert_csv
    table.to_csv(fullname, index=False)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\core\generic.py", line 3170, in to_csv
    formatter.save()
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\formats\csvs.py", line 185, in save
    f, handles = get_handle(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
OSError: [Errno 22] Invalid argument: './output\\Comparison_between_U.S._states_and_countries_by_GDP_(PPP)\n_1.csv'
Process finished with exit code 1
When I display the URLs from the file, I notice something: there are \n characters at the end of the URLs (the file contains lots of newlines), and that trailing \n ends up in the output file name, which is why the extractor fails:
I must now remove these \n from the URLs. I found the solution with the rstrip() method:
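A minimal sketch, assuming the URLs are read line by line from test.txt (file name and reading logic are assumptions):

# rstrip() removes the trailing newline before the URL is used to build
# the output file name; blank lines are skipped
with open("test.txt", "r", encoding="utf-8") as f:
    urls = [line.rstrip() for line in f if line.strip()]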
Now when I run the extractor, it fails on Wikipedia pages which don't exist, like https://en.wikipedia.org/wiki/Comparison_of_Axis_&_Allies_games. On Wikipedia pages which don't exist or which have no tables, I get this error:
Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 13, in <module>
    tables = ExtractHTML.get_tables(url)
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\ExtractHTML.py", line 21, in get_tables
    dfs = pandas.read_html(tables)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 1086, in read_html
    return _parse(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found
Process finished with exit code 1
I added a method which checks whether the Wikipedia page is valid:
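The snippet itself is not reproduced here; a minimal sketch of what such a check could look like (the real is_url_valid in the project may differ), treating any HTTP error or unreachable URL as an invalid page:

import urllib.request as u_req
import urllib.error

def is_url_valid(url):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with u_req.urlopen(url) as response:
            return response.status == 200
    except urllib.error.URLError:
        return False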
It still doesn't work. I must find a solution to continue my extraction and simply do nothing when there is no table on the page.
I found a solution: check whether there are tables in the Wikipedia page, and only then run the extractor. To test my code, I used my test.txt with a page which doesn't exist and a page without tables:
Comparison_ofAxis&_Allies_games
Comparison_of_C_Sharp_and_VisualBasic.NET
I wanted to check whether my function which gets the tables from wiki pages works, so I added some prints:
def get_tables(url):
    dfs = []
    html = u_req.urlopen(url).read().decode("utf-8")
    bs = BeautifulSoup(html, 'lxml')
    tables = str(bs.find_all('table', {'class': 'wikitable'}))
    print(tables)
    if not tables:
        print("array is empty")
        return dfs
    else:
        print("array is not empty")
        dfs = pandas.read_html(str(tables))
        print(tables)
        return dfs
In this case, even when the page has no table, the code enters the else branch and executes the read_html line, which raises the error: find_all() is converted to a string, so `tables` is the string "[]" and is never empty.
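A two-line illustration of why the emptiness check never triggers:

print(str([]))        # prints: []  (a two-character string)
print(bool(str([])))  # prints: True -> `if not tables` is never taken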
I then separated the functions in the main: before converting anything, I check whether I can get tables, i.e. whether the result of the function is empty:
if ExtractHTML.is_url_valid(url):
    tables = ExtractHTML.get_tables(url)
    # Wikipedia pages with no tables == "[]"
    if tables != "[]":
        dfs = pandas.read_html(str(tables))
        HTMLtoCSV.convert_csv(dfs, name)
        nbTables += len(tables)
And it works! I can now extract tables from 326 of the 336 Wikipedia pages.
It failed on the 327th because of this error:
Url 327 on 336
Traceback (most recent call last):
  File "C:\Users\Johan\PycharmProjects\WikipediaExtractor_Python\main.py", line 21, in <module>
    dfs = pandas.read_html(str(tables))
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 1086, in read_html
    return _parse(
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 920, in _parse
    for table in tables:
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 218, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 417, in _parse_thead_tbody_tfoot
    body = self._expand_colspan_rowspan(body_rows)
  File "C:\Users\Johan\AppData\Local\Programs\Python\Python39-32\lib\site-packages\pandas\io\html.py", line 462, in _expand_colspan_rowspan
    rowspan = int(self._attr_getter(td, "rowspan") or 1)
ValueError: invalid literal for int() with base 10: '31"'
I found the problem on the Wikipedia page: one cell has a malformed rowspan attribute whose value comes out as '31"', which pandas cannot convert to an integer.
The solution is to catch the ValueError, as I did:
if tables != "[]":
try:
dfs = pandas.read_html(str(tables))
except Exception as exc:
print("Exception type: ", exc.__class__)
print(f"A wikitable in the url {i} : {name} have a syntax problem, so it can't extract it")
else:
HTMLtoCSV.convert_csv(dfs, name)
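A tighter variant of the same idea (a sketch, not the committed code; the function name is made up for illustration): catch only the ValueError that read_html raises on the malformed colspan/rowspan value, so unrelated failures still surface.

import pandas

def tables_to_dataframes(tables_html, page_name):
    """Parse the wikitables' HTML; skip the page if pandas rejects it."""
    try:
        return pandas.read_html(tables_html)
    except ValueError as exc:
        print(f"A wikitable of {page_name} has a syntax problem, skipping it ({exc})")
        return []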
I can now run the extractor on all 336 URLs; here is the result! :D