Parse links (href) on tables in aspx (HTML Type) with Python - Githubissues

eduardolundgren / tracking.js

A modern approach for Computer Vision on the web

http://trackingjs.com


Parse links (href) on tables in aspx (HTML Type) with Python #344

Closed TeonaEcon closed 4 years ago

TeonaEcon commented 5 years ago

Python version: 3.6
Operating System: Windows

Description

I am trying to get all the linked pages listed in the table on this aspx page: https://reportal.ge/BannersMenu/Detailed-search-for-reports.aspx?lang=en-US

What I Did

I tried to parse the HTML and get all the links from it. The only thing that returned any result was:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import csv

page_url = "https://reportal.ge/Forms.aspx?payerCode=401985107&SystemID=9571&show=1&np=1&cid=-1&prd=show?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
# open the connection and download the HTML page from the URL
uClient = uReq(page_url)

# parse the HTML into a BeautifulSoup tree that can be searched by tag and attribute
page_soup = soup(uClient.read(), "html.parser")
uClient.close()  # release the connection once the page has been read
print(page_soup)  # print the parsed HTML

Do you have any suggestions for how to get the linked pages? They are individual pages, one for each company.

murat-aka commented 5 years ago

https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe

TeonaEcon commented 5 years ago

I did try that as well, but could not find the links I wanted.

The link of the webpage: "https://reportal.ge/Forms.aspx?payerCode=204935400&SystemID=6160&show=1&np=1&cid=IV&prd=show"

From the table I need to get the links to the company pages. This is the code that scrapes the links (hrefs):

# pip install regex
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request("https://reportal.ge/Forms.aspx?payerCode=204935400&SystemID=6160&show=1&np=1&cid=IV&prd=show")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))

print(links)

This does not print the links (hrefs) to the company pages, although the same code works for other websites.

Code behind the website

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Reporting Portal - Register of Reports

The site is operating in test mode

Get information about an entity:

The following types of information about an entity are available on the reporting portal:

Entity profile information
Entity's financial and/or management reports
Information about the entity's group
Information about the entity's auditors

Entity search

Entity name / identification code

Legal form

Category

| Identification # | Name | Category | Main activity | Legal form |
| --- | --- | --- | --- | --- |
| 205156374 | სს საქართველოს ფასიანი ქაღალდების გაერთიანებული რეგისტრატორი | IV | Other activities auxiliary to financial services, except insurance and pension funding | Joint-stock company |
| 404534170 | სს ზირაათ ბანკი საქართველო | სდპ | Activities of commercial banks | Joint-stock company |
| 401985107 | შპს მიკროსაფინანსო ორგანიზაცია ჯორჯიან ინტერნეიშენალ მისო | სდპ | Other credit services | LLC |
| 216425919 | შპს ჯეოსთილი | I | Manufacture of steel tubes, pipes, hollow profiles and related fittings | LLC |
| 412675779 | შპს მიკროსაფინანსო ორგანიზაცია (მისო) სვის-კრედიტი | სდპ | Other credit services | LLC |
| 205274273 | სს მიკროსაფინანსო ორგანიზაცია სვის კაპიტალ | სდპ | Other credit services | Joint-stock company |
| 204929961 | შპს თიბისი კაპიტალი | IV | Brokerage services related to securities and commodity contracts | LLC |
| 204542003 | სს ექსპრეს ტექნოლოჯიზ | IV | Other financial service activities, except insurance and pension funding, not elsewhere classified | Joint-stock company |
| 204970031 | სს დაზღვევის კომპანია ქართუ | სდპ | Other types of insurance | Joint-stock company |
| 404910637 | შპს მიკროსაფინანსო ორგანიზაცია კრედიტორი | სდპ | Other credit services | LLC |

1
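One way to narrow down why `link.get('href')` does not yield the company links is to label each value it returns. `get('href')` returns `None` for anchors that have no `href` attribute, and ASP.NET Web Forms pages often render links as `javascript:__doPostBack(...)` calls instead of plain URLs. Whether either case applies to reportal.ge is an assumption; the thread does not show the raw markup. The sketch below uses a hypothetical helper, `classify_href`, and made-up sample values.

```python
from urllib.parse import urlparse


def classify_href(href):
    """Return a rough label for an href value taken from an <a> tag."""
    if href is None or not href.strip():
        return "missing"      # anchor without a usable href attribute
    href = href.strip()
    if href.lower().startswith("javascript:"):
        return "javascript"   # handled by a script, not a plain URL
    if href.startswith("#"):
        return "fragment"     # points within the same page
    scheme = urlparse(href).scheme
    if scheme in ("http", "https"):
        return "absolute"
    if scheme:
        return "other"        # e.g. mailto: or tel:
    return "relative"         # must be joined with the page URL


# Hypothetical values, only to show the labels; not taken from the real page.
sample_hrefs = [
    None,
    "#",
    "javascript:__doPostBack('ctl00$grid','Select$0')",
    "Forms.aspx?payerCode=401985107",
    "https://reportal.ge/Forms.aspx?payerCode=401985107",
]
labels = [classify_href(h) for h in sample_hrefs]
for href, label in zip(sample_hrefs, labels):
    print(label, href)
```

If most table links come back as `javascript` or `missing`, the target URLs are produced by the page's scripts, and collecting `href` values alone will not reach the company pages.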

murat-aka commented 5 years ago

https://stackoverflow.com/a/16323809/3033613

Also see this page; scroll down to the part on find_all():

http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/
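As a rough illustration of the approach those pages describe, the sketch below collects only the anchors that sit inside a `<table>` and resolves relative links against the page URL. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the class name `TableLinkParser` and the sample markup are hypothetical, since the real structure of the reportal.ge table is not shown in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TableLinkParser(HTMLParser):
    """Collect the href of every <a> tag that appears inside a <table>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.table_depth = 0  # greater than 0 while inside a <table>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.table_depth > 0:
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href attribute
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


# Hypothetical markup; it only mimics a table of companies with one link per row.
SAMPLE_HTML = """
<a href="/Home.aspx">Home</a>
<table>
  <tr><td>401985107</td><td><a href="Forms.aspx?payerCode=401985107">Company A</a></td></tr>
  <tr><td>216425919</td><td><a href="Forms.aspx?payerCode=216425919">Company B</a></td></tr>
</table>
"""

parser = TableLinkParser("https://reportal.ge/")
parser.feed(SAMPLE_HTML)
table_links = parser.links
print(table_links)
```

With BeautifulSoup, the equivalent idea is to call `find_all('a')` on the table element instead of on the whole document, so that navigation links outside the table are left out.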

TeonaEcon commented 4 years ago

I used ScrapStorm, an AI-based program :) I found it faster.