bundesAPI / deutschland

The most important APIs of Germany in one Python package.
Apache License 2.0

error when trying to extend bundesanzeiger search #75

Open time4breakfast opened 1 year ago

time4breakfast commented 1 year ago

I thought about contributing to your package by adding extended search functionality (i.e., not only searching across all documents, but also the possibility of limiting the search to certain types of documents). Unfortunately, this only works for some companies, while for others the captcha solver always fails. Any ideas why that might be? (For example, it works without errors for "Deutsche Bahn AG" but keeps failing for "Deutsche Bank AG".)

Change: add the value 22 to the search request:

response = self.session.get(
    f"https://www.bundesanzeiger.de/pub/de/start?0-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext={company_name}&area_select=22&search_button=Suchen"
)
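For illustration, the same extended request can also be issued with a plain requests session outside the package (a sketch only; it assumes the session already carries the cookies and headers the package normally sets up, and the helper name is made up):

from urllib.parse import quote_plus
import requests

def search_with_area_filter(session: requests.Session, company_name: str, area_select: str = "22") -> requests.Response:
    # hypothetical helper: same search URL as above, with the area_select filter appended
    url = (
        "https://www.bundesanzeiger.de/pub/de/start?0-2.-top%7Econtent%7Epanel-left%7Ecard-form="
        f"&fulltext={quote_plus(company_name)}&area_select={area_select}&search_button=Suchen"
    )
    return session.get(url)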

time4breakfast commented 1 year ago

Just learned that there is a new format called ESEF. Reports in this new format do not have a captcha that needs to be solved, which is why the soup.find() call returns None.

wirthual commented 1 year ago

Thanks for looking into this.

Does this mean we need to adapt or extend our code?

mariedittmer commented 1 year ago

Just learned that there is a new format called ESEF. Reports in this new format do not have a captcha that needs to be solved, which is why the soup.find() call returns None.

Hi, did you solve it? I think I have the same problem. It would be super nice to adapt the code! :)

wirthual commented 1 year ago

Well, you could add an additional check here to see whether a captcha is present. Something like:

if soup.find("div", {"class": "captcha_wrapper"}) is not None:
    # solve the captcha here

time4breakfast commented 1 year ago

Hi,

We kind of solved it / implemented a workaround for our use case: if soup.find() returns None, assume that this is an ESEF report (so there is no captcha to solve) and just find and click the "accept" button on the website. After that, we implemented a function or two that are able to read and process the ESEF viewer (which painfully slows down your browser when you try to work with it or just view something).

I don't have the code here with me but will provide it after the holidays.
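
In the meantime, a rough, illustrative sketch of the flow described above (not the author's code; the selector names are guesses based on the sample shared later in this thread, and session is assumed to be a prepared requests.Session):

import requests
from bs4 import BeautifulSoup

def open_report(session: requests.Session, detail_page_html: str):
    # illustrative only: decide between the old captcha flow and the ESEF flow
    soup = BeautifulSoup(detail_page_html, "html.parser")
    if soup.find("div", {"class": "captcha_wrapper"}) is not None:
        # old format: a captcha is present and has to be solved as before
        raise NotImplementedError("solve the captcha here")
    # ESEF format: no captcha; follow the button that opens the ESEF viewer
    container = soup.find("div", {"class": "esef-select-container"})
    if container is None:
        return None
    accept_link = container.find("a", {"class": "btn btn-primary"})
    return session.get(accept_link["href"])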

mdittmer-A commented 1 year ago

Hi,

thank you very much! It would be great if you could share the function for the ESEF viewer. Do you know if there are plans to make a PR for this feature?

thanks in advance

jurekmff commented 9 months ago

Hi @time4breakfast, I am running into the same issue. Do you mind sharing your code? Thanks a lot

time4breakfast commented 7 months ago

Hi jurekmff, sorry for the late reply.

Current situation: I switched companies and no longer have access to the code. But I found a test sample on my machine (which unfortunately imports my own, corrected version of the deutschland API that I no longer have -.-) that I will share with you.

In theory, what you need to do to fix the error is:

The ESEF report itself works a little differently from the old format: in a standard browser (just as a normal user looking at it), it opens inside its very own "viewer" implementation, which took forever to load on my machine and usually slowed down the whole computer as well. The format itself is subdivided into several (kinds of) pages containing different contents and use cases. Using the code sample I provide further down in this post, you can start trying to access the ESEF report(s) yourself. The sample was done for Deutsche Bank. You should be fine replacing the first import, "from handelsregister_updates import Bundesanzeiger", with "from deutschland.bundesanzeiger import Bundesanzeiger", and making the necessary try-except adaptation.
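
For reference, the try-except adaptation of the import mentioned above could look roughly like this (a sketch that simply falls back to the published package when the local module is not available):

try:
    # the author's local, corrected module referenced in the sample below
    from handelsregister_updates import Bundesanzeiger
except ImportError:
    # published deutschland package
    from deutschland.bundesanzeiger import Bundesanzeiger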

Also, keep in mind that the domain bundesanzeiger.de will change in the future to unternehmensregister.de. In the background it is the same company but they are trying to separate the data, domains and everything more clearly.

Hope that helps. If you have any further questions, don't hesitate to ask. I'll hope to be able to answer more quickly in the future.

Best regards,
time4breakfast


# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

from handelsregister_updates import Bundesanzeiger
ba = Bundesanzeiger()
reports = ba.get_reports("Deutsche Bahn AG")

#GET /pub/de/start?12-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext=Deutsche+Bank+AG&area_select=22&search_button=Suchen HTTP/1.1

import requests
from bs4 import BeautifulSoup
import dateparser

session = requests.Session()
session.cookies["cc"] = "1663315556-37c8ed90cc5e8d6c-10"
session.headers.update(
            {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,et;q=0.6,pl;q=0.5",
                "Cache-Control": "no-cache",
                "Connection": "keep-alive",
                "DNT": "1",
                "Host": "www.bundesanzeiger.de",
                "Pragma": "no-cache",
                "Referer": "https://www.bundesanzeiger.de/",
                "sec-ch-ua-mobile": "?0",
                "Sec-Fetch-Dest": "document",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Site": "same-origin",
                "Sec-Fetch-User": "?1",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
            }
        )
# get the jsessionid cookie
response = session.get("https://www.bundesanzeiger.de")
# go to the start page
response = session.get("https://www.bundesanzeiger.de/pub/de/start?0")
# perform the search
response = session.get(
    "https://www.bundesanzeiger.de/pub/de/start?0-2.-top%7Econtent%7Epanel-left%7Ecard-form=&fulltext=Deutsche+Bank+AG&area_select=22&search_button=Suchen"
)

def find_all_entries_on_page(page_content: str):
    soup = BeautifulSoup(page_content, "html.parser")
    wrapper = soup.find("div", {"class": "result_container"})
    if wrapper is None:
        # no result container on the page (e.g. no hits for this search)
        return
    rows = wrapper.find_all("div", {"class": "row"})
    for row in rows:
        info_element = row.find("div", {"class": "info"})
        if not info_element:
            continue

        link_element = info_element.find("a")
        if not link_element:
            continue

        entry_link = link_element.get("href")
        entry_name = link_element.contents[0].strip()

        date_element = row.find("div", {"class": "date"})
        if not date_element:
            continue

        date = dateparser.parse(date_element.contents[0], languages=["de"])

        company_name_element = row.find("div", {"class": "first"})
        if not company_name_element:
            continue

        company_name = company_name_element.contents[0].strip()

        yield date, entry_name, entry_link, company_name

# get menu of esef report
def get_esef_menu(find_res):
    menulist = []
    for menu_item in find_res:
        menulist.append({"link":menu_item.find("a", {"class": "link-file"})['href'],
                         "name":menu_item.find("a", {"class": "link-file"})['title']})
    return menulist

# list all results
result = []
for element in find_all_entries_on_page(response.text):
    result.append(element)

# extract esef_report as xml/BeautifulSoup object from link
def get_esef_report(esef_link):
    ja_db = session.get(esef_link)
    mysoup = BeautifulSoup(ja_db.text.encode('utf-8'), "lxml")

    return mysoup

# find esefs within the results
esef_list = []
for entry in result:
    get_element_response = session.get(entry[2])
    soup = BeautifulSoup(get_element_response.text, "html.parser")
    if soup.find("div", {"class": "esef-select-container"}) is not None:
        esef_session = session.get(soup.find("div", {"class": "esef-select-container"}).find("a", {"class": "btn btn-primary"})['href'])
        esef_bs = BeautifulSoup(esef_session.text.encode("utf-8"), "html.parser")
        esef_menu = get_esef_menu(esef_bs.find_all("div", {"class": "file-list-item level-1"}))
        esef_list.append(esef_menu)
        # fetch each esef report; keep the soup of the last one for the demo below
        for menu_entry in esef_menu:
            report_soup = get_esef_report(menu_entry['link'])

# find text in a BeautifulSoup object (example: all strings containing "Honorar")
report_soup.find_all(string=lambda t: "Honorar" in t)