bundesAPI / deutschland

The most important German APIs in one Python package.
Apache License 2.0

TypeError: 'NoneType' object is not subscriptable in __generate_result method #145

Open christian-unlockai opened 4 months ago

christian-unlockai commented 4 months ago

Description

Hi,

I encountered an issue with the deutschland package while trying to fetch financial reports using the Bundesanzeiger class. When running my tests, I received a TypeError indicating that a 'NoneType' object is not subscriptable. This occurs in the __generate_result method when trying to access the captcha_wrapper div.

Error Details

TypeError: 'NoneType' object is not subscriptable

Steps to Reproduce

  1. Initialize the Bundesanzeiger class.
  2. Call the get_reports method with a valid search term.
  3. Observe the error in the __generate_result method.

Code Snippet

Here is the relevant part of the code where the error occurs:

def __generate_result(self, content: str):
    """iterate trough all results and try to fetch single reports"""
    result = {}
    for element in self.__find_all_entries_on_page(content):
        get_element_response = self.__get_response(element.content_url)

        if self.__is_captcha_needed(get_element_response.text):
            soup = BeautifulSoup(get_element_response.text, "html.parser")
            captcha_image_src = soup.find("div", {"class": "captcha_wrapper"}).find(
                "img"
            )["src"]
            img_response = self.__get_response(captcha_image_src)
            captcha_result = self.captcha_callback(img_response.content)
            captcha_endpoint_url = soup.find_all("form")[1]["action"]
            get_element_response = self.session.post(
                captcha_endpoint_url,
                data={"solution": captcha_result, "confirm-button": "OK"},
            )

        content_soup = BeautifulSoup(get_element_response.text, "html.parser")
        content_element = content_soup.find(
            "div", {"class": "publication_container"}
        )

        if not content_element:
            continue

        element.report = content_element.text
        element.raw_report = content_element.prettify()

        result[element.to_hash()] = element.to_dict()

    return result
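
A note that may help narrow this down: BeautifulSoup's find() returns None when nothing matches, so the chained captcha lookup can fail in two different ways, and the exception type hints at which lookup came back empty. The sketch below uses a plain None in place of a real soup object. The helper failure_mode is hypothetical and only for illustration. Assuming the traceback points at that expression, a TypeError would suggest that the captcha_wrapper div was found but no img was found inside it.

```python
def failure_mode(lookup):
    """Return the name of the exception a lookup raises on a missing element."""
    try:
        lookup(None)
    except (AttributeError, TypeError) as exc:
        return type(exc).__name__
    return "no error"


# None stands in for what BeautifulSoup's find() returns when nothing matches.
# Case 1: the wrapper div is missing, so .find("img") is called on None.
missing_div = failure_mode(lambda div: div.find("img"))
# Case 2: the div exists but has no <img>, so ["src"] is applied to None.
missing_img = failure_mode(lambda img: img["src"])

print(missing_div, missing_img)  # AttributeError TypeError
```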

Additional Information

Logs

2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchen2?4-1.-search~table~panel-rows-2-search~table~row~panel-publication~link HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?7 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:21 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?7--captcha~panel-captcha_form-captcha_image&antiCache=1717244241383 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "POST /pub/de/suchergebnis?7-1.-captcha~panel-captcha_form HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?9 HTTP/1.1" 200 None (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchen2?4-1.-search~table~panel-rows-3-search~table~row~panel-publication~link HTTP/1.1" 302 0 (connectionpool.py:549)
2024-06-01 14:17:23 [DEBUG] https://www.bundesanzeiger.de:443 "GET /pub/de/suchergebnis?10 HTTP/1.1" 200 None (connectionpool.py:549)

Please let me know if further information is needed.

Thank you!

Christian

davidrzs commented 4 months ago

Can confirm the issue.

wirthual commented 4 months ago

Hi,

thank you for the detailed description. From the error I would assume that either the captcha was removed or the site structure changed. If it is the former, we can simply take out the captcha check.

I removed the following section and I was able to retrieve a result.

if self.__is_captcha_needed(get_element_response.text):
    soup = BeautifulSoup(get_element_response.text, "html.parser")
    captcha_image_src = soup.find("div", {"class": "captcha_wrapper"}).find(
        "img"
    )["src"]
    img_response = self.__get_response(captcha_image_src)
    captcha_result = self.captcha_callback(img_response.content)
    captcha_endpoint_url = soup.find_all("form")[1]["action"]
    get_element_response = self.session.post(
        captcha_endpoint_url,
        data={"solution": captcha_result, "confirm-button": "OK"},
    )

This was the code I ran:

from deutschland.bundesanzeiger import Bundesanzeiger
ba = Bundesanzeiger()
# search term
data = ba.get_reports("Deutsche Bahn AG")
# returns a dictionary with all reports found as fulltext reports
print(data.keys())

With results: dict_keys(['4442fe462193acf9a4bf741516a00dfa'])

The open question is whether this works in all cases, or whether the captcha still appears but with a changed page structure. In the latter case we would need to adapt the captcha detection.
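
If the detection does need adapting, one option is to make the image lookup tolerant of a missing element instead of chaining the calls. The following is only a minimal sketch under assumptions: it uses the standard library's html.parser rather than BeautifulSoup so that it runs without dependencies, and the names CaptchaImageFinder and find_captcha_image_src are made up for illustration, not part of the deutschland package. Whether the current page still uses a captcha_wrapper div at all is not known from this thread.

```python
from html.parser import HTMLParser
from typing import Optional


class CaptchaImageFinder(HTMLParser):
    """Record the src of the first <img> nested inside div.captcha_wrapper."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # how many <div>s deep we are inside the wrapper
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            classes = (attributes.get("class") or "").split()
            # Start counting at the wrapper div, then track nested divs.
            if self.depth > 0 or "captcha_wrapper" in classes:
                self.depth += 1
        elif tag == "img" and self.depth > 0 and self.src is None:
            self.src = attributes.get("src")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def find_captcha_image_src(html: str) -> Optional[str]:
    """Return the captcha image URL, or None if wrapper or image is absent."""
    finder = CaptchaImageFinder()
    finder.feed(html)
    return finder.src
```

With a helper like this, the caller could skip the captcha branch whenever None comes back and go straight to parsing the publication_container div, rather than raising on a page that no longer contains the image.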