What is utils.get_text_from_blob param with_encoding meant to do?

odscjames commented 1 year ago

The first use inside the function makes it return the content as bytes, and assuming 'utf-8'

The second use inside the function returns back to the user the charset used, but surely because of the first use the second use is never called?

The only time the function is actually called with the param set to True is in an error handler of clean.clean_invalid_documents and I don't understand that error handler, as it could end up leaving iati_activities_el undefined and then the very next line tries to use that variable.

https://github.com/IATI/refresher/commit/b011174b15c8ab5994fe3a039a12d3924306254f adds both uses

akmiller01 commented 1 year ago

The first use is within a try / except block that catches UnicodeDecodeError. So if the downloader.content_as_text() statement within the return line fails to decode as unicode, the function continues along to detect the encoding, and return it along with the content in the second use.

The original function without the with_encoding argument was already detecting the encoding to return the text, it just was never returning it. So when subsequent functions needed the text and the encoding, it didn't need to use chardet again.

The flow-control of clean.clean_invalid_documents calls continue on that error handler within a for loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el.

odscjames commented 1 year ago

The flow-control of clean.clean_invalid_documents calls continue on that error handler within a for loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el.

But it only calls continue on a second error - where utils.get_text_from_blob crashes. If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:


            try:
                blob_service_client = BlobServiceClient.from_connection_string(
                    config['STORAGE_CONNECTION_STR']) # EXECUTES
                blob_client = blob_service_client.get_blob_client(
                    container=config['SOURCE_CONTAINER_NAME'], blob=blob_name) # EXECUTES

                downloader = blob_client.download_blob() # EXECUTES

                large_parser = etree.XMLParser(huge_tree=True) # EXECUTES
                root = etree.parse(
                    BytesIO(downloader.content_as_bytes()), parser=large_parser) # EXECUTES BUT HAS ISSUES
                iati_activities_el = root.getroot()
                file_encoding = 'utf-8'
            except (AzureExceptions.ResourceNotFoundError) as e:
                logger.warning(
                    f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
                db.updateFileAsNotDownloaded(conn, id)
            except etree.XMLSyntaxError as e:
                logger.warning(
                    f"Cannot parse entire XML for hash: {hash} id: {id}. ") # EXECUTES
                try:
                    file_text, file_encoding = utils.get_text_from_blob(
                        downloader, hash, True) # EXECUTES AND RETURNS AN ENCODING, DOES NOT CRASH
                except:
                    logger.warning(
                        f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
                    db.updateCleanError(
                        conn, id, 'Could not parse')
                    continue

            # loop and save valid activities to clean doc
            activities = iati_activities_el.xpath("iati-activity") # EXECUTES BUT AT THIS POINT iati_activities_el IS IN A BAD STATE?

akmiller01 commented 1 year ago

But it only calls continue on a second error - where utils.get_text_from_blob crashes. If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:

Yes, that sounds correct. So the code could be updated to something like this to work as expected like this:

            try:
                blob_service_client = BlobServiceClient.from_connection_string(
                    config['STORAGE_CONNECTION_STR'])
                blob_client = blob_service_client.get_blob_client(
                    container=config['SOURCE_CONTAINER_NAME'], blob=blob_name)

                downloader = blob_client.download_blob()

                large_parser = etree.XMLParser(huge_tree=True)
                root = etree.parse(
                    BytesIO(downloader.content_as_bytes()), parser=large_parser)
                iati_activities_el = root.getroot()
                file_encoding = 'utf-8'
            except (AzureExceptions.ResourceNotFoundError) as e:
                logger.warning(
                    f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
                db.updateFileAsNotDownloaded(conn, id)
                continue # ADD CONTINUE HERE FOR FILE NOT FOUND
            except etree.XMLSyntaxError as e:
                logger.warning(
                    f"Cannot parse entire XML for hash: {hash} id: {id}. ")
                try:
                    file_text, file_encoding = utils.get_text_from_blob(
                        downloader, hash, True)
                    root = etree.fromstring(file_text, parser=large_parser) # ADD LXML PARSE FILE_TEXT STRING
                    iati_activities_el = root.getroot() # ADD GETROOT
                except:
                    logger.warning(
                        f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
                    db.updateCleanError(
                        conn, id, 'Could not parse')
                    continue

IATI / refresher

What is utils.get_text_from_blob param with_encoding meant to do? #276