Open odscjames opened 1 year ago
The first use is within a try
/ except
block that catches UnicodeDecodeError
. So if the downloader.content_as_text()
statement within the return line fails to decode as unicode, the function continues along to detect the encoding, and return it along with the content in the second use.
The original function without the with_encoding
argument was already detecting the encoding to return the text, it just was never returning it. So when subsequent functions needed the text and the encoding, it didn't need to use chardet again.
The flow-control of clean.clean_invalid_documents calls continue
on that error handler within a for
loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el
.
The flow-control of clean.clean_invalid_documents calls continue on that error handler within a for loop from above: https://github.com/IATI/refresher/blob/develop/src/library/clean.py#L165 So it will skip that iteration of the loop, and won't use the undefined iati_activities_el.
But it only calls continue on a second error - where utils.get_text_from_blob crashes. If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:
try:
blob_service_client = BlobServiceClient.from_connection_string(
config['STORAGE_CONNECTION_STR']) # EXECUTES
blob_client = blob_service_client.get_blob_client(
container=config['SOURCE_CONTAINER_NAME'], blob=blob_name) # EXECUTES
downloader = blob_client.download_blob() # EXECUTES
large_parser = etree.XMLParser(huge_tree=True) # EXECUTES
root = etree.parse(
BytesIO(downloader.content_as_bytes()), parser=large_parser) # EXECUTES BUT HAS ISSUES
iati_activities_el = root.getroot()
file_encoding = 'utf-8'
except (AzureExceptions.ResourceNotFoundError) as e:
logger.warning(
f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
db.updateFileAsNotDownloaded(conn, id)
except etree.XMLSyntaxError as e:
logger.warning(
f"Cannot parse entire XML for hash: {hash} id: {id}. ") # EXECUTES
try:
file_text, file_encoding = utils.get_text_from_blob(
downloader, hash, True) # EXECUTES AND RETURNS AN ENCODING, DOES NOT CRASH
except:
logger.warning(
f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
db.updateCleanError(
conn, id, 'Could not parse')
continue
# loop and save valid activities to clean doc
activities = iati_activities_el.xpath("iati-activity") # EXECUTES BUT AT THIS POINT iati_activities_el IS IN A BAD STATE?
But it only calls continue on a second error - where utils.get_text_from_blob crashes. If there is a document that is not UTF-8 and has an encoding such that LXML crashes but get_text_from_blob can detect a different charset wouldn't the execution path be:
Yes, that sounds correct. So the code could be updated to something like this to work as expected like this:
try:
blob_service_client = BlobServiceClient.from_connection_string(
config['STORAGE_CONNECTION_STR'])
blob_client = blob_service_client.get_blob_client(
container=config['SOURCE_CONTAINER_NAME'], blob=blob_name)
downloader = blob_client.download_blob()
large_parser = etree.XMLParser(huge_tree=True)
root = etree.parse(
BytesIO(downloader.content_as_bytes()), parser=large_parser)
iati_activities_el = root.getroot()
file_encoding = 'utf-8'
except (AzureExceptions.ResourceNotFoundError) as e:
logger.warning(
f"Blob not found for hash: {hash} and id: {id} - updating as Not Downloaded for the refresher to pick up.")
db.updateFileAsNotDownloaded(conn, id)
continue # ADD CONTINUE HERE FOR FILE NOT FOUND
except etree.XMLSyntaxError as e:
logger.warning(
f"Cannot parse entire XML for hash: {hash} id: {id}. ")
try:
file_text, file_encoding = utils.get_text_from_blob(
downloader, hash, True)
root = etree.fromstring(file_text, parser=large_parser) # ADD LXML PARSE FILE_TEXT STRING
iati_activities_el = root.getroot() # ADD GETROOT
except:
logger.warning(
f"Can not identify charset for hash: {hash} id: {id} to remove invalid activities.")
db.updateCleanError(
conn, id, 'Could not parse')
continue
The first use inside the function makes it return the content as bytes, and assuming 'utf-8'
The second use inside the function returns back to the user the charset used, but surely because of the first use the second use is never called?
The only time the function is actually called with the param set to True is in an error handler of clean.clean_invalid_documents and I don't understand that error handler, as it could end up leaving iati_activities_el undefined and then the very next line tries to use that variable.
https://github.com/IATI/refresher/commit/b011174b15c8ab5994fe3a039a12d3924306254f adds both uses