Open tonybaloney opened 3 years ago
I'm having the same issue but simply using the JSON Library instead of pandas.
I assume this is related to the `seekable()` check.
+1: This is also an issue when trying to use PyPDF2's `PdfFileReader` class: https://pythonhosted.org/PyPDF2/PdfFileReader.html
I know this is late to the party and a slightly different case, but in case it helps anybody:
Follow up: I understand now that given the file comes over the network, it can't be seekable. 💡
My solution therefore is to use an `io.BytesIO` stream as an intermediary (`BytesIO` since I'm working with PDFs; for the original poster it may need to be `io.StringIO`).
The purpose of the code is to read a PDF, let the PyPDF2 library extract the first page, and save that page to the output binding. As with pandas in the original post, PyPDF2 requires seekable streams to work.
```python
import logging
import azure.functions as func
from PyPDF2 import PdfFileReader, PdfFileWriter  # library that requires seekable streams
from io import BytesIO


def main(fullpdf: func.InputStream, page1: func.Out[func.InputStream]):
    logging.info(f"Python blob trigger function processed blob\n"
                 f"Name: {fullpdf.name}\n"
                 f"Blob Size: {fullpdf.length} bytes")

    pdfbytes = fullpdf.read()  # returns bytes
    logging.info(f"pdfbytes: {type(pdfbytes)}")

    # create intermediary input stream
    fullpdfstream = BytesIO(pdfbytes)  # returns seekable stream
    logging.info(f"fullpdfstream: {type(fullpdfstream)}")

    pdf = PdfFileReader(fullpdfstream)
    pages = [0]
    pdfWriter = PdfFileWriter()
    for page_num in pages:
        pdfWriter.addPage(pdf.getPage(page_num))

    # create intermediary output stream
    page1stream = BytesIO()  # empty stream object that PyPDF2 can write to
    logging.info(f"page1stream: {type(page1stream)}")
    pdfWriter.write(page1stream)
    logging.info(f"page1stream.getvalue: {type(page1stream.getvalue())}")  # returns bytes

    # set the func.Out[func.InputStream] object to stream out the bytes to blob storage
    page1.set(page1stream.getvalue())
    page1stream.close()
```
comments welcome if this can be optimised or improved!
Piggybacking on this issue to mention that I'm getting a different error while trying to read a CSV using pandas 1.4.2:
```python
import pandas as pd
import azure.functions as func


def main(file: func.InputStream):
    df = pd.read_csv(file)
    # use DataFrame...
```
☝️ Code like this produces this error: `Exception: UnsupportedOperation: read1`

Not sure if the different file type makes this a totally different issue, or if they're linked. A fix for this seems like something that could help functions working on very large input files keep their memory footprint smaller. I am able to pass `BytesIO(file.read())` in and continue working, but the resulting memory profile after calling `file.read()` is pretty large on sufficiently large blobs.
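For reference, the `BytesIO(file.read())` workaround mentioned above looks like this. The CSV bytes here are just a stand-in for what `func.InputStream.read()` would return:

```python
import pandas as pd
from io import BytesIO

# stand-in for the bytes returned by func.InputStream.read()
raw = b"a,b\n1,2\n3,4\n"

# pd.read_csv needs a seekable stream; wrapping the bytes in BytesIO provides one
df = pd.read_csv(BytesIO(raw))
print(df.shape)  # (2, 2)
```

This buffers the whole blob in memory, which is exactly the memory-footprint trade-off described above.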
In this example:

Configured with function.json as:

The script fails because `InputStream` does not implement the seek function. It does implement `seekable()` as returning false, which is correct, but pandas doesn't adhere to that API, so you get this error:

Technically, this is a bug in pandas, because it should check `seekable()` first.

Is there a way to make the InputStream seekable if it comes from blob storage?
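To answer the closing question in code: since the blob comes over the network, the stream itself can't be made seekable, but a small helper can buffer it into memory only when needed. This is a sketch with a hypothetical `ensure_seekable` helper; the `NonSeekable` class below merely simulates a readable, non-seekable stream like `func.InputStream` for demonstration:

```python
import io


def ensure_seekable(stream):
    """Hypothetical helper: return the stream unchanged if it is already
    seekable, otherwise buffer its contents into an in-memory BytesIO."""
    if stream.seekable():
        return stream
    return io.BytesIO(stream.read())


class NonSeekable(io.RawIOBase):
    """Stand-in for func.InputStream: readable, but not seekable."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

    def seekable(self):
        return False


wrapped = ensure_seekable(NonSeekable(b"%PDF-1.4 example"))
print(wrapped.seekable())  # True
```

Libraries like pandas or PyPDF2 can then consume `wrapped` freely; the cost is still one full copy of the blob in memory, as noted above.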