Closed v-anne closed 1 month ago
Your pull request is modifying functions with the following pre-existing issues:
📄 File: cl/custom_filters/templatetags/pacer.py
Function | Unhandled Issue |
---|---|
price |
TypeError: can't multiply sequence by non-int of type 'float' /api/rest/{ve... Event Count: 6 |
Did you find this useful? React with a 👍 or 👎
I took a look and this doesn't seem to break anything, but of course our existing tests didn't catch this bug, either. Please take a close look before approving.
I'm confused here. rd.page_count
is an integer field and the Sentry issue is complaining about can't multiply sequence by non-int of type 'float'
? Huh?
That's why I was confused too. I have no idea why it's throwing an error.
The item that threw the error has a page_count of 4. Very strange. I'll assign for review though.
I thought the model restricts data to be ints. And of course an int * float is a float, so there shouldn't be any issue.
I'm confused here. rd.page_count is an integer field and the Sentry issue is complaining about can't multiply sequence by non-int of type 'float'? Huh?
The item that threw the error has a page_count of 4. Very strange.
@mlissner @v-anne I believe I've identified the root cause of the issue. We're using Doctor to compute the number of pages in a document, and Doctor responses are returned as text. This means we're typically updating rd.page_count
with a numeric string (like '15') instead of an integer.
Additionally, when we use the instance
object from the post_save
signal to create the email, we're bypassing the usual data retrieval process. This is in contrast to our regular templates, where we first retrieve records from the database before passing them to the template. During this retrieval, the data undergoes the necessary type casting to ensure it's in the correct format.
I ran a test where I attempted to update the page_count
field with a numeric string ('5'). Django successfully converted the string to an integer when saving the data to the database. However, I noticed that the instance
object within the post_save
signal still contained the page_count
value as a string
. It seems like Django handles the necessary type conversion for database operations but it doesn't update the in-memory instance with the casted value.
I thought the model restricts data to be ints
You're right! Django's IntegerField is designed to only accept numbers. But it's flexible. You can give it a number directly (like 5) or even a numeric string ('5') and Django's internal mechanisms will handle the necessary type conversion ( check the get_prep_value method), ensuring that the data is stored as an integer in the database.
if you attempt to assign a value that cannot be converted to an integer, Django will raise an exception.
Wow, thanks @ERosendo. It would've taken me a while to figure that out. That means my hotfix shouldn't break the frontend, right?
That means my hotfix shouldn't break the frontend
Your code looks good. I don't think it will cause any frontend problems.
The hot fix will fix the issue, but it's the wrong way to do so, I think, because somebody in a year or two will see the cast happening here, find it redundant and remove it again.
I think a better way to do this is to get the correct value back from doctor or to cast it as an int from doctor immediately. With that done, we can add an assertion (which would be self documenting) in the save method to ensure that it's always an int at that point. Notionally:
assert isinstance(rd.page_number, int), "page_number must be an int or None."
Something like that?
I'm looking at the doctor codebase, namely doctor/tasks.py.
Is this the function that's used for computing page count?
def get_page_count(path, extension):
"""Get the number of pages, if appropriate mimetype.
:param path: A path to a binary (pdf, wpd, doc, txt, html, etc.)
:param extension: The extension of the binary.
:return: The number of pages if possible, else return None
"""
if extension == "pdf":
try:
reader = PdfReader(path)
return len(reader.pages)
except (
IOError,
ValueError,
TypeError,
KeyError,
AssertionError,
PdfReadError,
):
# IOError: File doesn't exist. My bad.
# ValueError: Didn't get an int for the page count. Their bad.
# TypeError: NumberObject has no attribute '__getitem__'. Ugh.
# KeyError, AssertionError: assert xrefstream["/Type"] == "/XRef". WTF?
# PdfReadError: Something else. I have no words.
return 0
elif extension == "wpd":
# Best solution appears to be to dig into the binary format
pass
elif extension == "doc":
# Best solution appears to be to dig into the XML of the file
# itself: http://stackoverflow.com/a/12972502/64911
pass
return None
If so, I'm confused how a str datatype is ever being returned. Shouldn't len() return an int?
I separately looked at cl/corpus_importer/tasks.py
, and saw this bit of code:
# request.content is sometimes a str, sometimes unicode, so
# force it all to be bytes, pleasing hashlib.
rd.sha1 = sha1(pdf_bytes)
response = async_to_sync(microservice)(
service="page-count",
item=rd,
)
if response.is_success:
rd.page_count = response.text
I assume I could just do int(response.text)
without touching doctor. Is that understanding correct?
Yeah, that looks like the right place to fix it. We'd just want to watch out for None, I think.
I've used PACER for a while and I've only encountered pdfs. How often do the other file formats come up?
Never, but we use Doctor for other things.
I'm afraid that change broke a handful of other tests. Any guesses for why?
Looks like the assertion fails when it's none, so that's one thing. You can just remove it, really.
The 1 != 2 error might be a new one due to a bad faker? The other I'm not sure without digging. 🤔
Fixed by allowing the assertion to accept None
. I'm embarrassed to say I thought it was already doing that. I think this is ready for review.
Thanks for the multiple different approaches and tweaks here. Merged!
This is a hotfix for #4589, which contains an issue related to variable types when communicating to users that a prayer for a document was granted.
The email template contains the price of the document, which is calculated using the same function that is used to display prices on the frontend html:
price()
incl/custom_filters/templatetags/pacer.py
.Prior to my use of this function as part of the pray and pay project, I think this function was only used for frontend rendering, which made datatypes less important. Now that the function has backend uses, however, it's an issue.
I made a simple change to the function by casting
rd.page_count
to an int. However, if this creates larger problems on the front end, I will just write a new pricing function rather than break this one, which is used on thousands, if not millions of pages.