freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com
Other
550 stars 151 forks source link

fix(favorites): fixing RECAP price bug in pray-and-pay emails #4592

Closed v-anne closed 1 month ago

v-anne commented 1 month ago

This is a hotfix for #4589, which contains an issue related to variable types when communicating to users that a prayer for a document was granted.

The email template contains the price of the document, which is calculated using the same function that is used to display prices on the frontend html: price() in cl/custom_filters/templatetags/pacer.py.

Prior to my use of this function as part of the pray and pay project, I think this function was only used for frontend rendering, which made datatypes less important. Now that the function has backend uses, however, it's an issue.

I made a simple change to the function by casting rd.page_count to an int. However, if this creates larger problems on the front end, I will just write a new pricing function rather than break this one, which is used on thousands, if not millions of pages.

sentry-io[bot] commented 1 month ago

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/custom_filters/templatetags/pacer.py

Function Unhandled Issue
price TypeError: can't multiply sequence by non-int of type 'float' /api/rest/{ve...
Event Count: 6

Did you find this useful? React with a 👍 or 👎

v-anne commented 1 month ago

I took a look and this doesn't seem to break anything, but of course our existing tests didn't catch this bug, either. Please take a close look before approving.

mlissner commented 1 month ago

I'm confused here. rd.page_count is an integer field and the Sentry issue is complaining about can't multiply sequence by non-int of type 'float'? Huh?

v-anne commented 1 month ago

That's why I was confused too. I have no idea why it's throwing an error.

mlissner commented 1 month ago

The item that threw the error has a page_count of 4. Very strange. I'll assign for review though.

v-anne commented 1 month ago

I thought the model restricts data to be ints. And of course an int * float is a float, so there shouldn't be any issue.

ERosendo commented 1 month ago

I'm confused here. rd.page_count is an integer field and the Sentry issue is complaining about can't multiply sequence by non-int of type 'float'? Huh?

The item that threw the error has a page_count of 4. Very strange.

@mlissner @v-anne I believe I've identified the root cause of the issue. We're using Doctor to compute the number of pages in a document, and Doctor responses are returned as text. This means we're typically updating rd.page_count with a numeric string (like '15') instead of an integer.

Additionally, when we use the instance object from the post_save signal to create the email, we're bypassing the usual data retrieval process. This is in contrast to our regular templates, where we first retrieve records from the database before passing them to the template. During this retrieval, the data undergoes the necessary type casting to ensure it's in the correct format.

I ran a test where I attempted to update the page_count field with a numeric string ('5'). Django successfully converted the string to an integer when saving the data to the database. However, I noticed that the instance object within the post_save signal still contained the page_count value as a string. It seems like Django handles the necessary type conversion for database operations but it doesn't update the in-memory instance with the casted value.

I thought the model restricts data to be ints

You're right! Django's IntegerField is designed to only accept numbers. But it's flexible. You can give it a number directly (like 5) or even a numeric string ('5') and Django's internal mechanisms will handle the necessary type conversion ( check the get_prep_value method), ensuring that the data is stored as an integer in the database.

if you attempt to assign a value that cannot be converted to an integer, Django will raise an exception.

v-anne commented 1 month ago

Wow, thanks @ERosendo. It would've taken me a while to figure that out. That means my hotfix shouldn't break the frontend, right?

ERosendo commented 1 month ago

That means my hotfix shouldn't break the frontend

Your code looks good. I don't think it will cause any frontend problems.

mlissner commented 1 month ago

The hot fix will fix the issue, but it's the wrong way to do so, I think, because somebody in a year or two will see the cast happening here, find it redundant and remove it again.

I think a better way to do this is to get the correct value back from doctor or to cast it as an int from doctor immediately. With that done, we can add an assertion (which would be self documenting) in the save method to ensure that it's always an int at that point. Notionally:

assert isinstance(rd.page_number, int), "page_number must be an int or None."

Something like that?

v-anne commented 1 month ago

I'm looking at the doctor codebase, namely doctor/tasks.py. Is this the function that's used for computing page count?

def get_page_count(path, extension):
    """Get the number of pages, if appropriate mimetype.

    :param path: A path to a binary (pdf, wpd, doc, txt, html, etc.)
    :param extension: The extension of the binary.
    :return: The number of pages if possible, else return None
    """
    if extension == "pdf":
        try:
            reader = PdfReader(path)
            return len(reader.pages)
        except (
            IOError,
            ValueError,
            TypeError,
            KeyError,
            AssertionError,
            PdfReadError,
        ):
            # IOError: File doesn't exist. My bad.
            # ValueError: Didn't get an int for the page count. Their bad.
            # TypeError: NumberObject has no attribute '__getitem__'. Ugh.
            # KeyError, AssertionError: assert xrefstream["/Type"] == "/XRef". WTF?
            # PdfReadError: Something else. I have no words.
            return 0

    elif extension == "wpd":
        # Best solution appears to be to dig into the binary format
        pass
    elif extension == "doc":
        # Best solution appears to be to dig into the XML of the file
        # itself: http://stackoverflow.com/a/12972502/64911
        pass
    return None

If so, I'm confused how a str datatype is ever being returned. Shouldn't len() return an int?

I separately looked at cl/corpus_importer/tasks.py, and saw this bit of code:

    # request.content is sometimes a str, sometimes unicode, so
    # force it all to be bytes, pleasing hashlib.
    rd.sha1 = sha1(pdf_bytes)
    response = async_to_sync(microservice)(
        service="page-count",
        item=rd,
    )
    if response.is_success:
        rd.page_count = response.text

I assume I could just do int(response.text) without touching doctor. Is that understanding correct?

mlissner commented 1 month ago

Yeah, that looks like the right place to fix it. We'd just want to watch out for None, I think.

v-anne commented 1 month ago

I've used PACER for a while and I've only encountered pdfs. How often do the other file formats come up?

mlissner commented 1 month ago

Never, but we use Doctor for other things.

v-anne commented 1 month ago

I'm afraid that change broke a handful of other tests. Any guesses for why?

mlissner commented 1 month ago

Looks like the assertion fails when it's none, so that's one thing. You can just remove it, really.

The 1 != 2 error might be a new one due to a bad faker? The other I'm not sure without digging. 🤔

v-anne commented 1 month ago

Fixed by allowing the assertion to accept None. I'm embarrassed to say I thought it was already doing that. I think this is ready for review.

mlissner commented 1 month ago

Thanks for the multiple different approaches and tweaks here. Merged!