freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Hand off recap.email to @albertisfu #1901

Closed albertisfu closed 2 years ago

albertisfu commented 2 years ago

These are my questions for now about the Todo tasks in the project:

Create celery task for processing emails

When the Lambda function POSTs to /api/rest/v3/recap-email/, EmailProcessingQueueViewSet processes the request and triggers do_recap_document_fetch, which parses the email using S3NotificationEmail and obtains docket_entries. So a notification email could have many docket entries or just one? For each docket entry, should there be one RECAP document?

API key based permission set for Recap Email & Lambda Functions
From what I saw, right now the API doesn’t need authentication, so it’s necessary to add an auth method through a Bearer Token and use RECAPUploaders permissions, is that right?

Create homepage for recap.email
Is this page going to live outside Courtlistener?
What’s sealed content?

Develop scheme for each user to have a recap.email email address
I saw that you already have the signal to create the recap email based on username, this signal is triggered when a new user is created.
Is it already considered how to assign the recap.email for existing users?

albertisfu commented 2 years ago

Some updates:

mlissner commented 2 years ago

OK, sorry to leave you hanging here, Alberto. @tewen, if you can check my work on these questions from Alberto, that'd be great too.

A few replies:

So a notification email could have many docket entries or just one?

I think just one for now. IIRC, the reason for it using an array is so that the JSON format could be the same as for other types of objects coming from PACER.

Where can I find a real example of a notification email?

The example of an NEF that you found is good. You can find others in the test fixtures for Juriscraper, too:

https://github.com/freelawproject/juriscraper/tree/main/tests/examples/pacer/nef

I couldn’t find the code where you get the PDF document

Looks like you found fetch_pacer_doc_by_rd. I think that should work, but it needs to be upgraded to use the one-time "magic" links, as mentioned here: https://github.com/freelawproject/courtlistener/issues/1708#issuecomment-915632763

So with the examples of notification emails, I think I’ll be able to understand better about the magic link and how to handle it.

Yes. We don't have any usable magic links, because everybody always...uses them, and they're only usable once. That said, when we have all our other code ready to go, we have a partner lined up that is willing to turn on this system. When he does, I think we should be ready to race ahead to get his content either with code or manually. It's going to be a bit like debugging code in production since there's no other way to get unused magic links.

I couldn’t find the code to resend the email notification to the real user email account with recap file attached

Yes, this needs to be done.

Do you have an example of the email to be sent to the real user email after getting the PDF?

No, though we have a few other emails that we send, so you could use those as starting places. Our emails are not very pretty though, so if you know tools for making better (compliant) HTML emails, I'd be happy to consider them.

From what I saw, right now the API doesn’t need authentication, so it’s necessary to add an auth method through a Bearer Token and use RECAPUploaders permissions, is that right?

Precisely.

Is [https://recap.email] going to live outside Courtlistener?

Probably the best thing to do here is to set up recap.email to redirect to a page on https://free.law/. That'd keep all our products in a single place, and it'd make sure that people that saw the emails could figure out what's going on with them. I had grander visions of doing a whole website for recap.email, but it feels excessive in hindsight.

Is it already considered how to assign the recap.email for existing users?

It's considered and completed (by hand). I just closed the issue. :)

OK, I think that's it. Back to you Alberto!

albertisfu commented 2 years ago

Perfect, thank you for your answers @mlissner

So, about the one-time link and PDF download, just to confirm: I assume that a link like https://ecf.ared.uscourts.gov/doc1/02715212035?caseid=120574&de_seq_num=83&magic_num=99963705 would open a page with an IFRAME like the one described in https://github.com/freelawproject/courtlistener/issues/1708#issuecomment-915082412, and that this IFRAME shows the PDF that needs to be downloaded. Is that right?
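To make the parts of such a link concrete, here is a minimal sketch that splits the example URL into its components. The helper name `parse_magic_link` is hypothetical and is not part of CourtListener or Juriscraper; it only assumes the URL shape shown above.

```python
from urllib.parse import urlparse, parse_qs


def parse_magic_link(url):
    """Split a one-time PACER document link into its parts.

    Hypothetical helper for illustration only. Assumes the path ends in
    the document ID and the query string may carry caseid, de_seq_num,
    and magic_num.
    """
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    # The path looks like /doc1/<document id>; take the last segment.
    pacer_doc_id = parsed.path.rstrip("/").split("/")[-1]
    return {
        "host": parsed.netloc,
        "pacer_doc_id": pacer_doc_id,
        # parse_qs returns lists; take the first value or None if absent.
        "caseid": query.get("caseid", [None])[0],
        "de_seq_num": query.get("de_seq_num", [None])[0],
        "magic_num": query.get("magic_num", [None])[0],
    }
```

A link without a `magic_num` parameter yields `None` for that key, which could be used to tell ordinary document links from one-time ones.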

mlissner commented 2 years ago

...The first task...

Before you start building more functionality into it, I wonder if you'd benefit from testing what's there and making sure it works. The stuff that has landed so far came via a number of small(ish) PRs, and hasn't really been tested very well.


About the attached file, I’m not familiar with the average size of a PACER PDF, so what file size threshold do you think would be good to decide if attaching the file or not?

Yeah, this is a good question. I guess I'd say if it's greater than 750kb we should do it as a link and if smaller, then do it as an attachment? That should separate scanned docs so they don't get attached, but I mostly pulled that number out of my hat.

So if a PDF file size is over the threshold we define I assume instead of attaching the file we would put the S3 link to be downloaded, is that right?

Exactly.
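The attach-or-link rule discussed above could be sketched roughly as follows. The constant and function names are hypothetical, and reading "750kb" as 750 * 1024 bytes is an assumption, as is treating a file of exactly that size as an attachment.

```python
# Assumed interpretation of the "750kb" figure from the discussion.
ATTACHMENT_THRESHOLD_BYTES = 750 * 1024


def delivery_mode(pdf_size_bytes, threshold=ATTACHMENT_THRESHOLD_BYTES):
    """Return "link" for PDFs above the threshold, else "attachment"."""
    if pdf_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    # Larger files (typically scanned documents) are sent as an S3 link.
    return "link" if pdf_size_bytes > threshold else "attachment"
```

Keeping the threshold as a parameter makes it easy to tune later, since the number was admittedly a first guess.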

The fourth task I’ll be working on is the recap page

That sounds good. We can get into that later.

albertisfu commented 2 years ago

@mlissner, as we were discussing, django-ses provides webhook events for bounces, complaints, sends, deliveries, opens, and clicks.

To receive these notifications, it's necessary to use Amazon SNS in addition to Amazon SES.

There are two types of bounce email:

mlissner commented 2 years ago

Should we handle these notifications in CourtListener and stop sending recap.email notifications to problematic addresses (those that bounced or complained), so that we don't hurt recap.email's sender reputation on SES?

Probably, yes.

If so, should we ban those problematic email addresses after one event (bounce or complaint) or should we set a threshold?

There's probably a smart way to do this, since email hosts are not always the most reliable and networks sometimes fail. Probably we focus only on hard bounces since the soft ones seem to be handled by SES.

We wouldn't want to ban somebody forever if their host was offline for 24 hours. Maybe we do exponential backoff sending? The first time it hard bounces, we don't send another for five minutes, then 10, then 20, then 40, 80, 160, 320, etc. If any goes through, we reset it to send everything? There might be more thoughtful ways of doing this and implementing it might be a pain. It's probably also worth showing a banner to the person if they're logged in that says they've got a problem that needs to be fixed?

johnhawkinson commented 2 years ago

There's probably a smart way to do this, since email hosts are not always the most reliable and networks sometimes fail. Probably we focus only on hard bounces since the soft ones seem to be handled by SES.

We wouldn't want to ban somebody forever if their host was offline for 24 hours. Maybe we do exponential backoff sending? The first time it hard bounces, we don't send another for five minutes, then 10, then 20, then 40, 80, 160, 320, etc. If any goes through, we reset it to send everything? There might be more thoughtful ways of doing this and implementing it might be a pain.

I don't think any of this makes sense? The email subsystem (here Amazon SES) keeps track of backoff and retries and redelivery and all that. Just let Amazon worry about all that stuff.

It's probably also worth showing a banner to the person if they're logged in that says they've got a problem that needs to be fixed?

Sure, definitely a good thing to do.

mlissner commented 2 years ago

I don't think any of this makes sense? The email subsystem (here Amazon SES) keeps track of backoff and retries and redelivery and all that. Just let Amazon worry about all that stuff.

The idea is to let SES do the backoff on softfails, but when it reaches a hardfail, then what? Do you keep sending emails to that address, or do you give up or what? If you give up, for how long?

johnhawkinson commented 2 years ago

I'm sorry, I misunderstood the proposal to be for us to handle the soft failure in some way.

As for hard failures, I have all too often (okay, once is enough) been the victim of multi-day email system outages beyond my control (e.g. over a long weekend) and been truly annoyed at the mailing lists and services that silently unsubscribed me as a result. The reality is that permanent failures are not always "permanent."

I think this really depends on Amazon's policies. In a perfect world, we would ignore hard failures and carry on. But if Amazon construes that as bad behavior and punishes us for it, then we should apply some mitigation.

mlissner commented 2 years ago

Yeah, so I think that lends itself towards something like an exponential backoff on hard fails, with a max of something like five or maybe ten days. I think any email outage greater than that would be such a crisis that us not sending things would be the least of their worries.
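A rough sketch of that capped backoff, assuming the five-minute starting delay and doubling schedule floated earlier in the thread and a five-day ceiling. All names here are hypothetical, and the cap is only one of the values suggested above.

```python
from datetime import timedelta

INITIAL_DELAY = timedelta(minutes=5)
MAX_DELAY = timedelta(days=5)  # the thread suggests five or maybe ten days
MAX_EXPONENT = 20  # keeps the multiplication well within timedelta's range


def backoff_delay(hard_bounce_count):
    """Wait time before the next send after N consecutive hard bounces."""
    if hard_bounce_count <= 0:
        # No outstanding hard bounces: send immediately.
        return timedelta(0)
    exponent = min(hard_bounce_count - 1, MAX_EXPONENT)
    # 5 min, 10 min, 20 min, ... doubling with each further hard bounce.
    delay = INITIAL_DELAY * (2 ** exponent)
    return min(delay, MAX_DELAY)
```

Resetting the counter to zero after any successful delivery would give the "if any goes through, we reset it" behavior described earlier.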

It's not just about Amazon's policies around this either. If we're sending a lot of hard fails to Gmail, they may be more likely to score us as spam as a result, regardless of AWS's situation.

albertisfu commented 2 years ago

I've been playing with django-ses and reading the documentation about SES notifications through AWS SNS. There are Django signals for receiving SNS notifications for events like Bounce, Complaint, Send, Delivery, Open, and Click.

There are three types of bounces (Undetermined, Permanent, and Transient), each with sub-types: https://docs.aws.amazon.com/ses/latest/dg/notification-contents.html#bounce-object

Bounce notifications are for hard bounces (Permanent), and for soft bounces that SES stopped trying to deliver (Undetermined or Transient).
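As a minimal sketch of that distinction, the function below maps a parsed notification body to "hard" or "soft". The field names (`bounce`, `bounceType`, `bouncedRecipients`, `emailAddress`) follow the SES notification contents documentation linked above; the function names and the "hard"/"soft" labels are hypothetical, and the input is assumed to be the already-decoded JSON message rather than the raw SNS envelope.

```python
def classify_bounce(notification):
    """Map an SES bounce notification to "hard", "soft", or None."""
    bounce = notification.get("bounce")
    if not bounce:
        # Not a bounce notification (e.g. a delivery or complaint).
        return None
    bounce_type = bounce.get("bounceType")
    if bounce_type == "Permanent":
        return "hard"
    if bounce_type in ("Transient", "Undetermined"):
        return "soft"
    return None


def bounced_addresses(notification):
    """List the recipient addresses named in a bounce notification."""
    recipients = notification.get("bounce", {}).get("bouncedRecipients", [])
    return [r["emailAddress"] for r in recipients]
```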

Hard Bounces

Soft Bounces

It's a best practice that you maintain a bounce rate under 5% and a complaint rate under 0.1%.
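Those two thresholds can be checked with simple arithmetic. The sketch below is only illustrative: the names are hypothetical, and the counts would have to come from whatever sending statistics are available.

```python
MAX_BOUNCE_RATE = 0.05      # 5%, per the guidance quoted above
MAX_COMPLAINT_RATE = 0.001  # 0.1%, per the guidance quoted above


def reputation_ok(sent, bounces, complaints):
    """Return True if both rates are under the recommended limits."""
    if sent <= 0:
        # Nothing sent yet, so there is no rate to violate.
        return True
    bounce_rate = bounces / sent
    complaint_rate = complaints / sent
    return bounce_rate < MAX_BOUNCE_RATE and complaint_rate < MAX_COMPLAINT_RATE
```

For example, 60 bounces out of 1,000 sent emails is a 6% bounce rate, which is over the 5% limit.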

The SES console has a suppression list. By default, SES adds all addresses with Permanent bounces or complaints to it, but it's possible to change this configuration to include only bounced addresses or only complaints. It's also possible to deactivate the suppression list, but I think it could be a good second layer of protection against sending email to addresses with 'Permanent' bounces or complaints.

Taking the above into consideration, I think we should consider:

For bounces that are not hard bounces (Undetermined and Transient), we might define how to handle each type:

mlissner commented 2 years ago

This is a great summary, thanks Alberto. Let's talk tomorrow. I think my reaction is to mostly trust AWS's guidance rather than second-guess it.

mlissner commented 2 years ago

Topic above continued in: https://github.com/freelawproject/courtlistener/issues/1942. I'm going to close this issue because it's hitting too many topics and we can split them off in their own more focused threads.