OSSPhilippines / personal-well-being-dashboard

MIT License

Sai/gmail extraction #16

Closed sairilseb-me closed 4 months ago

sairilseb-me commented 4 months ago

Created a Gmail extractor.

philgerardsoto commented 4 months ago

Hi @sairilseb-me, thanks for starting on this task. I was able to run your simple_gmail_sai.py file, and it works! I attached a screenshot of my result.

[Screenshot: Image 5-19-24 at 8 25 PM]

Some suggestions:

  1. In gmail_downloader.py under the simplegmail folder, we have a process_messages function. I suggest we improve this by incorporating your code regarding parameter filtering.

  2. Also in gmail_downloader.py, the message_data dictionary contains just a few of the fields available via the API:

id
sender
recipient
subject
plain
html
date

I suggest we include all available fields since we're using an ELT approach. We can print all available attributes of the message object, and include all of these fields in the message_data dictionary.

  3. In gmail_downloader_sample.py in the root folder, there is a function that generates monthly date ranges, which can serve as an additional filter in your parameter filtering. Please feel free to incorporate this either in gmail_downloader.py under the simplegmail folder or in a file in the root folder.

  4. It would also be good to check the parameter filtering based on domains and keywords. I was able to get hello@gcash.com emails but not no-reply@gcash.com emails.
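For suggestion 2, one way to capture every field instead of a hand-picked subset is to build message_data from the object's own attributes. This is only a minimal sketch, assuming the message object stores its fields as plain instance attributes; `FakeMessage` and `message_to_dict` are hypothetical names used for illustration and are not part of simplegmail or this repository.

```python
from typing import Any, Dict


class FakeMessage:
    """Hypothetical stand-in for the message object returned by the API wrapper."""

    def __init__(self) -> None:
        self.id = "abc123"
        self.sender = "hello@gcash.com"
        self.subject = "Receipt"
        self.label_ids = ["INBOX"]
        self._service = object()  # private handle, not useful as data


def message_to_dict(message: Any) -> Dict[str, Any]:
    """Collect every public, non-callable instance attribute into a dict."""
    return {
        name: value
        for name, value in vars(message).items()
        if not name.startswith("_") and not callable(value)
    }


if __name__ == "__main__":
    # Print every discovered field so we can see what is available.
    message_data = message_to_dict(FakeMessage())
    for field, value in message_data.items():
        print(f"{field}: {value!r}")
```

Nested objects (attachments, for example) would probably still need their own conversion before the dictionary can be serialized.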

Note for observers of this PR: you have to add an 'attachments' folder in the root directory for gmail_downloader.py to work.

Kindly reach out if you have any questions or other suggested approaches. Thank you @sairilseb-me!
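The manual step could also be handled in code. A small sketch, assuming the script is free to create the folder itself; `ensure_attachments_dir` is a hypothetical helper, not something that exists in gmail_downloader.py.

```python
from pathlib import Path


def ensure_attachments_dir(root: Path) -> Path:
    """Create <root>/attachments if it is missing and return its path."""
    attachments = root / "attachments"
    # exist_ok=True makes repeated runs harmless.
    attachments.mkdir(parents=True, exist_ok=True)
    return attachments


if __name__ == "__main__":
    # Assumes the script is started from the repository root.
    print(ensure_attachments_dir(Path.cwd()))
```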

sairilseb-me commented 4 months ago

Suggestion 1 done.

commit # f5f079a

sairilseb-me commented 4 months ago

Updates

philgerardsoto commented 4 months ago

Hi @sairilseb-me, thank you. I mentioned this in our private chat, but I'm commenting here as well for visibility:

When I run the code locally, I still don't get the Unionbank emails. I think we should edit the query condition based on https://support.google.com/mail/answer/7190

Also, I suggest we don't edit gmail.py, and put the revisions in the other files instead. gmail.py is based on the simplegmail repository, so we want to keep it untouched as much as possible.
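As a rough illustration of a query condition built from the search operators on that page (`from:`, `subject:`, `{ }` for OR, and `after:`/`before:` dates), here is a sketch that also folds in the monthly date ranges from the earlier suggestion. `month_ranges` and `build_query` are hypothetical helpers written for this example, not the functions in gmail_downloader_sample.py, and the sketch assumes single-word keywords.

```python
from datetime import date
from typing import Iterator, List, Tuple


def month_ranges(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (first_of_month, first_of_next_month) pairs covering start..end."""
    current = date(start.year, start.month, 1)
    while current <= end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        yield current, following
        current = following


def build_query(senders: List[str], keywords: List[str],
                after: date, before: date) -> str:
    """Match any sender or subject keyword, restricted to a date window."""
    terms = [f"from:{s}" for s in senders] + [f"subject:{k}" for k in keywords]
    # Braces group the terms as alternatives (OR) in Gmail search.
    or_group = "{" + " ".join(terms) + "}"
    return f"{or_group} after:{after:%Y/%m/%d} before:{before:%Y/%m/%d}"


if __name__ == "__main__":
    senders = ["hello@gcash.com", "no-reply@gcash.com"]
    for first, following in month_ranges(date(2024, 1, 10), date(2024, 3, 5)):
        print(build_query(senders, ["receipt"], first, following))
```

Running one query per month keeps each result set small, at the cost of more API calls.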

Thank you!

philgerardsoto commented 4 months ago

Also, a code simplification suggested by Copilot:

query = ' OR '.join([f"from: {domain}" for domain in domains] + [f"subject: {keyword}" for keyword in keywords])

philgerardsoto commented 4 months ago

Nice work @sairilseb-me! I'm now able to get both GCash and Unionbank emails.

I pushed two minor commits, and will merge this now.

The next related task is for us to use Airflow and/or dlt.