Closed sairilseb-me closed 4 months ago
Hi @sairilseb-me, thanks for starting on this task. I was able to run your simple_gmail_sai.py file, and it works! I attached a screenshot of my result.
Some suggestions:
In gmail_downloader.py under the simplegmail folder, we have a process_messages function. I suggest we improve this by incorporating your parameter-filtering code.
Also in gmail_downloader.py, the message_data dictionary contains just a few of the fields available via the API:
id
sender
recipient
subject
plain
html
date
I suggest we include all available fields since we're using an ELT approach. We can print all available attributes of the message object and include all of these fields in the message_data dictionary.
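One way to do this without listing fields by hand is sketched below. It assumes the message object stores its fields as ordinary instance attributes, so `vars()` can see them. The helper name `message_to_dict` and the stand-in `sample` object are hypothetical and not part of the repository.

```python
from types import SimpleNamespace


def message_to_dict(message):
    """Copy every public instance attribute of `message` into a plain dict."""
    return {
        name: value
        for name, value in vars(message).items()
        if not name.startswith("_")  # skip private/internal attributes
    }


# Stand-in object for illustration only; a real message would come from the API client.
sample = SimpleNamespace(
    id="abc123",
    sender="hello@gcash.com",
    subject="Receipt",
    _service=None,
)

message_data = message_to_dict(sample)
print(sorted(message_data))  # lists the attribute names that were captured
```

Some values (attachments, for example) may not be directly serializable, so they might need separate handling before loading.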
In gmail_downloader_sample.py in the root folder, there is a function to generate monthly date ranges that can serve as an additional filter in your parameter filtering. Please feel free to incorporate this either in gmail_downloader.py under the simplegmail folder or in a file in the root folder.
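For reference, a monthly range generator could look roughly like the sketch below. This is not the function from gmail_downloader_sample.py; the name `monthly_date_ranges` and the choice of an exclusive end date are assumptions made for illustration.

```python
from datetime import date


def monthly_date_ranges(start, end):
    """Yield (range_start, range_end) pairs covering [start, end), split at month boundaries."""
    current = start
    while current < end:
        # Compute the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clamp the last range so it never runs past `end`.
        range_end = min(next_month, end)
        yield current, range_end
        current = range_end


for range_start, range_end in monthly_date_ranges(date(2023, 11, 15), date(2024, 2, 1)):
    print(range_start.isoformat(), range_end.isoformat())
```

Each pair can then be passed along as the start and end of one filtered download, which keeps individual queries small.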
It would also be good to check the parameter filtering based on domains and keywords. I was able to get hello@gcash.com emails but not no-reply@gcash.com emails.
Note for observers of this PR: you have to add an 'attachments' folder in the root directory for gmail_downloader.py to work.
Kindly reach out with any questions or other suggested approaches. Thank you @sairilseb-me!
Suggestion 1 done. save_attachments is commented out.
commit # f5f079a
gmail.py now has additional parameters start_date and end_date, which are concatenated to the query.
Hi @sairilseb-me thank you. I mentioned this in our private chat but commenting here as well for visibility:
When I run the code locally, I still don't get the Unionbank emails. I think we should edit the query condition based on https://support.google.com/mail/answer/7190
Also, I suggest we don't edit gmail.py, and put the revision on the other files. gmail.py is based on the simplegmail repository so we want to keep it untouched as much as possible.
Thank you!
Also, code simplification suggestion by Copilot:
query = ' OR '.join([f"from: {domain}" for domain in domains] + [f"subject: {keyword}" for keyword in keywords])
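A possible way to combine this with the date filter while leaving gmail.py untouched is sketched below. The helper `build_query` is hypothetical and would live in one of our own files. It assumes single-word keywords, writes operators without a space after the colon (the form shown on the Gmail search operators help page linked above), and uses braces for OR grouping plus `after:`/`before:` for dates. The domain `example.com` is a placeholder.

```python
from datetime import date


def build_query(domains, keywords, start_date=None, end_date=None):
    """Build a Gmail search string: any sender domain OR any subject keyword, within a date window."""
    terms = [f"from:{domain}" for domain in domains]
    terms += [f"subject:{keyword}" for keyword in keywords]

    parts = []
    if terms:
        # In Gmail search, terms inside braces are combined with OR.
        parts.append("{" + " ".join(terms) + "}")
    if start_date is not None:
        parts.append(f"after:{start_date:%Y/%m/%d}")
    if end_date is not None:
        parts.append(f"before:{end_date:%Y/%m/%d}")
    # Top-level parts are separated by spaces, which Gmail treats as AND.
    return " ".join(parts)


query = build_query(
    ["gcash.com", "example.com"],
    ["receipt"],
    start_date=date(2024, 1, 1),
    end_date=date(2024, 2, 1),
)
print(query)
```

Grouping the sender and subject terms keeps the date window applied to all of them, rather than only to the last OR term.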
Nice work @sairilseb-me! I'm now able to get both GCash and Unionbank emails.
I pushed two minor commits, and will merge this now.
The next related task is for us to use Airflow and/or dlt.
Created a gmail extractor