Itguru14 / tag-dssg-2023-lbc

MIT License

[FEATURE] Automated data pipeline: Google Drive --> BigQuery #1

Open bbrewington opened 1 year ago

bbrewington commented 1 year ago

Build a code-driven data pipeline that takes data from Google Drive (a mix of Google Sheets, Excel files, and folders containing those) and lands it in the BigQuery dataset tag-dssg-2023-lbc-all-teams.data_raw with all columns as STRING type

Once this is done, the follow-on story #2 can be started

For access to BigQuery, contact @bbrewington (TAG DSSG Slack or Email is fine)
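The landing target described above (everything in data_raw, all columns STRING) can be sketched with the google-cloud-bigquery client. The helper names here are illustrative, not from the repo; the project/dataset come from the issue text, and credentials are assumed to be configured:

```python
import os
import re


def table_name_from_filename(filename):
    """Turn a Drive file name into a BigQuery-safe table name
    (lowercase, alphanumerics and underscores only)."""
    stem, _ext = os.path.splitext(filename)
    return re.sub(r"[^a-zA-Z0-9]+", "_", stem).strip("_").lower()


def load_all_string(df, table_name, project="tag-dssg-2023-lbc-all-teams",
                    dataset="data_raw"):
    """Load a pandas DataFrame into BigQuery with every column typed STRING.
    Requires google-cloud-bigquery and application-default credentials."""
    from google.cloud import bigquery  # imported here so the helper above stays importable

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField(col, "STRING") for col in df.columns],
        write_disposition="WRITE_TRUNCATE",
    )
    table_id = f"{project}.{dataset}.{table_name}"
    # Cast everything to str so pandas dtypes can't override the schema
    client.load_table_from_dataframe(
        df.astype(str), table_id, job_config=job_config
    ).result()
```

Deriving the table name from the source file name keeps data_raw self-describing, but the exact naming convention is a guess.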

Itguru14 commented 1 year ago

I'm working on the pipeline that takes data directly from Google Drive to BigQuery as efficiently as possible. Currently we have separate Python scripts that move data from Google Drive to local disk, then to GCS, and finally to BigQuery. I'm trying to eliminate the need to transfer to GCS first before landing in BigQuery.
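One way to cut out the GCS hop, assuming the data already fits in memory as a pandas DataFrame: the BigQuery client can load a DataFrame directly, so no staging bucket is needed. The string_schema helper and names below are illustrative, not the repo's code:

```python
def string_schema(columns):
    """All-STRING schema as (name, type) pairs; pure so it's easy to test."""
    return [(str(col), "STRING") for col in columns]


def load_df_direct(df, table_id):
    """Load a DataFrame straight into BigQuery -- no local file, no GCS stage.
    Assumes google-cloud-bigquery is installed and credentials are configured."""
    from google.cloud import bigquery  # local import keeps string_schema importable

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        schema=[bigquery.SchemaField(n, t) for n, t in string_schema(df.columns)],
    )
    job = client.load_table_from_dataframe(
        df.astype(str), table_id, job_config=job_config
    )
    job.result()  # blocks until the load job finishes
```

load_table_from_dataframe serializes the frame and ships it in the load job itself, which is exactly the "skip GCS" behavior described above.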

bbrewington commented 1 year ago

@Itguru14 made some updates, and the `if __name__ == '__main__'` part of datapipelines/google_drive.py still needs polish

I decided to use the pydrive library because it's one of the better ones I found (man, Google does NOT make this easy). I'm planning on looping through the files in the folder (those that are CSV, Excel, or Google Sheets), reading each file's contents into a pandas DataFrame, then writing it (with all columns as STRING) to BigQuery

If you want to use this approach, feel free to pick up in the section I commented out

For easy reference, here's the commit w/ what I just pushed: https://github.com/Itguru14/tag-dssg-2023-lbc/commit/1fbaba1cab43382d90ef3af393e038e4d292481b
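The loop described in that comment might look like the sketch below. This assumes pydrive and pandas; the folder ID, the sync_folder/classify helper names, and the Sheets-export-as-CSV path are my assumptions, not the committed code:

```python
# Which Drive MIME types the pipeline knows how to read, and how.
READERS = {
    "text/csv": "csv",
    "application/vnd.ms-excel": "excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "excel",
    "application/vnd.google-apps.spreadsheet": "gsheet",
}


def classify(mime_type):
    """Return 'csv', 'excel', 'gsheet', or None for unsupported files."""
    return READERS.get(mime_type)


def sync_folder(folder_id):
    """List a Drive folder with pydrive and read each supported file into a
    pandas DataFrame, ready for an all-STRING BigQuery load (not shown)."""
    import pandas as pd
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()  # opens a browser for OAuth on first run
    drive = GoogleDrive(gauth)

    query = f"'{folder_id}' in parents and trashed=false"
    for f in drive.ListFile({"q": query}).GetList():
        kind = classify(f["mimeType"])
        if kind is None:
            continue  # skip images, PDFs, etc.
        if kind == "gsheet":
            # Native Sheets have no bytes to download; export them as CSV
            f.GetContentFile(f["title"] + ".csv", mimetype="text/csv")
            df = pd.read_csv(f["title"] + ".csv", dtype=str)
        elif kind == "csv":
            f.GetContentFile(f["title"])
            df = pd.read_csv(f["title"], dtype=str)
        else:
            f.GetContentFile(f["title"])
            df = pd.read_excel(f["title"], dtype=str)
        # ... hand df off to the BigQuery loader here
```

Reading with dtype=str up front keeps everything STRING end to end, which matches the data_raw schema the issue calls for.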

Itguru14 commented 1 year ago

OK, will do. Just do whatever you can and I'll pick up the rest later tonight.
