OCHA-DAP / hdx-signals

HDX Signals
https://un-ocha-centre-for-humanitarian.gitbook.io/hdx-signals/
GNU General Public License v3.0

Setup audience analysis #253

Closed by caldwellst 12 hours ago

caldwellst commented 2 weeks ago

Code used to do the audience analysis. I think you can just get rid of the plotting work (maybe just save it in the Google Drive for posterity). Then work out what the generated data frames are and contain, and save those to the Azure cloud in a new folder, maybe something like audience. Shouldn't be very hard, and then all you need is to set up a weekly analysis on Mondays or something.
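
For reference, a minimal sketch of what "save those to the Azure cloud in a new audience folder" could look like. The project itself is largely R, so this Python snippet is illustrative only; the container name, environment variable, and function name are assumptions, not the repo's actual configuration.

```python
# Hypothetical sketch: upload an audience data frame as CSV under an
# "audience/" prefix in Azure Blob Storage. Container name and env var
# are placeholders, not the project's real settings.
import io
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient


def upload_audience_frame(df: pd.DataFrame, name: str) -> None:
    """Write `df` as CSV to audience/<name>.csv in blob storage."""
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # assumed env var
    )
    blob = service.get_blob_client(
        container="hdx-signals",  # placeholder container name
        blob=f"audience/{name}.csv",
    )
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    blob.upload_blob(buffer.getvalue(), overwrite=True)
```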

zackarno commented 2 weeks ago

I'll just take this branch on. I may leave some comments on the code from Seth, mainly just as notes for myself.

zackarno commented 2 weeks ago

@martinig94 -- I added a GHA here: https://github.com/OCHA-DAP/hdx-signals/actions/workflows/user_audience_analysis.yml.

Remaining decisions

martinig94 commented 1 week ago

Hey @zackarno, apologies for the late reply, I missed your comment. Yes, I agree with creating an ad hoc folder for analyses that need to be run once in a while, without removing the file from the repo.

zackarno commented 1 week ago

> Hey @zackarno, apologies for the late reply, I missed your comment. Yes, I agree with creating an ad hoc folder for analyses that need to be run once in a while, without removing the file from the repo.

No worries, I've added the folder and changed the PR from draft to "ready for review".

zackarno commented 1 week ago

I was thinking about how best to log changes to the database. I was going to just read in the new data set and compare it to the old one, appending anything in the new data that isn't in the old. However, I noticed an example where a user changed their iso2, and I'm not sure how that should be reflected. Since we don't really know how this data will be used, I think we should just write out a new file (CSV) once per week, and from those we can always retroactively merge once we understand exactly what's needed?
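
To make the comparison concrete, here is a rough sketch of that "read in the new data set and compare to old" idea in pandas; the file names and the subscriber_id and iso2 columns are assumptions about the audience export, not its actual schema.

```python
# Hypothetical sketch: diff the weekly audience snapshot against the previous one.
import pandas as pd

old = pd.read_csv("audience_old.csv")  # placeholder file names
new = pd.read_csv("audience_new.csv")

# New subscribers: ids present in the new export but not the old one.
added = new[~new["subscriber_id"].isin(old["subscriber_id"])]

# Changed subscribers (e.g. an updated iso2): a full-row anti-join, since rows
# identical in both exports drop out, leaving only rows whose values differ.
merged = new.merge(old, how="left", indicator=True)
changed = merged[
    (merged["_merge"] == "left_only")
    & merged["subscriber_id"].isin(old["subscriber_id"])
].drop(columns="_merge")
```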

martinig94 commented 1 week ago

> I was thinking about how best to log changes to the database. I was going to just read in the new data set and compare it to the old one, appending anything in the new data that isn't in the old. However, I noticed an example where a user changed their iso2, and I'm not sure how that should be reflected. Since we don't really know how this data will be used, I think we should just write out a new file (CSV) once per week, and from those we can always retroactively merge once we understand exactly what's needed?

Hey Zack, yes, that's definitely an option. Otherwise, you could append to the dataframe only the rows that changed from the previous version, adding an extraction_date column, so everything is already merged in one unique file and the dataset can be filtered as the need arises without creating too many new rows at every iteration. You would have exactly the same rows if there were no changes between two weeks, and one row more if user X modified one of their interests, with the associated date being the date the script ran. I think I would prefer this option to having a new file generated every week, but both options work for me!
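
A sketch of that single-file option, again with assumed file and column names: only rows that differ from the latest known state for a subscriber are appended, stamped with the date the script ran.

```python
# Hypothetical sketch: append only changed rows to one cumulative CSV,
# tagged with an extraction_date column.
from datetime import date

import pandas as pd

log = pd.read_csv("audience_log.csv")  # cumulative file, one row per change
latest = (
    log.sort_values("extraction_date")
    .drop_duplicates("subscriber_id", keep="last")
)
new = pd.read_csv("audience_new.csv")  # this week's export

# Compare on every column in the new export (extraction_date is not in it),
# so unchanged rows match and drop out; new or modified rows remain.
merged = new.merge(latest[list(new.columns)], how="left", indicator=True)
changed = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
changed["extraction_date"] = date.today().isoformat()

pd.concat([log, changed]).to_csv("audience_log.csv", index=False)
```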