ProjectSidewalk / sidewalk-quality-analysis

An analysis of Project Sidewalk user quality based on interaction logs

Update .csv data ingestion pipeline to accommodate both the city file and interaction file #61

Closed jonfroehlich closed 2 years ago

jonfroehlich commented 2 years ago

Ask @misaugstad why we have two separate data files for every city? Both files have a column for a user id, so shouldn't they be combined?


misaugstad commented 2 years ago

Mostly just because it's a little bit easier and less error prone on my end when running the queries. And if I make a mistake while getting the data together, I only make a mistake in one place instead of needing to start over with all of them. And I figure it isn't difficult to automate in the code.

jonfroehlich commented 2 years ago

Ok, I will write a join on my side then... feels cleaner to have everything in one output file, but I get your constraints!

misaugstad commented 2 years ago

@jonfroehlich I actually ended up writing a bash script yesterday that will run a query in every city and (optionally) merge the contents into a single file (with the city as a column in the CSVs). I run queries for every city frequently enough that writing a bash script to make it easier (and less error prone) seemed like a good use of time. And I figured that I could add some bells and whistles (like merging output into one file) while I was at it. Hopefully I can upload fresh data today.
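For readers following along, the merge step described above can be sketched in Python. This is not the bash script itself, only a minimal illustration of tagging each row with its city before concatenating. The city names (`city_a`, `city_b`) and the columns `user_id` and `n_labels` are hypothetical, and the sketch assumes every per-city file has the same columns.

```python
import csv
import io


def merge_city_csvs(csv_text_by_city):
    """Merge per-city CSV text into one list of rows, adding a 'city' column."""
    merged = []
    for city, text in csv_text_by_city.items():
        for row in csv.DictReader(io.StringIO(text)):
            # Put the city first so it becomes the leading column on output.
            merged.append({"city": city, **row})
    return merged


def rows_to_csv(rows):
    """Serialize a list of dict rows (all with the same keys) back to CSV text."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical per-city query output (column names are assumptions).
per_city = {
    "city_a": "user_id,n_labels\nu1,10\n",
    "city_b": "user_id,n_labels\nu2,4\n",
}
merged_rows = merge_city_csvs(per_city)
merged_csv = rows_to_csv(merged_rows)
```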

jonfroehlich commented 2 years ago

Nice. Thanks!

misaugstad commented 2 years ago

@jonfroehlich this has now been done!

jonfroehlich commented 2 years ago

Oops, I think we misunderstood each other on this one. I didn't have a problem with individual files for each city but questioned why each city had two files: i.e., one city-stats.csv file and one city-interaction-stavs.csv file.

I'm fine with keeping files separated per city if that's easiest on your end (indeed, this is what my framework currently expects, so it's barfing now on the new format, which amalgamates everything into two single files with a city column).
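If the combined format were kept, one way a per-city framework could cope is to regroup the rows by the city column at load time. The following is a rough sketch under that assumption. The column name `city` comes from the description above, while `user_id`, `n_labels`, and the city values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_by_city(combined_csv_text, city_column="city"):
    """Group rows of a combined CSV by city, dropping the city column from each row."""
    rows_by_city = defaultdict(list)
    for row in csv.DictReader(io.StringIO(combined_csv_text)):
        city = row.pop(city_column)
        rows_by_city[city].append(row)
    return dict(rows_by_city)


# Hypothetical combined file contents.
combined = "city,user_id,n_labels\ncity_a,u1,10\ncity_b,u2,4\ncity_a,u3,7\n"
per_city_rows = split_by_city(combined)
```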

I can also update my parsers to handle one city-stats.csv file and one city-interaction-stavs.csv file (I have not yet written my file ingestion pipeline to work with the interaction file, so no code is lost in figuring this out now). Just let me know.

misaugstad commented 2 years ago

Wow, sorry, I really just misread what you had said, multiple times. My bad!! Yes, I keep the query that uses the "interactions" tables separate from the query that does everything else, and I think it's best that it stays that way and that you write the join to combine them yourself.

The reason that they're separated is that any time I write a query that touches the interactions tables, it ends up taking just as long as (if not longer than) the other query, which returns something like 100 columns. Keeping them separate makes it A LOT easier to test out new columns when I add them, to debug, etc.
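The join discussed here could look roughly like the sketch below, which combines the rows of the two files for one city on their shared user id column. The key name `user_id` and the columns `n_labels` and `n_clicks` are assumptions, not the real headers. The sketch also assumes a left join, so users with no interaction row keep only their stats columns.

```python
import csv
import io


def join_on_user_id(stats_csv_text, interactions_csv_text, key="user_id"):
    """Left-join interaction rows onto stats rows using a shared user id column."""
    # Index the interaction rows by user id for constant-time lookup.
    interactions = {
        row[key]: row
        for row in csv.DictReader(io.StringIO(interactions_csv_text))
    }
    joined = []
    for stats_row in csv.DictReader(io.StringIO(stats_csv_text)):
        # Users missing from the interaction file keep only their stats columns.
        extra = interactions.get(stats_row[key], {})
        joined.append({**stats_row, **extra})
    return joined


# Hypothetical contents of the two per-city files.
stats_text = "user_id,n_labels\nu1,10\nu2,4\n"
interactions_text = "user_id,n_clicks\nu1,250\n"
joined_rows = join_on_user_id(stats_text, interactions_text)
```

A pandas `merge` on the same key would do the same job with less code if the pipeline already loads these files as DataFrames.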

jonfroehlich commented 2 years ago

No worries. Makes sense. Thanks!