jonfroehlich closed this issue 2 years ago
Mostly just because it's a little bit easier and less error prone on my end when running the queries. And if I make a mistake while getting the data together, I only make a mistake in one place instead of needing to start over with all of them. And I figure it isn't difficult to automate in the code.
Ok, I will write a join on my side then... feels cleaner to have everything on one output file, but I get your constraints!
@jonfroehlich I actually ended up writing a bash script yesterday that will run a query in every city and (optionally) merge the contents into a single file (with the city as a column in the CSVs). I run queries for every city frequently enough that writing a bash script to make it easier (and less error prone) seemed like a good use of time. And I figured that I could add some bells and whistles (like merging output into one file) while I was at it. Hopefully I can upload fresh data today.
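For readers following along, here is a minimal sketch of the merge step described above: per-city CSV output combined into one file with the city as the first column. It is not the bash script mentioned in the thread (which isn't shown here). The function name `merge_city_csvs`, the column name `city`, and the sample columns are assumptions made for illustration, and the sketch assumes every city's CSV has the same header.

```python
import csv
import io


def merge_city_csvs(city_files, out):
    """Merge per-city CSVs into one CSV with a leading 'city' column.

    city_files: dict mapping a city name to an open text file (or file-like
    object) containing that city's CSV. Assumes all CSVs share one header.
    out: a writable text file (or file-like object) for the merged CSV.
    """
    writer = None
    for city, f in city_files.items():
        reader = csv.DictReader(f)
        for row in reader:
            if writer is None:
                # Build the output header from the first CSV that has rows.
                fieldnames = ["city"] + list(reader.fieldnames)
                writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
                writer.writeheader()
            # Tag each row with the city it came from.
            writer.writerow({"city": city, **row})


if __name__ == "__main__":
    # Hypothetical sample data; real files would be opened with open(path, newline="").
    seattle = io.StringIO("user_id,labels\nu1,10\n")
    chicago = io.StringIO("user_id,labels\nu2,5\n")
    merged = io.StringIO()
    merge_city_csvs({"seattle": seattle, "chicago": chicago}, merged)
    print(merged.getvalue())
```

If a city's CSV has a column the first one lacks, `csv.DictWriter` raises a `ValueError`, which is arguably the desired behavior when the outputs are supposed to come from the same query.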
Nice. Thanks!
@jonfroehlich this has now been done!
Oops, I think we misunderstood each other on this one. I didn't have a problem with individual files for each city but questioned why each city had two files: i.e., one city-stats.csv file and one city-interaction-stavs.csv file.
I'm fine with keeping files separated per city if that's easiest on your end (indeed, this is what my framework currently expects, so it's barfing now on the new format, which amalgamates everything into two single files with a city column).
I can also update my parsers to handle one city-stats.csv file and one city-interaction-stavs.csv file (I have not yet written my file ingestor pipeline to work with the interaction file, so no code is lost in figuring this out now). Just let me know.
Wow, sorry, I really just misread what you had said multiple times, my bad!! Yes, I have the query that uses the "interactions" tables separate from the query that does everything else, and I think it's best that it stays that way and that you write the join to combine them yourself.
The reason they're separated is that any time I write a query that touches the interactions tables, it ends up taking just as long as (if not longer than) the other query that returns something like 100 columns. Keeping them separate makes it A LOT easier to test out new columns when I add them, to debug, etc.
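A minimal sketch of the consumer-side join discussed above, for anyone combining the two files themselves. It assumes both CSVs share a user id column, here called `user_id`. That column name, the function name `join_rows`, and the sample columns are hypothetical, not taken from the actual files. It is a left join: every row of the stats file is kept, and interaction columns are added when a matching row exists.

```python
import csv
import io


def join_rows(stats_file, interactions_file, keys=("user_id",)):
    """Left-join interaction rows onto stats rows using the given key columns.

    Both arguments are open text files (or file-like objects) containing CSV.
    Returns a list of dicts, one per row in stats_file.
    """
    def key_of(row):
        return tuple(row[k] for k in keys)

    # Index the interactions file by key for constant-time lookups.
    interactions = {key_of(r): r for r in csv.DictReader(interactions_file)}

    joined = []
    for row in csv.DictReader(stats_file):
        # Stats values win if both files happen to share a non-key column.
        joined.append({**interactions.get(key_of(row), {}), **row})
    return joined


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two per-city files.
    stats = io.StringIO("user_id,labels\nu1,10\nu2,5\n")
    interactions = io.StringIO("user_id,clicks\nu1,42\n")
    for row in join_rows(stats, interactions):
        print(row)
```

For the merged format with a city column, passing `keys=("city", "user_id")` would keep a user id from one city from matching a row in another, assuming both files carry a `city` column.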
No worries. Makes sense. Thanks!
Ask @misaugstad why we have two separate data files for every city? Both files have a column for a user id, so shouldn't they be combined?