WinVector / rquery

Data Wrangling and Query Generating Operators for R. Distributed under choice of GPL-2 or GPL-3 license.
https://winvector.github.io/rquery/
Other
109 stars 15 forks source link

Column rename not working as expected #8

Closed prescient closed 5 years ago

prescient commented 5 years ago

Long winded explanation: In the flights data set there are two columns to join the airports data two. In the below example you can see that I generate two queries to rename columns (airports_origin and airports_destination) that are joined to the flights table. However, the airports_origin table seems to replace/overwrite the airports destination query. I'm not sure if this is working as expected or potentially I've done something quite dumb.

tbl_flights <- db_td(con, 'flights')
tbl_airports <- db_td(con, 'airports')
tbl_airlines <- db_td(con, 'airlines')    

#setup the airports queries
airports_destination <- tbl_airports %>%
  rename_columns(., 
                 c('destination_airport' %:=% 'iata_code',
                 'destination_airport_name' %:=% 'airport',
                 'destination_city' %:=% 'city',
                 'destination_state' %:=% 'state',
                 'destination_country' %:=% 'country',
                 'destination_latitude' %:=% 'latitude',
                 'destination_longitude' %:=% 'longitude'))

airports_origin <- tbl_airports %>%
  rename_columns(., 
                 c('origin_airport' %:=% 'iata_code',
                   'origin_airport_name' %:=% 'airport',
                   'origin_city' %:=% 'city',
                   'origin_state' %:=% 'state',
                   'origin_country' %:=% 'country',
                   'origin_latitude' %:=% 'latitude',
                   'origin_longitude' %:=% 'longitude'))

#checking to make sure i'm not doing something dumb, but this looks ok.
tst <- execute(con, airports_origin)
tst <- execute(con, airports_destination)

#join flights with airports
qry <- tbl_flights %.>%
  select_rows_nse(.,
                  arrival_delay < 0) %.>%
  natural_join(tbl_flights, airports_destination, by = 'destination_airport') %.>%
  natural_join(tbl_flights, airports_origin, by = 'origin_airport')

SQL Output: SELECT COALESCE("tsql_56246101586635550602_0000000001"."origin_airport", "tsql_56246101586635550602_0000000002"."origin_airport") AS "origin_airport", "tsql_56246101586635550602_0000000001"."fyear" AS "fyear", "tsql_56246101586635550602_0000000001"."fmonth" AS "fmonth", "tsql_56246101586635550602_0000000001"."fday" AS "fday", "tsql_56246101586635550602_0000000001"."day_of_week" AS "day_of_week", "tsql_56246101586635550602_0000000001"."airline" AS "airline", "tsql_56246101586635550602_0000000001"."flight_number" AS "flight_number", "tsql_56246101586635550602_0000000001"."tail_number" AS "tail_number", "tsql_56246101586635550602_0000000001"."destination_airport" AS "destination_airport", "tsql_56246101586635550602_0000000001"."scheduled_departure" AS "scheduled_departure", "tsql_56246101586635550602_0000000001"."departure_time" AS "departure_time", "tsql_56246101586635550602_0000000001"."departure_delay" AS "departure_delay", "tsql_56246101586635550602_0000000001"."taxi_out" AS "taxi_out", "tsql_56246101586635550602_0000000001"."wheels_off" AS "wheels_off", "tsql_56246101586635550602_0000000001"."scheduled_time" AS "scheduled_time", "tsql_56246101586635550602_0000000001"."elapsed_time" AS "elapsed_time", "tsql_56246101586635550602_0000000001"."air_time" AS "air_time", "tsql_56246101586635550602_0000000001"."distance" AS "distance", "tsql_56246101586635550602_0000000001"."wheels_on" AS "wheels_on", "tsql_56246101586635550602_0000000001"."taxi_in" AS "taxi_in", "tsql_56246101586635550602_0000000001"."scheduled_arrival" AS "scheduled_arrival", "tsql_56246101586635550602_0000000001"."arrival_time" AS "arrival_time", "tsql_56246101586635550602_0000000001"."arrival_delay" AS "arrival_delay", "tsql_56246101586635550602_0000000001"."diverted" AS "diverted", "tsql_56246101586635550602_0000000001"."cancelled" AS "cancelled", "tsql_56246101586635550602_0000000001"."cancellation_reason" AS "cancellation_reason", "tsql_56246101586635550602_0000000001"."air_system_delay" AS "air_system_delay", "tsql_56246101586635550602_0000000001"."security_delay" AS "security_delay", "tsql_56246101586635550602_0000000001"."airline_delay" AS "airline_delay", "tsql_56246101586635550602_0000000001"."late_aircraft_delay" AS "late_aircraft_delay", "tsql_56246101586635550602_0000000001"."weather_delay" AS "weather_delay", "tsql_56246101586635550602_0000000002"."origin_airport_name" AS "origin_airport_name", "tsql_56246101586635550602_0000000002"."origin_city" AS "origin_city", "tsql_56246101586635550602_0000000002"."origin_state" AS "origin_state", "tsql_56246101586635550602_0000000002"."origin_country" AS "origin_country", "tsql_56246101586635550602_0000000002"."origin_latitude" AS "origin_latitude", "tsql_56246101586635550602_0000000002"."origin_longitude" AS "origin_longitude" FROM ( SELECT "fyear", "fmonth", "fday", "day_of_week", "airline", "flight_number", "tail_number", "origin_airport", "destination_airport", "scheduled_departure", "departure_time", "departure_delay", "taxi_out", "wheels_off", "scheduled_time", "elapsed_time", "air_time", "distance", "wheels_on", "taxi_in", "scheduled_arrival", "arrival_time", "arrival_delay", "diverted", "cancelled", "cancellation_reason", "air_system_delay", "security_delay", "airline_delay", "late_aircraft_delay", "weather_delay" FROM "flights" ) "tsql_56246101586635550602_0000000001" INNER JOIN ( SELECT "iata_code" AS "origin_airport", "airport" AS "origin_airport_name", "city" AS "origin_city", "state" AS "origin_state", "country" AS "origin_country", "latitude" AS "origin_latitude", "longitude" AS "origin_longitude" FROM ( SELECT "iata_code", "airport", "city", "state", "country", "latitude", "longitude" FROM "airports" ) tsql_56246101586635550602_0000000000 ) "tsql_56246101586635550602_0000000002" ON "tsql_56246101586635550602_0000000001"."origin_airport" = "tsql_56246101586635550602_0000000002"."origin_airport"

prescient commented 5 years ago

I also noticed that it seems to be dropping the where clause from select_rows_nse.

qry <- tbl_flights %.>%
                  select_rows(.,
                              arrival_delay < 0) 
cat(to_sql(qry, con))

Interestingly it looks fine if we only execute the above section of the query. Sql out:

SELECT * FROM (
 SELECT
  "fyear",
  "fmonth",
  "fday",
  "day_of_week",
  "airline",
  "flight_number",
  "tail_number",
  "origin_airport",
  "destination_airport",
  "scheduled_departure",
  "departure_time",
  "departure_delay",
  "taxi_out",
  "wheels_off",
  "scheduled_time",
  "elapsed_time",
  "air_time",
  "distance",
  "wheels_on",
  "taxi_in",
  "scheduled_arrival",
  "arrival_time",
  "arrival_delay",
  "diverted",
  "cancelled",
  "cancellation_reason",
  "air_system_delay",
  "security_delay",
  "airline_delay",
  "late_aircraft_delay",
  "weather_delay"
 FROM
  "flights"
) tsql_45671901082776237042_0000000000
WHERE "arrival_delay" < 0
prescient commented 5 years ago

User error.....

JohnMount commented 5 years ago

No problem, I was just getting around to looking into it. I appreciate your help and support.

prescient commented 5 years ago

As I've said before I appreciate your work on this package. I think it is a terrific given how flexible it is. To publicize it a bit I'm presenting it to a bunch of folks at my company (KPMG) tomorrow to hopefully get the word out.

JohnMount commented 5 years ago

Wow, that is great news! Thank you very much, that means a lot to our group. If you need anything please reach out, my email is jmount@win-vector.com .