degauss-org / dht

DeGAUSS helper tools
http://degauss.org/dht
GNU General Public License v3.0

Duplicated rows after geocoding #93

Closed · qing-duan closed this issue 1 year ago

qing-duan commented 1 year ago

Some rows are duplicated after geocoding. I checked the temporary file ending with geocoder_3.2.1_score_threshold_0.5.csv and found no duplication there, which means the duplication happens during the left_join step.

Below is an example using the unique addresses affected by the duplication issue (which I pulled from the geocoded data) together with the temporary geocoded addresses.

Example code:

```r
d.dup <- read_csv("dup_addresses_for_testing.csv")
dim(d.dup)
#> [1] 128 1
```

```r
d.geo <- read_csv("temp_degauss_geocoder_3.2.1_score_threshold_0.5.csv")
dim(d.geo) # the original dataset also has 84787 unique addresses, so no duplication during geocoding
#> [1] 84787 10

out <- dplyr::left_join(d.dup, d.geo, by = "address", na_matches = "never")
dim(out) # it is supposed to be 128...
#> [1] 255 10

d.postal <- read_csv("temp_degauss_postal_0.1.3.csv")
# interestingly, the output from the postal container in the previous step merged with no problem
out <- dplyr::left_join(d.dup, d.postal, by = "address", na_matches = "never")
dim(out)
#> [1] 128 14
```
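One quick way to check which join keys fan out (a sketch, assuming `d.dup` and `d.geo` are loaded as above): any address that appears more than once in `d.geo` will be duplicated by the left join.

```r
# addresses that occur more than once in the geocoder output
d.geo |>
  dplyr::count(address) |>
  dplyr::filter(n > 1)
```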

qing-duan commented 1 year ago

Files attached:

- temp_degauss_geocoder_3.2.1_score_threshold_0.5.csv
- dup_addresses_for_testing.csv
- temp_degauss_postal_0.1.3.csv

cole-brokamp commented 1 year ago

I think you are right that this has to do with the left_join step. Is that code from somewhere specific? Could you show the R command that produces the unexpected duplicates? Usually the degauss_run function will make sure only a unique set of addresses gets passed to a container and that those are merged back into the raw data without duplicating.
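A minimal sketch of that dedupe-and-merge pattern (toy data made up for illustration; this is not the dht source):

```r
library(dplyr)

raw <- tibble::tibble(id = 1:4, address = c("a", "a", "b", "c"))

# only the unique addresses get passed to the container
unique_addresses <- distinct(raw, address)

# stand-in for the container's geocoded output, one row per unique address
geocoded <- mutate(unique_addresses, lat = c(1, 2, 3), lon = c(4, 5, 6))

# merged back into the raw data: still 4 rows, because the right side has unique keys
left_join(raw, geocoded, by = "address")
```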

qing-duan commented 1 year ago

I pulled the left_join() from degauss_run (see the relevant part of the function source below). I am not sure why most addresses merged with no problem but a few were duplicated.

```r
degauss_run <- function(.x, image, version = "latest", argument = NA, quiet = FALSE) {
  tf <- fs::file_temp(ext = ".csv", pattern = "degauss")
  # write only the unique rows of the DeGAUSS input columns to a temporary file
  degauss_input_names <- names(.x)[names(.x) %in% c("address", "lat", "lon", "start_date", "end_date")]
  readr::write_csv(unique(dplyr::select(.x, tidyselect::all_of(degauss_input_names))), tf)
  # build and run the docker command for the requested container
  degauss_cmd <- make_degauss_command(input_file = basename(tf), image = image,
                                      version = version, argument = argument)
  degauss_cmd <- gsub("$PWD", fs::path_dir(tf), degauss_cmd, fixed = TRUE)
  system(degauss_cmd, ignore.stdout = quiet, ignore.stderr = quiet)
  # find the container's output file (same stem as the input file)
  out_files <- fs::dir_ls(fs::path_dir(tf), glob = paste0(fs::path_ext_remove(tf), "*.csv"))
  out_file <- out_files[!out_files == tf]
  .x_output <- suppressWarnings(readr::read_csv(
    file = out_file,
    col_types = readr::cols(
      lat = "d", lon = "d",
      census_tract_id = "c", census_tract_vintage = "c", fips_tract_id = "c",
      drive_time = "c", matched_zip = "c", year = "i", nlcd_year = "i",
      census_tract_id_2020 = "c", census_tract_id_2010 = "c", census_tract_id_2000 = "c",
      census_tract_id_1990 = "c", census_tract_id_1980 = "c", census_tract_id_1970 = "c",
      census_block_group_id_2020 = "c", census_block_group_id_2010 = "c",
      census_block_group_id_2000 = "c", census_block_group_id_1990 = "c",
      start_date = "D", end_date = "D"
    ),
    show_col_types = FALSE
  ))
  # merge the container output back into the original data
  out <- dplyr::left_join(.x, .x_output, by = degauss_input_names, na_matches = "never")
  return(out)
}
```
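If the container output ever contains repeated keys, that final left_join will fan out. A hedged sketch of a guard that could surface this (not part of dht; `.x_output` and `degauss_input_names` as in the function above):

```r
# rows of container output per join key; any n > 1 duplicates input rows on merge
dup_keys <- .x_output |>
  dplyr::count(dplyr::across(tidyselect::all_of(degauss_input_names))) |>
  dplyr::filter(n > 1)

if (nrow(dup_keys) > 0) {
  warning(nrow(dup_keys), " join key(s) repeat in the container output; left_join will add rows")
}
```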


cole-brokamp commented 1 year ago

Do you have the same problem if you use the degauss_run function directly instead of reading in and joining the temporary files yourself?

Is there an example degauss_run command that generates the duplicates?

qing-duan commented 1 year ago

Hi Cole,

This issue came from running the degauss_run function. All the R code can be found in the RISEUP admission rate R markdown file; the relevant part is:

```r
d <- d |>
  degauss_run("geocoder", "3.2.1", quiet = FALSE) |> # duplicated records: n = 136949
  distinct(.keep_all = TRUE)                         # keep distinct rows: n = 135871
```

As you can see in the code, I removed the duplicated records in a separate step. I tried to explore where the problem was and came up with the examples included in this GitHub issue.
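For reference, a quick way to quantify what that distinct() step drops (a sketch; `d` here is the degauss_run output before deduplication):

```r
# count of fully duplicated rows; 136949 - 135871 = 1078 in the run above
sum(duplicated(d))
```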

This is not an urgent problem for now; I can explain when we meet next time. Thanks!

Qing
