brad-cannell / detect_pilot_test_1y

Detection of Elder abuse Through Emergency Care Technicians 1-year Pilot Study
https://brad-cannell.github.io/detect_pilot_test_1y/

Merge the APS and MedStar datasets into a single dataset for analysis #33

Closed mbcann01 closed 1 year ago

mbcann01 commented 1 year ago

Overview

Complications

Software

Here is a list of software packages we have already tried, with mixed results.

Tasks

Depending on how involved each of these tasks is, and on Morri's workflow, it may make sense to break some of them off into their own separate issues.

corvidfox commented 1 year ago

Do we want to do any linkage of the MedStar ePCR and MedStar Compliance data? There does seem to be a linking identifier in both data sets (Response Number), which could make that reconciliation considerably less complicated than the overall APS/MedStar merge. For that reason, I personally think it could happen either before or after any big APS/MedStar merge.
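
If we do, here is a minimal sketch of what that linkage could look like (the data frame and column names, like medstar_epcr and response_num, are placeholders rather than the actual variable names in our data):

library(dplyr)

# Hypothetical sketch: attach Compliance records to ePCR records on the
# shared Response Number; suffixes disambiguate any overlapping columns.
epcr_compliance <- medstar_epcr %>%
  left_join(medstar_compliance, by = "response_num",
            suffix = c("_epcr", "_comp"))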

corvidfox commented 1 year ago

Once the data is uploaded and in an accessible format, I'd like to do some exploratory coding to check to see if I can help with the memory and time management of things.

From what I could see of the code, it looked like it was written by someone who was R-native. Unlike R, base Python has no data frame structure (pandas DataFrames and R-style tibbles aren't built in), and there's some finesse to memory management and optimization. However, Python has a lot of analytical strengths that have made it a leader in machine learning (though Julia is becoming a bigger competitor), so there may be something I can do with one of the machine learning packages.

Dedupe does have some point-and-click, but it writes the results of the training to a JSON file that can be used to recreate the same results, so it has reproducibility built in 🎉. The idea is that you might train on a smaller data set, then apply the results of that training to a larger one. It's based on human judgement, which is both a strength and a limitation: the labeling process would need to be standardized, and the number of "matches" and "rejections" required for a successful training is a methodological consideration. I can look into this more and let you know exactly how it works if it looks like the best option. However, it does NOT automatically condense anything; it just adds additional adjacent rows and identifiers. So it was growing almost exponentially when the pandas DataFrame wasn't already memory optimized, and you would still have to manually sort through the potential duplicates.

There are a LOT of different Python options. Once I can actually see the size of the various elements in the data sets, and see if any cleaning/recoding helps optimize memory, I could give a better idea of what approaches are potentially viable or not.

mbcann01 commented 1 year ago

@corvidfox, I guess it doesn't hurt to go ahead and join the Compliance data to the ePCR data. I'm not sure if we will end up using it, but it doesn't sound like a heavy lift.

mbcann01 commented 1 year ago

@corvidfox Thank you for all of the info on Python Dedupe! I look forward to seeing what you figure out!

corvidfox commented 1 year ago

@mbcann01 So I've been looking at fastLink's source code. If I'm understanding things correctly, it's effective for the project's purpose because it multithreads/multiprocesses the data in about as memory-optimized a fashion as R is capable of, using atomic vectors and matrices and "chunking" the data rather efficiently. It also calls for garbage collection after each step and keeps its variables contained, which both minimizes the memory needed for each step and maximizes the memory released after each step completes.

Addressing the issue posted to fastLink

I can't see any issue with your potential solution, beyond maybe a memory concern, and I doubt there are any actual issues. The catch seems to be built in for convenience, on the assumption that you only intend to dedupe a single data set rather than wanting the confusion matrix. For convenience, this is the code snippet you highlighted:

if (identical(dfA, dfB)) {
  cat("dfA and dfB are identical, assuming deduplication of a single data set.\nSetting return.all to FALSE.\n\n")
  dedupe.matches <- FALSE
  return.all <- FALSE
  dedupe.df <- TRUE
}

The "problem variable" that cuts down the posterior probabilities when return.all=FALSE is threshold.match, which can be passed to the function manually with values from 0 to 1 and a default of 0.85. I don't see how threshold.match=0.0 wouldn't return all values, as return.all=TRUE itself only sets threshold.match=0.001.

The loss of return.all=TRUE has 2 main effects, which might be "snipped" from the code for our purpose:

  1. The loss of return.all=TRUE means we don't trigger class(out) <- c("fastLink", "confusionTable")
  2. The addition of dedupe.df=TRUE means we do trigger class(out) <- c(class(out), "fastLink.dedupe")

For convenience, here is the original fastlink_out code you posted in the issue:

fastlink_out <- fastLink::fastLink(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  dedupe.matches = FALSE,
  return.all = TRUE
)

This would cause fastlink_out to inherit from classes ("fastLink", "fastLink.dedupe"), since the "identical catch" sets dedupe.df=TRUE, while threshold.match=0.001 remains from the original return.all=TRUE before it was overridden.

What about:

fastlink_out <- fastLink::fastLink(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  dedupe.matches = FALSE,
  return.all = FALSE,
  threshold.match = 0.0
)

class(fastlink_out) <- c(class(fastlink_out), "confusionTable")

This might give us a similar result to if that catch did not exist, without altering the source code of fastLink.
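
As a quick sanity check (a sketch), the class vector should then carry all three classes:

# After the manual class() assignment, the object should inherit from:
class(fastlink_out)
# [1] "fastLink"        "fastLink.dedupe" "confusionTable"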

My own issues trying to run fastLink from the example data

That being said, when I attempted to test this theory in R using the sample data you provided in the posted issue, I got an odd error that I don't think I have the theoretical background to troubleshoot.

For convenience, the sample data you'd posted in the issue:

library(dplyr)

df <- tibble(
  incident   = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008),
  nm_first   = c("john", "john", "jane", "jon", "jane", "joy", "michael", "amy"),
  nm_last    = c(rep("smith", 7), "jones"),
  sex        = c("m", "m", "f", "m", "f", "f", "m", "f"),
  birth_mnth = c(9, 9, 2, 9, 3, 8, 9, 1),
  birth_year = c(1936, 1936, 1937, 1936, 1937, 1941, 1936, 1947),
  add_num    = c(101, 101, 14, 101, 14, 101, 101, 1405),
  add_street = c("main", "main", "elm", "main", "elm", "main", "main", "texas")
) %>% 
  mutate(row = row_number()) %>% 
  select(row, everything()) %>% 
  print()

df_unique_combo <- df %>% 
  select(-row) %>% 
  mutate(group = paste(nm_first, nm_last, birth_year, birth_mnth, add_num, add_street, sep = "_")) %>%
  group_by(group) %>% 
  filter(row_number() == 1) %>% 
  ungroup()
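
Incidentally, the same first-row-per-combination dedupe can be written more compactly with distinct() (a sketch; it's equivalent apart from skipping the helper group column):

df_unique_combo <- df %>%
  select(-row) %>%
  distinct(nm_first, nm_last, birth_year, birth_mnth, add_num, add_street,
           .keep_all = TRUE)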

When running the following code:

fl_test_out <- fastLink::fastLink(
  dfA = df_unique_combo, 
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  verbose = TRUE # for troubleshooting
)

I received an error (screenshot attached in the original comment).

In troubleshooting, I found that the function gammaCK2par did not seem to recognize identical values of nm_first with the default cut.a = 0.94 (but did for a cut.a of 0.92 or less), while for nm_last it did not recognize identical values at all unless cut.a = 0, which seemed untenable. This may come down to me not fully understanding the theory of Jaro-Winkler, but I don't see why identical values weren't matched without reducing the cut.a value.
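
As a sanity check on the Jaro-Winkler scores themselves, here is a sketch using the stringdist package (which I believe fastLink uses internally): identical strings have distance 0, i.e. similarity 1, so they should clear any cut.a below 1.

library(stringdist)

# Jaro-Winkler similarity = 1 - Jaro-Winkler distance (p is the prefix weight).
1 - stringdist("smith", "smith", method = "jw", p = 0.1)  # 1
1 - stringdist("smith", "smyth", method = "jw", p = 0.1)  # ~0.89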

The big error that stopped everything came from gammaNUMCK2par, which seems to index outside the matrix boundaries in how it communicates with foreach to repeatedly execute a function. The matrix in question should, in theory, organize the values of the variable being processed into columns, one for each unique value of the variable. I'm honestly not sure how to fix that at this point.

corvidfox commented 1 year ago

@mbcann01 I managed to get a lot of my "roadblocks" fixed. Part of the problem was some sort of .dll permission issue with rlang, of all things, but like I said, I got it fixed.

There were two major issues I've continued to experience with small datasets:

  1. fastLink simultaneously telling me I had no variation in the entries for a variable, and also no identical entries for a variable (since those are mutually exclusive, that's definitely an error)
  2. gammaNUMCK2par attempting to index a matrix outside of its range.

This seems to be isolated to small data sets, which is why it flagged on your sample data of 7 rows but not on their sample data of 510 rows. My guess is that it's an edge case.

BUT! That's a non-issue with a large data set, which ours are (of course, hence our issues).

I had an idea that I thought was a bit dumb, but it seems to work: when passing two identical dataframes, you can simply append a "junk row" of missing values to make the "second" dataframe "non-identical." This completely circumvents the "identical dataframe" check that tries to "help."
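
A quick sketch of why this works, since the catch fires on identical():

# fastLink's "identical data frame" catch is defeated by one junk row:
identical(dfA, dfA)             # TRUE  -> triggers the dedupe branch
identical(dfA, rbind(dfA, NA))  # FALSE -> treated as a normal two-file merge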

Using the sample data that fastLink posted (with data types modified by me):

library(dplyr)
library(fastLink)

data(samplematch)
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE), ])
dfA <- dfA %>% mutate(across(where(is.factor), as.character))
dfA$housenum <- as.numeric(dfA$housenum)

I was able to run:

fl_out <- fastLink(
  dfA = dfA,
  dfB = rbind(dfA, NA),
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "lastname", "streetname", "city"),
  numeric.match = c("birthyear", "housenum"),
  dedupe.matches = FALSE,
  return.all = TRUE,
  threshold.match = 0
  # verbose = TRUE # for troubleshooting and times
)

And execute your beautifully written alternative to getMatches to see all posterior probabilities:

fmr_fastlink_stack_matches(fl_out, dfA)

I repeated this on the ePCR data from MedStar, as well as APS, and then "linked them together" to test the time and memory.

In the "roughest" of these tests, my machine was able to do this in less than 3 minutes, using only about 13 GB of RAM and 20% CPU. My machine isn't particularly fancy: an Intel Core i7-11370H (quad core with two-way hyper-threading, 3.3 GHz) with 32 GB of RAM. That gives me hope that fastLink can be a viable solution.
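
For future runs, a minimal way to capture the wall-clock time is base R's system.time() wrapped around the fastLink call shown above (a sketch):

timing <- system.time(
  fl_out <- fastLink(
    dfA = dfA, dfB = rbind(dfA, NA),
    varnames = c("firstname", "lastname", "housenum",
                 "streetname", "city", "birthyear"),
    stringdist.match = c("firstname", "lastname", "streetname", "city"),
    numeric.match = c("birthyear", "housenum"),
    dedupe.matches = FALSE, return.all = TRUE, threshold.match = 0
  )
)
timing["elapsed"]  # wall-clock seconds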

My next steps would be to get the individual data sets cleaned, de-duplicated as much as possible, and organized in preparation for a merge. 🎉

corvidfox commented 1 year ago

This week I've made large progress towards one of the major goals of this issue:

I have:

Roadblocks I plan to focus on next:


corvidfox commented 1 year ago

This week I was able to:

I'll look over it again with fresher eyes next week to ensure it really is done, and fix some format issues.

So far it doesn't look like it's a good idea to consolidate the data down to a single row per group: some groups appear to be the same person who has either been listed at more than one address or goes by at least one other name, so consolidating would likely produce a large number of mismatches between the APS and MedStar data. That is a consideration for later in the process.

corvidfox commented 1 year ago

Accidentally closed by attaching this issue to a pull request for partial completion. Reopened because the task is ongoing.

corvidfox commented 1 year ago

This week I was able to finalize the unique IDs in the MedStar data and link the observations that appear in both the ePCR and Compliance data. Since Compliance does not have many identifiers, only Response Number produced any credible connections.

I also drafted a preliminary codebook for the MedStar data. I'll be able to polish that up next week, and then that's one of the two data sets ready to be used individually for some sort of analysis.

corvidfox commented 1 year ago

This week I was able to do the initial clean of the APS data set, including a preliminary codebook. We are awaiting feedback from APS to clarify some observations. The APS data include a unique subject ID, which did not appear to have any false matches.

Initial exploration of merging the APS and MedStar data sets (through fastLink pairing) has started. I'm finding some failed matches for the APS Person ID as I explore variable combinations for the merge. I'll continue to assess so I can decide whether I should do a fastLink match within the APS data to create a unique subject ID, similar to the one I made in the MedStar data.
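
If a within-APS fastLink match does turn out to be necessary, one standard pattern (a sketch, not necessarily how the MedStar IDs were built) is to treat the matched row pairs as edges of a graph and assign one subject ID per connected component:

library(igraph)

# Hypothetical sketch: fl_aps is a fastLink result from matching the APS data
# against itself; inds.a/inds.b are the matched row indices it returns.
pairs <- data.frame(a = fl_aps$matches$inds.a, b = fl_aps$matches$inds.b)

g    <- graph_from_data_frame(pairs, directed = FALSE)
comp <- components(g)

# One provisional subject ID per connected component of matched rows.
subject_ids <- data.frame(
  row     = as.integer(names(comp$membership)),
  subj_id = as.integer(comp$membership)
)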

Files are in pull request #47

corvidfox commented 1 year ago

This week I made some progress on how I'm manually reviewing the matches between the MedStar and APS data sets.

Should have feedback from APS this week, which should (hopefully) help resolve the remaining issues in the APS data set.

corvidfox commented 1 year ago

As of pull request #47, the MedStar/APS merge map is complete, along with some revisions to the source subject IDs in both data sets. The unique ID linking both data sets has been added to the original data sets.

corvidfox commented 1 year ago

As of Pull Request #49, the Intake-Response pairs have been identified. Initial merges have been created.

corvidfox commented 1 year ago

As of pull request #50, three merges have been created, along with codebooks for each.

mbcann01 commented 1 year ago

Hi @corvidfox, I haven't had a chance to look at the actual codebooks yet. I'm viewing this on my phone, but it sounds great! Thank you!