Open mbcann01 opened 7 months ago
Initial review of the APS subject identifier data suggested that the client_id variable would be largely valid as-is, with only a few "failed matches" caused by multiple client_id values being associated with a single case_id (each case should belong to only one subject). The data largely appeared to be clean and ready to use.
However, further examination found significant typographical and other errors in the identifier fields, and a much larger number of "failed matches" for subject ID values. As such, the data requires significant cleaning before a fuzzy-matching algorithm can be applied, and a within-set APS subject ID needs to be created.
Cleaning of APS data is underway.
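For reference, the "failed match" check described above (a case_id associated with more than one client_id) could be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual code: the function name `find_failed_matches` and the sample rows are hypothetical, and only the column names `case_id` and `client_id` come from the issue text.

```python
from collections import defaultdict


def find_failed_matches(rows):
    """Return {case_id: sorted client_ids} for cases linked to more than one client_id.

    `rows` is any iterable of dicts with "case_id" and "client_id" keys.
    """
    clients_by_case = defaultdict(set)
    for row in rows:
        clients_by_case[row["case_id"]].add(row["client_id"])
    # A case should belong to a single subject, so more than one
    # distinct client_id for the same case_id is a "failed match".
    return {
        case_id: sorted(client_ids)
        for case_id, client_ids in clients_by_case.items()
        if len(client_ids) > 1
    }


# Made-up example rows (not real APS data).
rows = [
    {"case_id": "C1", "client_id": "A"},
    {"case_id": "C1", "client_id": "A"},
    {"case_id": "C2", "client_id": "B"},
    {"case_id": "C2", "client_id": "D"},
]
failed = find_failed_matches(rows)
```

Counting the keys of the returned dict gives a quick measure of how many cases need attention before matching.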
Name fields:
Potentially valuable information (such as "female" if a name was given as "unknown female" or a suffix trimmed from a name value) is being shifted to a comment field, so it is available in manual review of fuzzy-match pairs.
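A rough sketch of the idea of moving information out of a name value and into a notes field, so it stays available during manual review. Everything here is an assumption made for illustration: the function name `clean_name`, the suffix and placeholder lists, and the wording of the notes are hypothetical and not taken from the project code.

```python
import re

# Hypothetical lookup sets; the real cleaning rules are not described in this issue.
SUFFIXES = {"JR", "SR", "II", "III", "IV"}
PLACEHOLDERS = {"UNKNOWN", "UNK"}
SEX_WORDS = {"FEMALE", "MALE"}


def clean_name(raw):
    """Return (cleaned_name_or_None, notes) for a raw name value."""
    notes = []
    kept = []
    # Replace anything that is not a letter, whitespace, apostrophe, or hyphen.
    tokens = re.sub(r"[^A-Za-z\s'-]", " ", raw).upper().split()
    for token in tokens:
        if token in PLACEHOLDERS:
            continue  # dropped; the original value is preserved in the notes
        if token in SEX_WORDS:
            notes.append(f"sex from name field: {token.lower()}")
            continue
        if token in SUFFIXES:
            notes.append(f"suffix removed: {token}")
            continue
        kept.append(token)
    cleaned = " ".join(kept)
    # Keep the original value whenever cleaning changed it.
    if raw.strip().upper() != cleaned:
        notes.insert(0, f"original value: {raw.strip()}")
    return (cleaned or None), notes
```

With these assumed rules, `clean_name("unknown female")` returns no usable name but keeps "female" and the original value in the notes, and `clean_name("Smith Jr.")` returns `"SMITH"` with the suffix recorded.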
Additionally, some exploration of address values has been completed.
As of today, further progress has been made in cleaning/standardizing the APS Client data.
Name fields are clean! :tada:
Address fields are pending only street address cleaning/validation:
As of today, further progress has been made in cleaning/standardizing the APS Client data.
Address fields are pending only street address cleaning/validation completion. Street Addresses are a bear.
As of today, further progress has been made in cleaning/standardizing the APS Client data.
Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.
As of today, further progress has been made in cleaning/standardizing the APS Client data. Separation of secondary address values should be complete (within reason) at this time, though QC checks are designed to help catch other potential remaining values.
Address fields are pending only street address and street unit cleaning/validation completion. Street Addresses are a bear.
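To illustrate what separating secondary address values (units) from the street address can look like, here is a minimal sketch. The designator list, the regular expressions, and the function names `split_unit` and `needs_review` are assumptions for illustration only; the actual rules and QC checks used for the APS data are not shown in this issue.

```python
import re

# Hypothetical unit designators. Word designators must be followed by
# whitespace; "#" may be attached directly to the unit value.
UNIT_AT_END = re.compile(
    r"\s+(?:(APT|UNIT|STE|SUITE|LOT|TRLR)\s+|(#)\s*)([A-Z0-9-]+)$"
)
# QC pattern: any designator still present anywhere in the street value.
UNIT_ANYWHERE = re.compile(r"#|\b(?:APT|UNIT|STE|SUITE|LOT|TRLR)\b")


def split_unit(street):
    """Return (street_without_unit, unit_or_None) for a trailing unit value."""
    value = " ".join(street.upper().split())  # uppercase, collapse whitespace
    match = UNIT_AT_END.search(value)
    if not match:
        return value, None
    designator = match.group(1) or match.group(2)
    return value[: match.start()], f"{designator} {match.group(3)}"


def needs_review(street):
    """QC flag: True if a unit designator still appears in the street value."""
    return bool(UNIT_ANYWHERE.search(street))
```

A value such as `"123 Main St Apt 4B"` would split into `"123 MAIN ST"` and `"APT 4B"`, while a value with a unit in the middle of the string would be left alone by `split_unit` and caught by `needs_review`.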
Progress was slowed as an error was found in the "client_notes" field processing. This variable has significant potential value in expediting manual review of pairs in fuzzy matching, as it's used to hold original values and other notes from this clean/prep process. The entire code file was systematically reviewed, point fixes were made, and it should be completely resolved at this time.
As of today, further progress has been made in cleaning/standardizing the APS Client data. Street address values at the maximum length for the field (30 characters), which were vulnerable to unusual issues that might fail standardization processing due to truncation, have been resolved. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway. Street Addresses remain a bear.
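A small sketch of how street address values at the maximum field length might be flagged for review as possibly truncated. The 30-character limit comes from the update above; the function name `flag_possible_truncation` and the sample values are hypothetical.

```python
MAX_STREET_LEN = 30  # field width stated in the update above


def flag_possible_truncation(streets, max_len=MAX_STREET_LEN):
    """Return the street values whose length reaches the field limit.

    A value exactly at the limit may have been cut off mid-word, which
    can cause pattern-based standardization to miss it.
    """
    return [street for street in streets if len(street) >= max_len]


# Made-up example values (not real APS data).
sample = ["1234 NORTH FARM TO MARKET ROAD", "12 OAK ST"]
flagged = flag_possible_truncation(sample)
```

Flagged values still need manual or rule-based resolution; the check only narrows down where to look.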
After the prior pause in progress, I am getting back into the flow of the task and expect to pick up the pace in the coming weeks. My latest commits will arrive later this evening, as my internet connection is unusually slow at the moment.
As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway. County Roads were standardized last week. Farm to Market roads are currently undergoing standardization, with the majority of the 10,000+ impacted observations identified and sorted.
Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.
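As a starting point for the protocol mentioned above, standardizing common street address elements could be sketched with ordered regular expression rules. The patterns and the target forms ("COUNTY ROAD", "FM", "PRIVATE ROAD") are assumptions chosen for illustration, not the conventions actually used for the APS data, and they cover only a few spelling variants.

```python
import re

# (pattern, replacement) pairs applied in order. Each pattern requires a
# road number to follow, so ordinary words are left untouched.
ROAD_RULES = [
    (re.compile(r"\b(?:COUNTY\s+(?:ROAD|RD)|CO\s+RD|CR)\.?\s*(?=\d)"),
     "COUNTY ROAD "),
    (re.compile(r"\b(?:FARM\s+TO\s+MARKET|FM)(?:\s+(?:ROAD|RD))?\.?\s*(?=\d)"),
     "FM "),
    (re.compile(r"\b(?:PRIVATE\s+(?:ROAD|RD)|PVT\s+RD|PR)\.?\s*(?=\d)"),
     "PRIVATE ROAD "),
]


def standardize_road_type(street):
    """Uppercase, collapse whitespace, and rewrite known road-type variants."""
    value = " ".join(street.upper().split())
    for pattern, replacement in ROAD_RULES:
        value = pattern.sub(replacement, value)
    return value
```

Under these assumed rules, `"123 cr 405"` and `"123 Co Rd 405"` both become `"123 COUNTY ROAD 405"`. Keeping the rules in a single ordered list makes it easier to add the remaining categories (rural routes, highways, cardinal directions) one at a time and to re-run QC after each addition.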
As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway. Farm to Market Roads have been standardized. Private Roads are in-process. Next will be rural routes/roads, and highway/freeway/interstate elements.
Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.
As of today, further progress has been made in cleaning/standardizing the APS Client data. Standardization of common elements of street address values (county road, farm to market, private road, etc.) is currently underway. Private Roads have been standardized. Rural/Ranch Roads/Routes are in-process. Next will be highway/freeway/interstate elements, then cardinal directions. After that come QC checks.
Street Addresses remain a bear - there will be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.
As of today, further progress has been made in cleaning/standardizing the APS Client data. Street address values are in QC checks. Current progress is on track to have the APS Client data clean/prep complete by the end of October. As previously stated, there will likely be benefit to taking time to write out a standardized protocol/workflow for similar tasks once this data set is prepped/cleaned for fuzzy matching.
As of today, the APS client data is cleaned and prepped for fuzzy matching!
The next step of the process is creating a within-set APS subject ID from this data.
APS Within-Set Fuzzy Matching is in progress. Due to the size of the data, 5 chunks were required. These chunks have had initial fuzzy-matching performed, and will be individually cleaned before they are iteratively "folded" back into a single set through between-set matching and cleaning.
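The chunk-then-fold approach described above can be illustrated with a simplified sketch. This is not the fuzzy-matching method used in the project (which is not described in this issue, and which includes manual cleaning of each chunk): here Python's standard-library `difflib.SequenceMatcher` stands in for the matcher, each record is a single string, groups are compared only by their first member, and the 0.85 threshold and all function names are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # hypothetical similarity cutoff


def is_match(a, b, threshold=THRESHOLD):
    """Treat two record strings as the same subject if they are similar enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def merge_groups(base, incoming):
    """Fold `incoming` groups into `base`, comparing each group's first member."""
    for group in incoming:
        for existing in base:
            if is_match(existing[0], group[0]):
                existing.extend(group)
                break
        else:
            # No existing group matched, so this group starts a new subject.
            base.append(list(group))
    return base


def match_within(records):
    """Within-chunk matching: every record starts as its own group."""
    return merge_groups([], [[record] for record in records])


def chunked(records, size):
    """Split the records into consecutive chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def assign_subject_ids(records, chunk_size):
    """Match within each chunk, then fold the chunks back into one set."""
    folded = []
    for chunk in chunked(records, chunk_size):
        folded = merge_groups(folded, match_within(chunk))
    return {
        record: subject_id
        for subject_id, group in enumerate(folded, start=1)
        for record in group
    }
```

Comparing only group representatives keeps the number of comparisons manageable as chunks are folded together, at the cost of missing matches that a full pairwise comparison would find, which is one reason manual review of pairs remains part of the process.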
Overview
On 2024-03-21, Catherine sent us a new batch of APS data. We need to merge the APS outcomes with our DETECT screenings for a publication.
We want to link APS investigation outcomes to DETECT screenings completed by MedStar during the R01 phase of DETECT. The DETECT data we want to use for linking is participant_import.rds.
Links
Tasks