Closed krlmlr closed 3 years ago
what do you mean with point one: Set of municipalities identical for all source years? Don't they change when a merger happens, so they are not identical over the years...?
I think the task is to check if all municipalities that exist in a given year properly appear in our mapping tables, either as a target or as a source municipality, or both. Please correct me, if this is a wrong assumption.
I see. Yea I just thought the same.
I tweaked the original description.
there might be a problem with the source municipalities.
Assumption: for each target year, the source municipalities should be the same. e.g a distinct() of all source municipalities of the source year 2005 should be the same for all target years. Thus the length of the distinct(mun_id_x, mun_id_y) of a certain source year should be the same over all target years. (reprex does not work since the files can't be loaded)
library(dplyr)
library(purrr)
library(here)
read_mun_csv <- function(file){
target_year <- sub(".csv", "", basename(file))
df <- read.csv(file) %>%
mutate(target_year = target_year)
return(df)
}
all_files <- list.files(here::here("inst/csv/flat"))
file_paths <- paste0(here::here("inst/csv/flat/"), all_files)
all_files <- file_paths %>% map_df(~read_mun_csv(.))
data_source_year <- all_files %>%
filter(year == 2005)
length_target_year <- data_source_year %>%
filter(target_year == 2005) %>%
distinct(mun_id_x, mun_id_y) %>%
nrow()
source_year_data <- data_source_year %>%
distinct(target_year, mun_id_x, short_name_x) %>%
group_by(target_year) %>%
summarize(count_2005 = n())
output:
target_year | count_distinct_mun_source_year_2005 |
---|---|
2005 | 2781 |
2006 | 2769 |
2007 | 2769 |
2008 | 2769 |
2009 | 2769 |
2010 | 2769 |
2011 | 2769 |
2012 | 2769 |
2013 | 2769 |
2014 | 2769 |
2015 | 2769 |
2016 | 2769 |
2017 | 2769 |
2018 | 2769 |
2019 | 2770 |
2020 | 2770 |
2021 | 2771 |
This check requires that all tables for all years are read. I don't think reviewing consistency within one file will be sufficient.
But the code does just this. Let me take a closer look.
I pushed the code in the branch f-23-check-completeness. I added a check.R file in the scripts directory. It creates a cross table with the target year on the y axis and the source years on the x axis
Thanks, I'm reaching the same conclusions.
I pushed a version of the code that analyzes for all source and for all target years at once. Target years look ok, source years don't.
The script shows the municipality IDs that are missing. We should search them in the CSV data to understand what mutations these municipalities are involved in.
I saw, that one example is Biel, which is written once: Biel/Bienne and once Biel (BE)
That's a rename, but there are other problems if we search only for municipality ID. Biel could be an important hint, though.
(And perhaps really the source name should also be identical for all cases.)
when I only make a distinct() of the mun_id_x, I get the same results. So Biel undertakes a id change as well... -> nope it does not... only name change
This is consistent with the output of my script.
1029 Herrlisberg has been abolished end of 2004, but is still present in the 2005 source.
yea there are a couple of those
1105,Wilihof same.
Another off-by-one $%&#§.
what is now the criteria for definition of the year? -1 day + 1year?
maybe we check all the unique admission dates and try to define a rule to achieve the goal
We could also get rid of Herrlisberg and Wilihof first, and then see what remains.
do we need to make the same rule for admission date as well as abolition date?
So that we can say, that if a mercher takes place during the year, we set the abolition date to 31.12 and the admission date to 1.1 of the next year. I if we only change one date, that might cause the problem
I have pushed a tweak that looks promising, still one failure when running write_all_mapping_tables()
with Klosters in 2021.
sorry, I had the call and now have to go. will check in on Monday :).
At last!
This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.
We're looking only for internal consistency here -- within the mapping tables. Consistency with external data is important too, but a different issue.