cynkra / munch

Functions for working with the historicized list of communes of Switzerland.
https://munch.cynkra.com
6 stars 2 forks source link

Check completeness of mapping tables #23

Closed krlmlr closed 3 years ago

krlmlr commented 3 years ago

We're looking only for internal consistency here -- within the mapping tables. Consistency with external data is important too, but a different issue.

ThomasKnecht commented 3 years ago

what do you mean with point one: Set of municipalities identical for all source years? Don't they change when a merger happens, so they are not identical over the years...?

TSchiefer commented 3 years ago

I think the task is to check if all municipalities that exist in a given year properly appear in our mapping tables, either as a target or as a source municipality, or both. Please correct me, if this is a wrong assumption.

ThomasKnecht commented 3 years ago

I see. Yea I just thought the same.

krlmlr commented 3 years ago

I tweaked the original description.

ThomasKnecht commented 3 years ago

there might be a problem with the source municipalities.

Assumption: for each target year, the source municipalities should be the same. e.g a distinct() of all source municipalities of the source year 2005 should be the same for all target years. Thus the length of the distinct(mun_id_x, mun_id_y) of a certain source year should be the same over all target years. (reprex does not work since the files can't be loaded)

library(dplyr)
library(purrr)
library(here)

read_mun_csv <- function(file){
  target_year <- sub(".csv", "", basename(file))

  df <- read.csv(file) %>% 
    mutate(target_year = target_year)

  return(df)
}

all_files <- list.files(here::here("inst/csv/flat"))

file_paths <- paste0(here::here("inst/csv/flat/"), all_files)

all_files <- file_paths %>% map_df(~read_mun_csv(.))

data_source_year <- all_files %>% 
  filter(year == 2005)

length_target_year <- data_source_year %>% 
  filter(target_year == 2005) %>% 
  distinct(mun_id_x, mun_id_y) %>% 
  nrow()

source_year_data <- data_source_year %>% 
  distinct(target_year, mun_id_x, short_name_x) %>% 
  group_by(target_year) %>% 
  summarize(count_2005 = n())

output:

target_year count_distinct_mun_source_year_2005
2005 2781
2006 2769
2007 2769
2008 2769
2009 2769
2010 2769
2011 2769
2012 2769
2013 2769
2014 2769
2015 2769
2016 2769
2017 2769
2018 2769
2019 2770
2020 2770
2021 2771
krlmlr commented 3 years ago

This check requires that all tables for all years are read. I don't think reviewing consistency within one file will be sufficient.

krlmlr commented 3 years ago

But the code does just this. Let me take a closer look.

ThomasKnecht commented 3 years ago

I pushed the code in the branch f-23-check-completeness. I added a check.R file in the scripts directory. It creates a cross table with the target year on the y axis and the source years on the x axis

ThomasKnecht commented 3 years ago

https://github.com/cynkra/SwissCommunes/blob/f-23-check-completeness/script/check.R

krlmlr commented 3 years ago

Thanks, I'm reaching the same conclusions.

krlmlr commented 3 years ago

I pushed a version of the code that analyzes for all source and for all target years at once. Target years look ok, source years don't.

krlmlr commented 3 years ago

The script shows the municipality IDs that are missing. We should search them in the CSV data to understand what mutations these municipalities are involved in.

ThomasKnecht commented 3 years ago

I saw, that one example is Biel, which is written once: Biel/Bienne and once Biel (BE)

krlmlr commented 3 years ago

That's a rename, but there are other problems if we search only for municipality ID. Biel could be an important hint, though.

krlmlr commented 3 years ago

(And perhaps really the source name should also be identical for all cases.)

ThomasKnecht commented 3 years ago

when I only make a distinct() of the mun_id_x, I get the same results. So Biel undertakes a id change as well... -> nope it does not... only name change

krlmlr commented 3 years ago

This is consistent with the output of my script.

krlmlr commented 3 years ago

1029 Herrlisberg has been abolished end of 2004, but is still present in the 2005 source.

ThomasKnecht commented 3 years ago

yea there are a couple of those

krlmlr commented 3 years ago

1105,Wilihof same.

krlmlr commented 3 years ago

Another off-by-one $%&#§.

ThomasKnecht commented 3 years ago

what is now the criteria for definition of the year? -1 day + 1year?

ThomasKnecht commented 3 years ago

maybe we check all the unique admission dates and try to define a rule to achieve the goal

krlmlr commented 3 years ago

We could also get rid of Herrlisberg and Wilihof first, and then see what remains.

ThomasKnecht commented 3 years ago

do we need to make the same rule for admission date as well as abolition date?

So that we can say, that if a mercher takes place during the year, we set the abolition date to 31.12 and the admission date to 1.1 of the next year. I if we only change one date, that might cause the problem

krlmlr commented 3 years ago

I have pushed a tweak that looks promising, still one failure when running write_all_mapping_tables() with Klosters in 2021.

ThomasKnecht commented 3 years ago

sorry, I had the call and now have to go. will check in on Monday :).

krlmlr commented 3 years ago

At last!

github-actions[bot] commented 2 years ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.