A unique column names for "included" datasets

jwestw commented 3 years ago

@LucyGwilliamAdmin I am submitting a PR here, but would like you to check it before the merge. Please clone the code and run it. Happy to work with you on this, but needed someone to check my logic around this. Ultimately we need to be sure that the list that is printed on line 131 is the full list.

closes #16

LucyGwilliamAdmin commented 3 years ago

@jwestw I've started taking a look at this. From what I can see so far the df of included indicators (inc_df) has less indicators than UK-SDMX-tracker. It's hard to say how many for sure since output.csv is not in indicator id order.

Therefore it may actually be that the list of unique column names isn't complete

jwestw commented 3 years ago

Lucy and I have looked into this.

Conversation from Slack:

@LucyGwilliamAdmin said:

Ok - I think I've found the sources of discrepencies:

11.c.1 was existing in tracker but not in script output - script output is correct

only_uk_data should always equal FALSE if geo_disag is TRUE this is currently not the case in output

we also don't want to filter out geo_disagg==""

James: Let me just confirm my understanding.

Since the script output regarding 11.c.1 nothing needs to change in the script
"only_uk_data should always equal FALSE if geo_disag is TRUE this is currently not the case in output" => so something is wrong with the logic that makes the geo_disag col (check lines 73-82 in main.py of here). If the logic is not wrong there this leaves 2 more possiblities:
- the data is problematic e.g. some times a regional disagregation like 'Local Authority' is present, but it is meant to be UK only data
- the logic that checks for only_uk_data here is faulty.
  - Thanks. Good catch on that one. I will change the code.

LucyGwilliamAdmin commented 3 years ago

@jwestw

Yes re. 1 - no changes needed

I think re. 2, from my understanding this line is checking if the value in the national_geographical_coverage is listed in uk_terms (which are specifies in config):

df['only_uk_data'] = df.national_geographical_coverage.map(lambda x: x in uk_terms)

However only_uk_data should depend on two things:

value in the national_geographical_coverage is listed in uk_terms (as code is currently doing)
value in geo_disag column is FALSE

Both of those conditions need to met for only_uk_data to be TRUE

jwestw commented 3 years ago

Hey @LucyGwilliamAdmin You will see from my last commit I have applied some changes that should satisfy the requirements.

Since the logic is not only dependent on national_geographical_coverage series but also the geo_disag series, I have gone back to check all is well with that. Turns out there's a problem. I am expecting 74 records to have geo_disag == False, but I am getting 77.

So I ran some checks to find out why this is. Here we can see both "Country" and "Country group" are being captured when "Country" is being searched for.

I will sort this out with a bit of regex.

jwestw commented 3 years ago

Still looking into this. I am creating a disag.csv by putting all indicators which have disagregations (from the disagregation report) into a csv. Then using the spreadsheet software to search for each of the terms using AND logic:

I printing some text for a summary of what's happening.

The disagregation set is created by using .str.contains on the Disaggregations column with the regex pattern:
\bRegion\b|\bCountry\b|\bLocal Authority\b
The number of indicators found in the disagregation set is 77. 
The number of indicators with the same disaggregation terms in the spreadsheet is 74.

The indicators which are in the disaggregations but not in the spreadsheet are: 
{'10-b-1', '10-6-1', '16-8-1'}

jwestw commented 3 years ago

Just printing out the contents of the Disaggregation column for each of the indicators that are different.

Seems the regex is still not tight enough. Not sure why it's picking up "country" and "region" when the search pattern contains "Country" and "Region" (title case) and re.IGNORECASE isn't set.

for ind in {'10-b-1', '10-6-1', '16-8-1'}:
    #print the string in the Disaggregations column
    print(disag_df.Disaggregations.loc[ind])
    # Use Pandas to check if the regex search pattern is found in the entire Disaggregation series. 
    # Print only the row corresponding the indicator of interest, ind
    print(disag_df.Disaggregations.str.contains(geo_disag_terms, regex=True).loc[ind])
    # print the results of the re.search using the same regex pattern
    print(re.search(geo_disag_terms, disag_df.at[ind,"Disaggregations"]))

Output:

Recipient region, Recipient country
False
None
Developing country participation, Institution
False
None
Developing country participation, Institution
False
None

LucyGwilliamAdmin commented 3 years ago

@jwestw contains checks that at least a sub string of the column name matches the string, even if regex is used. You will need to check if the column name and geographical area are equal instead.

ONSdigital / sdg-SDMX-data-qualifier

A unique column names for "included" datasets #24