datagovuk / ckanext-dgu

CKAN extension for data.gov.uk
http://data.gov.uk/
34 stars 33 forks source link

Junior roles truncated #462

Closed davidread closed 8 years ago

davidread commented 8 years ago

Only a fraction of the junior roles are shown in the organogram view - many are missed out. I'm looking at ACAS September 2011 organogram, clicking on "Director of Delivery".

The old and new systems both have the number "693.09" (number of junior roles, FTE). This matches the spreadsheet - if I filter for the rows reporting to Senior Post 2, the sum of the "Number of Posts in FTE" column is 693.09. And the number of rows is 192, so I'd expect that number of boxes in the organogram. The old system's viz shows roughly that number, but the new system's viz shows only 11.

Old: screen shot 2016-08-11 at 16 31 29screen shot 2016-08-11 at 16 37 04 http://reference.data.gov.uk/gov-structure/organogram/?pubbod=advisory-conciliation-and-arbitration-service&version=2011-09-30

New:

screen shot 2016-08-11 at 16 31 57

https://test.data.gov.uk/organogram/advisory-conciliation-and-arbitration-service/2011-09-30

davidread commented 8 years ago

I've also seen this on another organogram:

BBSRC, September 2011 Junior roles for "DIRECTOR OF THE INSTITUTE OF FOOD RESEARCH (UEA EMPLOYEE)"

ratajczak commented 8 years ago

I think that they are separated issues, I've identified and fixed Biotechnology and Biological Sciences Research Council. This was down to bad generation of junior role 'id' in the hierarchy. 'id': post['Reporting Senior Post'] + "" + post['Grade'], so junior roles with the same grade reporting to the same senior got overridden. I've changed it to include job title: 'id': post['Reporting Senior Post'] + "" + post['Grade'] + "_" + post['Generic Job Title'],

I believe that ACAS issue is caused by etl script, the number of junior roles visible here: https://test.data.gov.uk/organogram/advisory-conciliation-and-arbitration-service/2011-09-30 is coming from csv generated by the script: http://test.data.gov.uk/organogram-ajax/preview/data/advisory_conciliation_and_arbitration_service-2011-09-30-organogram-junior.csv

This is an old csv file because the XLS file used to generate it: co-prod2.dh.bytemark.co.uk:/home/co/organogram-data/data/dgu/xls-from-triplestore/advisory_conciliation_and_arbitration_service-2011-09-30-organogram.xls now fails validation and couldn't be uploaded any more so we can't be sure if we still get this truncated csv out of the script.

davidread commented 8 years ago

I'm not convinced that generating an id from the row data is going to always work. For example, that same ACAS Excel sheet has all these that only vary by Unit:

screen shot 2016-08-22 at 15 27 09 screen shot 2016-08-22 at 15 38 49

I imagine it is also valid to have identical rows.

Can you simply assign an ID using the row number?

ACAS is now showing 76 junior roles, which will increase when this is fixed.

I believe the XLS does indeed validate:

(ckan)co@precise64:/src/organograms$ ./etl_to_csv.py data/dgu/xls-from-triplestore/advisory_conciliation_and_arbitration_service-2011-09-30-organogram.xls /tmp/ --date-from-filename
Loading data/dgu/xls-from-triplestore/advisory_conciliation_and_arbitration_service-2011-09-30-organogram.xls
Writing /tmp/advisory_conciliation_and_arbitration_service-2011-09-30-organogram-senior.csv
Writing /tmp/advisory_conciliation_and_arbitration_service-2011-09-30-organogram-junior.csv
Writing index file: /tmp/index.json
Done.
(ckan)co@precise64:/src/organograms$ grep '"2","Grade' /tmp/advisory_conciliation_and_arbitration_service-2011-09-30-organogram-junior.csv |wc -l
192
ratajczak commented 8 years ago

Well done, looks like this was it. Fixed on prod2 by using row number as id. I think that it works like it should, could you please take a look?

davidread commented 8 years ago

Yes, all 192 are there now :)