datagovuk / ckanext-dgu

CKAN extension for data.gov.uk
http://data.gov.uk/

Organogram - Missing senior CSV #562

Closed: davidread closed this issue 7 years ago

davidread commented 7 years ago

Forestry Commission Sept 2016 is missing the senior CSV.

[Screenshot: 2017-01-31 at 14:07:34]

https://data.gov.uk/dataset/organogram-forestry-commission

davidread commented 7 years ago

From the history, it looks like the senior CSV was written, and then the next write from Drupal, which added the "Organogram viewer", removed the senior CSV again. https://data.gov.uk/dataset/organogram-forestry-commission%402017-01-30T15%3A00%3A55.099937

davidread commented 7 years ago

This has happened again: https://data.gov.uk/dataset/organogram-united-kingdom-atomic-energy-authority @ratajczak do you know why this happens?

ratajczak commented 7 years ago

No, not yet. I investigated this but haven't found the cause. I'll keep looking.

davidread commented 7 years ago

Ok, let us know

ratajczak commented 7 years ago

This bug is fixed on production; it was related to a different data format in resources from legacy datasets. I've also fixed the two datasets listed here, but all organograms published between 11 Jan and 12 Feb on datasets with legacy resources are affected by the same issue. Could you please check whether any datasets match that condition?

davidread commented 7 years ago

Our CKAN expects dates to be DD/MM/YYYY, MM/YYYY or YYYY, which it then converts to YYYY-MM-DD, YYYY-MM or YYYY for storage in the database table. However, I see that we have often stored the version with slashes instead. Either way, the display code is forgiving and gets it right whichever way the date was stored.
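A rough sketch of the conversion described above, for illustration only. This is not the actual ckanext-dgu code: the function name `to_db_date` is hypothetical, and accepting single-digit days and months (e.g. 30/9/2016) is an assumption on top of the three stated forms.

```python
def to_db_date(value):
    """Convert DD/MM/YYYY, MM/YYYY or YYYY to YYYY-MM-DD, YYYY-MM or YYYY.

    Hypothetical helper, not taken from ckanext-dgu.
    """
    parts = value.strip().split("/")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        raise ValueError("unrecognised date: %r" % value)
    year = parts[-1]
    if len(year) != 4:
        raise ValueError("expected a four-digit year: %r" % value)
    # The parts before the year run day, month; reverse them and zero-pad
    # (the padding of single-digit parts is an assumption).
    rest = [p.zfill(2) for p in reversed(parts[:-1])]
    return "-".join([year] + rest)


print(to_db_date("30/9/2016"))
print(to_db_date("03/2014"))
print(to_db_date("2016"))
```

A value that is already hyphenated, such as 2014-03, raises `ValueError` here; a real implementation would probably pass it through unchanged.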

I think Drupal is using the action API to read and write datasets (/api/action/package_create) and you're using DD/MM/YYYY - is that correct?

What date are you tripping up on?

I don't think Drupal should care how resources express their dates, when it is not the source of those other resources.

ratajczak commented 7 years ago

The organogram code in Drupal expected the date format with slashes, which is what it writes to CKAN itself. It also handled the format with hyphens, but when looping through resources it didn't expect the M/YYYY format, e.g. https://data.gov.uk/api/3/action/package_show?id=organogram-forestry-commission This format is now handled too, so this issue should be fixed for good. I just wanted to check if there are any more datasets affected.
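To illustrate the kind of tolerant parsing this implies, here is a minimal Python sketch. It is not the Drupal code; `parse_resource_date` and both regular expressions are hypothetical. It accepts the slash forms (D/M/YYYY, M/YYYY), the hyphen forms (YYYY-MM-DD, YYYY-MM) and a bare year, so that resource dates can be compared however they were stored.

```python
import re

# Hypothetical patterns: day and month are optional, single digits allowed.
_SLASH = re.compile(r"^(?:(?:(\d{1,2})/)?(\d{1,2})/)?(\d{4})$")
_HYPHEN = re.compile(r"^(\d{4})(?:-(\d{1,2})(?:-(\d{1,2}))?)?$")


def parse_resource_date(value):
    """Return (year, month, day); month and day are None when absent."""
    value = value.strip()
    match = _SLASH.match(value)
    if match:
        day, month, year = match.groups()
    else:
        match = _HYPHEN.match(value)
        if not match:
            raise ValueError("unrecognised date: %r" % value)
        year, month, day = match.groups()
    return (
        int(year),
        int(month) if month else None,
        int(day) if day else None,
    )


# The same date compares equal whichever way it was written.
print(parse_resource_date("30/9/2016") == parse_resource_date("2016-09-30"))
print(parse_resource_date("3/2014") == parse_resource_date("2014-03"))
```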

davidread commented 7 years ago

Got you! These are the ones that published in that date range:

arts-and-humanities-research-council
attorney-generals-office
cabinet-office
care-quality-commission
department-for-communities-and-local-government
department-for-environment-food-and-rural-affairs
department-for-transport
economic-and-social-research-council
engineering-and-physical-sciences-research-council
environment-agency
forestry-commission
her-majestys-revenue-and-customs
hm-crown-prosecution-service-inspectorate
human-fertilisation-and-embryology-authority
independent-police-complaints-commission
joint-nature-conservation-committee
medical-research-council
national-army-museum
natural-england
nhs-blood-and-transplant
office-of-rail-and-road
royal-air-force-museum
rural-payments-agency
student-loans-company-limited
the-national-museum-of-the-royal-navy
the-northern-lighthouse-board
transport-focus
trinity-house-lighthouse-service
united-kingdom-atomic-energy-authority
united-kingdom-hydrographic-office
valuation-office-agency
water-services-regulation-authority

So I assume you can update those?

One more thing: I don't know why, but the update you did to Forestry Commission removed one of the legacy resources (the bottom-right one in the screenshot taken 2017-02-14 at 10:11:28) when it added the senior CSV. This is also shown in the diff of those changes that day:

Resource-0a76-description
  - Senior  staff data March 2014 (from legacy dataset)
  + Organogram - Senior CSV data

Resource-0a76-extras
  - {u'date': u'2014-03'}
  + {u'date': u'30/9/2016'}

Resource-0a76-url
  - http://data.defra.gov.uk/ops/forestry_commission_staff_data/Senior+staff+data+March+2014.csv
  + https://data.gov.uk/sites/default/files/organogram/forestry-commission/30/9/2016/300916%20Forestry%20Commission%20Organogram-senior.csv

It reuses the resource ID, which starts 0a76, which itself is not a problem, but clearly losing the legacy resource is something we need to avoid.
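One way to catch this kind of loss is to compare a dataset's resources before and after an update. The sketch below is only an illustration, not existing code: `lost_legacy_resources` is a hypothetical name, and it assumes resources are dicts with `description` and `url` keys and that legacy resources can be recognised by the "(from legacy dataset)" wording seen in the diff.

```python
LEGACY_MARKER = "(from legacy dataset)"  # wording taken from the diff above


def lost_legacy_resources(before, after):
    """Return legacy resources in `before` whose URL is absent from `after`.

    Hypothetical check; both arguments are lists of resource dicts.
    """
    remaining_urls = {res.get("url") for res in after}
    return [
        res
        for res in before
        if LEGACY_MARKER in (res.get("description") or "")
        and res.get("url") not in remaining_urls
    ]


before = [{
    "id": "0a76",
    "description": "Senior  staff data March 2014 (from legacy dataset)",
    "url": "http://data.defra.gov.uk/ops/forestry_commission_staff_data/Senior+staff+data+March+2014.csv",
}]
after = [{
    "id": "0a76",
    "description": "Organogram - Senior CSV data",
    "url": "https://data.gov.uk/sites/default/files/organogram/forestry-commission/30/9/2016/300916%20Forestry%20Commission%20Organogram-senior.csv",
}]

# The resource ID survives, but the legacy URL does not, so it is reported.
for res in lost_legacy_resources(before, after):
    print(res["description"])
```

Matching on URL rather than resource ID is deliberate here, since the ID was reused while the content behind it changed.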

ratajczak commented 7 years ago

I reviewed all of them; most had no legacy resources, so they were fine. This one had legacy resources but was not updated by Drupal, so it's also fine: https://data.gov.uk/dataset/organogram-department-for-environment-food-and-rural-affairs

This lost legacy resource issue should not happen again. I investigated this and found that the senior resource from 2016 was not missing; because of this error (which is now fixed), it overwrote the legacy resource.

Also, by the way, I found that this resource is marked as PDF but it's actually HTML. I thought you might want to know: https://data.gov.uk/dataset/organogram-trinity-house-lighthouse-service/resource/9239b8cc-d4ac-49a5-84fe-983240ee9201

davidread commented 7 years ago

Superb, thanks for all the work on this.