OpenDataServices / flatten-tool

Tools for generating CSV and other flat versions of the structured data
http://flatten-tool.readthedocs.io/en/latest/
MIT License

Unflatten is silently dropping all but last index of supplier array from UK government procurement CSVs #384

Open poulson opened 3 years ago

poulson commented 3 years ago

2021.04.20.csv

Despite the clear existence of 31 separate suppliers (including Palantir Technologies) in this single-row input CSV (extracted from yesterday's export of UK government procurement contracts), unflatten is only preserving the last supplier (Workday).

I have been unflattening using a command of the form

flatten-tool unflatten input-dir --root-id=ocid --root-is-list --input-format csv --encoding ascii --output-name unflattened.json

The relevant -- and incomplete -- portion of the output is:

                "awards": [
                    {
                        "id": "6df7e3ce-54f4-4151-a3ed-0dfc7aead845",
                        "description": "See description of related tender",
                        "status": "active",
                        "date": "2021-03-26T00:00:00Z",
                        "value": {
                            "amount": "1200000000.0",
                            "currency": "GBP"
                        },
                        "suppliers": [
                            {
                                "id": "0.0",
                                "identifier": {
                                    "scheme": "GB-COH",
                                    "id": "521013.0"
                                },
                                "name": "WORKDAY LIMITED",
                                "address": {
                                    "streetAddress": "THE KING'S BUILDING,MAY LANE\nDUBLIN 7\nIE"
                                },
                                "x_awardValue": {
                                    "currency": "GBP"
                                },
                                "sme": "1.0"
                            }
                        ],
                        "contractPeriod": {
                            "startDate": "2021-04-06T00:00:00Z",
                            "endDate": "2024-12-09T00:00:00Z"
                        }
                    }
                ]

While I understand that a schema would be of use, I don't understand after reading https://flatten-tool.readthedocs.io/en/latest/unflatten/ why most of the supplier columns are being entirely ignored. I am therefore posting here because this looks like a bug in unflatten.

jpmckinney commented 3 years ago

I believe the issue is that the UK assigns the same id to all awards, which the unflatten routine ends up merging into a single row. I believe the command outputs a warning about this, but I could be wrong.

I think if you delete the id values, then the command will work as expected.
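
To illustrate the kind of merging described above, here is a toy model in Python. It is not flatten-tool's actual code; `merge_by_id` is a hypothetical helper that assumes array items sharing an id are folded into one object, with later values overwriting earlier ones.

```python
def merge_by_id(items):
    """Toy model: fold items that share an "id" into a single object.

    Hypothetical helper for illustration only, not flatten-tool's code.
    Later items overwrite the fields of earlier ones with the same id.
    """
    merged = {}
    for item in items:
        merged.setdefault(item["id"], {}).update(item)
    return list(merged.values())


# Two suppliers that were given the same id, as in the report above.
suppliers = [
    {"id": "0.0", "name": "Palantir Technologies"},
    {"id": "0.0", "name": "WORKDAY LIMITED"},
]
result = merge_by_id(suppliers)
print(result)
```

Under that assumption, only the fields of the last supplier with a given id survive, which would match the single WORKDAY LIMITED entry in the output.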

poulson commented 3 years ago

When I run it from the command line I don't see any warnings, but I will look into preprocessing out the id values. I appreciate the tip.

poulson commented 3 years ago

FWIW, I have confirmed that dropping the suppliers/id columns before unflattening fixes the problem. I agree that one would hope a warning would have been printed about this and appreciate the help.
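
For anyone hitting the same thing, the preprocessing can be sketched roughly as follows. This assumes the supplier id columns have headers shaped like `awards/0/suppliers/3/id`; the real export's headers may differ, and `drop_supplier_ids` is just a name made up for this example. The sample row is invented, not taken from the attached CSV.

```python
import csv
import io
import re

# Assumed header shape: paths ending in "suppliers/<index>/id".
# This does not match "suppliers/<index>/identifier/id".
SUPPLIER_ID = re.compile(r"(^|/)suppliers/\d+/id$")


def drop_supplier_ids(src, dst):
    """Copy a CSV from src to dst, omitting the supplier id columns."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Indices of the columns to keep.
    keep = [i for i, name in enumerate(header) if not SUPPLIER_ID.search(name)]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep if i < len(row)])


# Invented two-supplier sample for illustration.
sample = (
    "ocid,awards/0/suppliers/0/id,awards/0/suppliers/0/name,"
    "awards/0/suppliers/1/id,awards/0/suppliers/1/name\n"
    "ocds-x,0,PALANTIR,0,WORKDAY LIMITED\n"
)
out = io.StringIO()
drop_supplier_ids(io.StringIO(sample), out)
print(out.getvalue())
```

With real files, the same function can be called on handles opened with `open(path, newline="")`, and the cleaned CSV then passed to the unflatten command above.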