catalyst-cooperative / ferc-xbrl-extractor

A tool for converting FERC filings published in XBRL into SQLite databases
MIT License

Datapackage descriptors annotating SQLite DBs are invalid #153

Open zaneselvans opened 9 months ago

zaneselvans commented 9 months ago

The datapackage descriptors we are currently generating to annotate the SQLite DBs derived from XBRL data are not valid. For example, in the ferc-xbrl-extractor environment, running this command:

frictionless validate ferc714_xbrl_datapackage.json

results in a bunch of errors like:

# -------
# invalid: sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite
# -------

## Summary

+-----------------------------+--------------------------------------------------------------------------+
| Description                 | Size/Name/Count                                                          |
+=============================+==========================================================================+
| File name (Not Found)       | sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite |
+-----------------------------+--------------------------------------------------------------------------+
| File size                   | N/A                                                                      |
+-----------------------------+--------------------------------------------------------------------------+
| Total Time Taken (sec)      | 0.002                                                                    |
+-----------------------------+--------------------------------------------------------------------------+
| Total Errors                | 1                                                                        |
+-----------------------------+--------------------------------------------------------------------------+
| Scheme Error (scheme-error) | 1                                                                        |
+-----------------------------+--------------------------------------------------------------------------+

## Errors

+-------+---------+---------+---------------------------------------------------+
| row   | field   | code    | message                                           |
+=======+=========+=========+===================================================+
|       |         | scheme- | The data source could not be successfully loaded: |
|       |         | error   | cannot create loader "". Try installing           |
|       |         |         | "frictionless-"                                   |
+-------+---------+---------+---------------------------------------------------+
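
(For reference, the same validation can be run from Python; a minimal sketch, assuming frictionless v4's top-level validate function and Report layout:)

# Minimal sketch (assumes frictionless v4): run the same validation from
# Python and print each task's errors.
from frictionless import validate

report = validate("ferc714_xbrl_datapackage.json")
print(report.valid)  # False
for task in report.tasks:
    for error in task.errors:
        print(error.code, error.message)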

Or, if we try to validate a single resource and return the errors in JSON form:

frictionless validate --json --resource-name planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant ferc714_xbrl_datapackage.json
{
  "version": "4.40.11",
  "time": 0.001,
  "errors": [],
  "tasks": [
    {
      "resource": {
        "path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
        "profile": "tabular-data-resource",
        "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
        "dialect": {
          "table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
        },
        "title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
        "description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
        "format": "sqlite",
        "mediatype": "application/vnd.sqlite3",
        "schema": {
          "fields": [
            {
              "name": "entity_id",
              "title": "Entity Identifier",
              "type": "string",
              "format": "default",
              "description": "Unique identifier of respondent"
            },
            {
              "name": "filing_name",
              "title": "Filing Name",
              "type": "string",
              "format": "default",
              "description": "Name of filing"
            },
            {
              "name": "date",
              "title": "Instant Date",
              "type": "date",
              "format": "default",
              "description": "Date of instant period"
            },
            {
              "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
              "title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
              "type": "string",
              "format": "default",
              "description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
            }
          ],
          "primary_key": [
            "entity_id",
            "filing_name",
            "date",
            "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
          ]
        },
        "scheme": "",
        "hashing": "md5",
        "stats": {
          "hash": "",
          "bytes": 0
        }
      },
      "time": 0.001,
      "scope": [],
      "partial": false,
      "errors": [
        {
          "code": "scheme-error",
          "name": "Scheme Error",
          "tags": [],
          "note": "cannot create loader \"\". Try installing \"frictionless-\"",
          "message": "The data source could not be successfully loaded: cannot create loader \"\". Try installing \"frictionless-\"",
          "description": "Data reading error because of incorrect scheme."
        }
      ],
      "stats": {
        "errors": 1
      },
      "valid": false
    }
  ],
  "stats": {
    "errors": 1,
    "tasks": 1
  },
  "valid": false
}

The problem?

I think the issue here is that we are using v4 of the frictionless package, and the ability to annotate SQLite DBs was only introduced in v5. Looking at the datapackage.json file, I see that the dialect field is invalid. In frictionless v5, the dialect would need to contain a sql key and point at the table within that sql dictionary; in previous versions, the dialect describes the CSV dialect of the file that the path element points at. See this example of a data package resource annotating an SQLite DB (a sketch of the v5-style dialect follows it):

{
    "path": "sqlite:////Users/zane/code/catalyst/pudl-work/output/ferc714_xbrl.sqlite",
    "profile": "tabular-data-resource",
    "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant",
    "dialect": {
        "table": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_instant"
    },
    "title": "03.2 - Schedule - Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load, Table - instant",
    "description": "ferc:SchedulePlanningAreaHourlyDemandAndForecastSummerAndWinterPeakDemandAndAnnualNetEnergyForLoadBAbstract",
    "format": "sqlite",
    "mediatype": "application/vnd.sqlite3",
    "schema": {
        "fields": [
            {
                "name": "entity_id",
                "title": "Entity Identifier",
                "type": "string",
                "format": "default",
                "description": "Unique identifier of respondent"
            },
            {
                "name": "filing_name",
                "title": "Filing Name",
                "type": "string",
                "format": "default",
                "description": "Name of filing"
            },
            {
                "name": "date",
                "title": "Instant Date",
                "type": "date",
                "format": "default",
                "description": "Date of instant period"
            },
            {
                "name": "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis",
                "title": "Planning Area Hourly Demand and Forecast Summer and Winter Peak Demand and Annual Net Energy for Load [Axis]",
                "type": "string",
                "format": "default",
                "description": "Typed dimension used to distinguish a set of related facts about planning area hourly demand and forecast summer and winter peak demand and annual net energy for load."
            }
        ],
        "primary_key": [
            "entity_id",
            "filing_name",
            "date",
            "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_axis"
        ]
    }
}
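
For contrast, here is a sketch of the v5-style dialect described above, where the table name nests inside a sql dictionary within dialect (a Python dict standing in for the JSON fragment; the relative path shown is an assumption, discussed below):

# Hypothetical v5-style resource fragment; only the dialect shape is taken
# from the discussion above, and the relative path is an assumption.
table = (
    "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand"
    "_and_annual_net_energy_for_load_table_03_2_instant"
)
resource_fragment = {
    "path": "ferc714_xbrl.sqlite",  # relative sibling of the descriptor
    "format": "sqlite",
    "mediatype": "application/vnd.sqlite3",
    "dialect": {"sql": {"table": table}},
}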

Frictionless v4 can't interpret a sqlite:// URL as a path

As it is, the system sees the sqlite:// URL in the path and has no idea how to interpret it to find the data.
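
For what it's worth, v4 can read SQL data when the connection URL and table name are supplied programmatically; a sketch, assuming frictionless v4's sql plugin and a database file in the working directory:

# Sketch (assumes frictionless v4's plugins.sql module): the table is named
# via SqlDialect in code rather than discovered from a sqlite:// URL in a
# descriptor's "path".
from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

table = (
    "planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand"
    "_and_annual_net_energy_for_load_table_03_2_instant"
)
resource = Resource("sqlite:///ferc714_xbrl.sqlite", dialect=SqlDialect(table=table))
print(resource.read_rows()[:3])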

The sqlite:// path must be relative, not absolute

In addition to the sqlite:// URL being uninterpretable as a path at all, the URL uses an absolute path rather than a relative path, which is invalid for two reasons. First, it violates the frictionless data resource specification, which says:

A “url-or-path” is a string with the following additional constraints:

and goes on to require that POSIX paths be relative siblings or children of the descriptor, with absolute paths explicitly disallowed.

Second, the absolute path is simply wrong if you download our nightly build outputs, since the path to which the descriptor and databases were written on the build server has no meaning on the user's machine:

sqlite:////home/catalyst/pudl_work/output/ferc1_xbrl.sqlite
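
One hypothetical clean-up, sketched below, would be to post-process the descriptor so each resource's path becomes a relative sibling of the descriptor (the file names and sibling layout here are assumptions):

# Hypothetical post-processing sketch: strip the sqlite:// scheme and the
# absolute build-server directory, leaving a path relative to the descriptor.
import json
from pathlib import Path

descriptor_path = Path("ferc1_xbrl_datapackage.json")
descriptor = json.loads(descriptor_path.read_text())
for resource in descriptor["resources"]:
    # "sqlite:////home/catalyst/pudl_work/output/ferc1_xbrl.sqlite"
    # becomes "ferc1_xbrl.sqlite".
    resource["path"] = Path(resource["path"].removeprefix("sqlite:///")).name
descriptor_path.write_text(json.dumps(descriptor, indent=4))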

What to do?

zschira commented 8 months ago

@zaneselvans thanks for investigating this. We do have an integration test that is supposed to check for valid datapackages, but clearly it needs to be overhauled. We currently only test the Ferc1 datapackage, and it uses the Package property metadata_valid to check for validity, which is maybe insufficient?

zaneselvans commented 8 months ago

IIRC that check will probably only look at the Package-level metadata, and not recurse down into any of the resources which make up the package (which confused the heck out of me when I was first working with the datapackage validations).
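
A more thorough test might combine both checks; a sketch assuming frictionless v4, where metadata_valid covers only the descriptor while a full validation run also exercises each resource:

# Sketch (assumes frictionless v4): metadata_valid inspects descriptor
# metadata only; validate() recurses into every resource, as the CLI does.
from frictionless import Package, validate

package = Package("ferc1_xbrl_datapackage.json")
assert package.metadata_valid                     # descriptor metadata only
report = validate("ferc1_xbrl_datapackage.json")  # validates every resource
assert report.valid, report.flatten(["code", "message"])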