catalyst-cooperative / ferc-xbrl-extractor

A tool for converting FERC filings published in XBRL into SQLite databases
MIT License

Proportion of lost facts in tests seems non-deterministic #138

Closed: zaneselvans closed this issue 9 months ago

zaneselvans commented 9 months ago

We had a slew of dependency updates this week, which meant a bunch of CI tests got run, and many of them are failing, but in slightly different ways. It looks like the proportion of facts that get lost isn't the same on every run. Are we doing some random sampling that might get different answers on different runs? If so, maybe we want to select a fixed subset to test on, or set the random seed to a constant value?
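If there is sampling somewhere in the test fixtures, pinning the RNG would be a low-effort fix. A minimal sketch, assuming the sampling uses Python's `random` module (the `sample_filings` helper is hypothetical, not something from this repo):

```python
import random

def sample_filings(filings, k, seed=42):
    """Pick k filings to test, deterministically across CI runs."""
    rng = random.Random(seed)              # fixed seed -> same subset every run
    return rng.sample(sorted(filings), k)  # sort first so input order can't leak in

filings = ["form2_2021_a", "form2_2021_b", "form6_2021_a", "form6_2021_b"]
# Two calls with the same seed always agree, which is the property CI needs.
assert sample_filings(filings, 2) == sample_filings(filings, 2)
```

Hard-coding a fixed subset of filings by name would work just as well; the point is only that every CI run has to see identical inputs.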

Of the 24 runs (pull request + push for each of Python 3.10 & 3.11, across 6 dependabot PRs) there were the following 4 failures. So like a 17% failure rate?

example 1

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...s', 'c-00:compressor_station_equipment_gas_transmission_plant', 'c-1044:compressor_hours_of_operation_during_year'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())

        used_fact_ratio = total_used_facts / total_facts

        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.9854457831325302 > 0.99)

example 2

 _______________________ test_lost_facts_pct[form6_2021] ________________________

extracted = ExtractOutput(table_defs={'annual_corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.Fact...atio_for_long_term_debt_rate_of_return', 'iffa57d4b6aa645f68d9de3452bd9d47d_D20210101-20211231:schedule_exemption'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form6_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())

        used_fact_ratio = total_used_facts / total_facts

        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
E           assert (0.8165926212091945 > 0.9)

example 3

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...1:disposition_of_excess_gas', 'c-910:other_utility_operating_income_associated_with_taxes_other_than_income_taxes'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())

        used_fact_ratio = total_used_facts / total_facts

        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.972722891566265 > 0.99)

example 4

And then in just this one case, the generated datapackage fails validation because its resources list is empty?

        if datapackage_path:
            # Verify that datapackage descriptor is valid before outputting
            frictionless_package = Package(descriptor=datapackage.dict(by_alias=True))
            if not frictionless_package.metadata_valid:
>               raise RuntimeError(
                    f"Generated datapackage is invalid - {frictionless_package.metadata_errors}"
                )
E               RuntimeError: Generated datapackage is invalid - [{'code': 'package-error',
E                'description': 'A validation cannot be processed.',
E                'message': 'The data package has an error: "[] is too short" at "resources" '
E                           'in metadata and at "properties/resources/minItems" in profile',
E                'name': 'Package Error',
E                'note': '"[] is too short" at "resources" in metadata and at '
E                        '"properties/resources/minItems" in profile',
E                'tags': []}]
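For what it's worth, the `"[] is too short" at "resources"` note is the Data Package profile's JSON Schema constraint `minItems: 1` on `resources`, i.e. the descriptor was emitted with zero resources. A stdlib-only sketch of that check (the descriptor and `validate_resources` helper are made up for illustration; frictionless does the real validation against the profile):

```python
def validate_resources(descriptor):
    """Mimic the profile's minItems: 1 constraint on 'resources'."""
    resources = descriptor.get("resources", [])
    if len(resources) < 1:
        return ['"[] is too short" at "resources" (minItems: 1)']
    return []

empty = {"name": "ferc2-extracted", "resources": []}
ok = {"name": "ferc2-extracted", "resources": [{"name": "t", "path": "t.csv"}]}
assert validate_resources(empty)        # fails validation, like the CI run
assert not validate_resources(ok)       # any non-empty resources list passes
```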

And also a bunch of errors when trying to access the taxonomy files?

2023-09-13 22:14:15,948 [webCache:retrievalError] Forbidden retrieving https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd

2023-09-13 22:14:15,949 [FileNotLoadable] File can not be loaded: https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd - https://ecollection.ferc.gov/taxonomy/form714/2022-01-01/form/form714/form-714_2022-01-01.xsd
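The Forbidden responses could just be ecollection.ferc.gov intermittently refusing requests; if so, retrying the taxonomy fetch a couple of times would paper over it. A hedged sketch under that assumption (the `fetch` callable is hypothetical; Arelle's web cache has its own retrieval machinery):

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url), retrying a few times on transient failures."""
    last_exc = None
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError as exc:         # e.g. an HTTP 403 surfaced as an error
            last_exc = exc
            time.sleep(delay)          # back off briefly before retrying
    raise last_exc                     # out of attempts: re-raise the last error

# Fake fetch that fails twice, then succeeds, to exercise the wrapper.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("403 Forbidden")
    return b"<schema/>"

assert fetch_with_retries(flaky, "https://example.com/x.xsd") == b"<schema/>"
```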
zaneselvans commented 9 months ago

Oh wait, no: a couple of runs had double failures, so it was more like a 25% failure rate:

example 5

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'list_of_schedules_002_duration': <ferc_xbrl_extractor.datapackage.FactTable object at 0x7f8...eferred_expense_account_charged', 'c-334:volume_not_collected_gas_lost_and_unaccounted_for', 'c-757:taxes_accrued'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())

        used_fact_ratio = total_used_facts / total_facts

        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.9869397590361446 > 0.99)

example 6

_______________________ test_lost_facts_pct[form2_2021] ________________________

extracted = ExtractOutput(table_defs={'corporate_officer_certification_001_duration': <ferc_xbrl_extractor.datapackage.FactTable o...ive_income_loss', 'c-776:taxes_accrued', 'c-93:acquired_to_meet_deficiency', 'cr-00196:utility_operating_expenses'}})})
request = <FixtureRequest for <Function test_lost_facts_pct[form2_2021]>>

    def test_lost_facts_pct(extracted, request):
        table_defs, instances, table_data, stats = extracted
        total_facts = sum(len(i.fact_id_counts) for i in instances)
        total_used_facts = sum(len(f_ids) for f_ids in stats["fact_ids"].values())

        used_fact_ratio = total_used_facts / total_facts

        if "form6_" in request.node.name:
            # We have unallocated data for Form 6 for some reason.
            total_threshold = 0.9
            per_filing_threshold = 0.8
            # Assert that this is < 0.95 so we remember to fix this test once we
            # fix the bug. We don't use xfail here because the parametrization is
            # at the *fixture* level, and only the lost facts tests should fail
            # for form 6.
            assert used_fact_ratio > total_threshold and used_fact_ratio <= 0.95
        else:
            total_threshold = 0.99
            per_filing_threshold = 0.95
>           assert used_fact_ratio > total_threshold and used_fact_ratio <= 1
E           assert (0.9417349397590361 > 0.99)
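For the record, both failure rates quoted in the thread check out against the 24 runs (quick arithmetic only, not code from the repo):

```python
total_runs = 2 * 2 * 6                     # (PR + push) x Python {3.10, 3.11} x 6 dependabot PRs
assert total_runs == 24
assert round(4 / total_runs * 100) == 17   # first tally: 4 failing runs -> ~17%
assert round(6 / total_runs * 100) == 25   # corrected tally: 6 failures -> 25%
```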