catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Nightly Build Failure 2024-03-07 #3449

Closed zaneselvans closed 6 months ago

zaneselvans commented 6 months ago

Overview

This seems to be another iteration of the failure from two days ago in #3441, stemming from Arelle having trouble with a cached file that it downloads from xbrl.org.

This issue may also be related to the problems with the XBRL archiver.

The first questionable errors that show up in the logs seem to be:

```
2024-03-07 06:12:39,428 [webCache:cacheDownloadRenamingError] [Errno 2] No such file or directory: '/home/mambauser/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd.tmp' -> '/home/mambauser/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd'
Unsuccessful renaming of downloaded file to active file /home/mambauser/.config/arelle/cache/http/www.xbrl.org/2003/xbrl-linkbase-2003-12-31.xsd
Please remove with file manager. -

2024-03-07 06:12:39,429 [IOerror] sched-234_2022-01-01_def.xml: file error: [Errno 2] No such file or directory - ../../schedules/ScheduleAccumulatedDeferredIncomeTaxes/sched-234_2022-01-01.xsd 4
```

Next steps

Verify that everything is fixed!

Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.

- [ ] [S3 distribution bucket](https://s3.console.aws.amazon.com/s3/buckets/pudl.catalyst.coop?region=us-west-2&bucketType=general&prefix=nightly/&showversions=false) was updated at the expected time
- [ ] [GCP distribution bucket](https://console.cloud.google.com/storage/browser/pudl.catalyst.coop/nightly;tab=objects?project=catalyst-cooperative-pudl) was updated at the expected time
- [ ] [GCP internal bucket](https://console.cloud.google.com/storage/browser/builds.catalyst.coop) was updated at the expected time
- [ ] [Datasette PUDL version](https://data.catalyst.coop/pudl/core_pudl__codes_datasources) points at the same hash as [nightly](https://github.com/catalyst-cooperative/pudl/tree/nightly)
- [ ] [Zenodo sandbox record](https://sandbox.zenodo.org/doi/10.5072/zenodo.5563) was updated to the record number in the logs (search for `zenodo_data_release.py` and `Draft` in the logs, to see what the new record number should be!)

Relevant logs

[link to build logs from internal distribution bucket]( PLEASE FIND THE ACTUAL LINK AND FILL IN HERE )

zaneselvans commented 6 months ago

@jdangerx Should we say that this seems to have fixed itself for the moment?

jdangerx commented 6 months ago

I've been able to reproduce this fairly consistently with the following snippet:

```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

from arelle import Cntlr, ModelManager, ModelXbrl, WebCache


def load_tax(_i):
    # Each subprocess gets its own controller, but they all share the
    # same on-disk web cache.
    cntlr = Cntlr.Cntlr()
    model_manager = ModelManager.initialize(cntlr)
    taxonomy_url = "https://eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60/form-60_2022-01-01.xsd"
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
    return 1


if __name__ == "__main__":
    # Start from a cold cache so every subprocess has to download.
    cntlr = Cntlr.Cntlr()
    cache = WebCache.WebCache(cntlr, None)
    cache.clear()
    with ProcessPoolExecutor(max_workers=10, mp_context=mp.get_context("fork")) as executor:
        taxonomies = [t for t in executor.map(load_tax, range(5))]
```

The issue, I think, is that we split `ferc_to_sqlite` into form-specific ops. That works fine when you aren't using subprocesses, but once multiple subprocesses try to write to the cache at the same time, we run into a race condition where two processes execute this code concurrently:

```python
        if reload or not filepathExists:
            return filepath if self._downloadFile(url, filepath) else None
```

P1 and P2 both see that `not filepathExists`; then P1 downloads the file successfully, while P2 also tries to download it but runs into:

```
FileExistsError: [Errno 17] File exists: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60'
```
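For context, a download-to-cache step can be made tolerant of concurrent writers. This is a stdlib sketch, not Arelle's actual code: the `cache_file_safely` helper and its signature are hypothetical. The two ingredients are `makedirs(..., exist_ok=True)` (so two processes creating the same directory don't collide) and writing to a per-process temp file that is published with an atomic `os.replace`:

```python
import os
import tempfile


def cache_file_safely(cache_path: str, data: bytes) -> None:
    """Write data to cache_path so that concurrent writers don't clash.

    - makedirs(exist_ok=True) avoids FileExistsError when two processes
      create the same directory at the same time.
    - mkstemp gives each writer its own temp file, and os.replace() moves
      it into place atomically, so readers never see a partial file and a
      losing writer just overwrites with identical content.
    """
    cache_dir = os.path.dirname(cache_path) or "."
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, cache_path)  # atomic; no shared .tmp -> final race
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

This is roughly what Arelle's `.tmp`-then-rename is approximating; the failure mode above comes from multiple processes sharing a single predictable `.tmp` path and directory-creation step.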

I think what we need to do is warm the cache: add an op that fetches all the taxonomies ahead of time, and make all the actual extraction ops depend on the warmed cache. Then `filepathExists` will always be `True` in every worker process, and we avoid the race condition.
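A minimal stdlib sketch of that warming pattern (the `fetch`/`cached_path` functions and `CACHE_DIR` layout are hypothetical stand-ins for Arelle's `WebCache`, and threads stand in for the forked subprocesses):

```python
import os
from concurrent.futures import ThreadPoolExecutor

CACHE_DIR = "arelle_cache_demo"  # stand-in for ~/.config/arelle/cache


def cached_path(url: str) -> str:
    # Map a URL to a local cache path (greatly simplified).
    return os.path.join(CACHE_DIR, url.replace("https://", "").replace("/", "_"))


def fetch(url: str) -> str:
    """Download url into the cache unless it's already there."""
    path = cached_path(url)
    if not os.path.exists(path):  # the racy check when run concurrently
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "w") as f:
            f.write(f"contents of {url}")  # pretend download
    return path


def warm_cache(urls):
    """Serially pre-fetch every taxonomy before any workers start."""
    for url in urls:
        fetch(url)


def extract(url: str) -> str:
    # By the time workers run, fetch() always takes the cache-hit branch,
    # so concurrent workers never race on the download/rename step.
    with open(fetch(url)) as f:
        return f.read()


if __name__ == "__main__":
    taxonomies = [
        "https://example.com/form60.xsd",
        "https://example.com/form1.xsd",
    ]
    warm_cache(taxonomies)  # the "warm the cache" op runs first, alone
    with ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(extract, taxonomies * 3))
```

The key design point is that the only code path that writes to the cache runs serially, before any parallelism starts; the parallel ops are read-only with respect to the cache.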

jdangerx commented 6 months ago

This is the same problem as catalyst-cooperative/pudl-archiver#285, but we need to solve it separately because we're not operating in a Dagster environment. More thoughts there.

zaneselvans commented 6 months ago

@jdangerx I think this has been fixed with the new version of the extractor, so I'm closing.