Closed zaneselvans closed 6 months ago
@jdangerx Should we say that this seems to have fixed itself for the moment?
I've been able to reproduce this fairly consistently with the following snippet:
```python
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

from arelle import Cntlr, ModelManager, ModelXbrl, WebCache


def load_tax(_i):
    cntlr = Cntlr.Cntlr()
    model_manager = ModelManager.initialize(cntlr)
    taxonomy_url = "https://eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60/form-60_2022-01-01.xsd"
    taxonomy = ModelXbrl.load(model_manager, taxonomy_url)
    return 1


if __name__ == "__main__":
    # Clear the Arelle web cache so every worker has to re-download the taxonomy.
    cntlr = Cntlr.Cntlr()
    cache = WebCache.WebCache(cntlr, None)
    cache.clear()
    with ProcessPoolExecutor(max_workers=10, mp_context=mp.get_context("fork")) as executor:
        taxonomies = list(executor.map(load_tax, range(5)))
```
The issue, I think, is that we split `ferc_to_sqlite` into form-specific ops. That works fine when you aren't using subprocesses, but once multiple subprocesses try to write to the cache at the same time, we hit a race condition where two processes execute this code simultaneously:
```python
if reload or not filepathExists:
    return filepath if self._downloadFile(url, filepath) else None
```
P1 and P2 both see that `not filepathExists`; then P1 successfully downloads the file, while P2 also tries to download it and runs into:
```
FileExistsError: [Errno 17] File exists: '/Users/dazhong-catalyst/Library/Caches/Arelle/https/eCollection.ferc.gov/taxonomy/form60/2022-01-01/form/form60'
```
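This is a classic check-then-act race. It can be sketched in plain Python with a hypothetical `download` function standing in for Arelle's `_downloadFile` (the real error above comes from the directory-creation step, which is why the path in the `FileExistsError` is a directory, not the `.xsd` file):

```python
import os
import tempfile


def download(url, filepath):
    """Hypothetical stand-in for Arelle's _downloadFile: make parent dirs, then write.

    os.makedirs without exist_ok=True raises FileExistsError if another
    process created the directory between our existence check and now.
    """
    os.makedirs(os.path.dirname(filepath))
    with open(filepath, "w") as f:
        f.write("fake taxonomy for " + url)


cache_root = tempfile.mkdtemp()
filepath = os.path.join(cache_root, "taxonomy", "form60", "form-60.xsd")

# Both "processes" check the cache before either has written anything:
p1_sees_missing = not os.path.exists(filepath)
p2_sees_missing = not os.path.exists(filepath)

download("https://example.com/form-60.xsd", filepath)  # P1 wins the race

try:
    download("https://example.com/form-60.xsd", filepath)  # P2 loses
except FileExistsError as err:
    print("P2:", err)
```

Both existence checks pass before either download starts, so the loser's `makedirs` call blows up on the directory the winner already created.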
I think what we need to do is warm the cache with an op that fetches all the taxonomies ahead of time. All the actual extraction ops can then depend on the warmed cache, which means `filepathExists` will always be `True` in every process, and we avoid the race condition.
This is the same problem as catalyst-cooperative/pudl-archiver#285, but we need to solve it separately because we're not operating in a Dagster environment. More thoughts there.
@jdangerx I think this has been fixed with the new version of the extractor, so I'm closing.
Overview
This seems to be another iteration of the failure from 2 days ago in #3441, stemming from Arelle having trouble with some cached file that it downloads from xbrl.org.
This issue also seems like it may be related to the problems with the XBRL archiver.
The first questionable error that shows up in the logs seems to be:
Next steps
Verify that everything is fixed!
Once you've applied any necessary fixes, make sure that the nightly build outputs are all in their right places.
Relevant logs
[link to build logs from internal distribution bucket]( PLEASE FIND THE ACTUAL LINK AND FILL IN HERE )