Describe the bug
datapusher-plus leaves temporary files in /tmp, which accumulate until available space is depleted
To Reproduce
Steps to reproduce the behavior:
Process various files with datapusher-plus
Observe /tmp
Expected behavior
After analysis is complete /tmp contains no temporary data files
Additional context
I suspect this happens because the NamedTemporaryFiles are used mostly as temp file name generators instead of file objects, so the automatic deletion does not work.
I propose using a TemporaryDirectory instead within a with statement which ensures it is deleted afterwards. The most ergonomic way of implementing this is splitting push_to_datastore into two functions: one to work as a @job.asynchronous and context manager, the other to actually process the data like so:
@job.asynchronous
def push_to_datastore(task_id, input, dry_run=False):
"""Download and parse a resource push its data into CKAN's DataStore.
An asynchronous job that gets a resource from CKAN, downloads the
resource's data file and, if the data file has changed since last time,
parses the data and posts it into CKAN's DataStore.
:param dry_run: Fetch and parse the data file but don't actually post the
data to the DataStore, instead return the data headers and rows that
would have been posted.
:type dry_run: boolean
"""
# Ensure temporary files are removed after run
with tempfile.TemporaryDirectory() as temp_dir:
return _push_to_datastore(task_id, input, dry_run=dry_run, temp_dir=temp_dir)
def _push_to_datastore(task_id, input, dry_run=False, temp_dir=None):
handler = util.StoringHandler(task_id, input)
...
Then all the temporary files can be created within that directory with sensible names like qsv_dedup.csv and omit all the os.unlink stuff. Whatever happens in the processing the with statement will remove the temporary files afterwards.
Describe the bug datapusher-plus leaves temporary files in /tmp, which accumulate until available space is depleted
To Reproduce Steps to reproduce the behavior:
Expected behavior After analysis is complete /tmp contains no temporary data files
Additional context I suspect this happens because the
NamedTemporaryFile
s are used mostly as temp file name generators instead of file objects, so the automatic deletion does not work.I propose using a
TemporaryDirectory
instead within a with statement which ensures it is deleted afterwards. The most ergonomic way of implementing this is splittingpush_to_datastore
into two functions: one to work as a@job.asynchronous
and context manager, the other to actually process the data like so:Then all the temporary files can be created within that directory with sensible names like
qsv_dedup.csv
and omit all theos.unlink
stuff. Whatever happens in the processing the with statement will remove the temporary files afterwards.