dathere / datapusher-plus

A standalone web service that pushes data into the CKAN Datastore fast & reliably. It pushes real good!
GNU Affero General Public License v3.0
27 stars 21 forks source link

Temporary files are not always removed #116

Closed bzar closed 1 year ago

bzar commented 1 year ago

Describe the bug datapusher-plus leaves temporary files in /tmp, which accumulate until available space is depleted

To Reproduce Steps to reproduce the behavior:

  1. Process various files with datapusher-plus
  2. Observe /tmp

Expected behavior After analysis is complete /tmp contains no temporary data files

Additional context I suspect this happens because the NamedTemporaryFiles are used mostly as temp file name generators instead of file objects, so the automatic deletion does not work.

I propose using a TemporaryDirectory instead within a with statement which ensures it is deleted afterwards. The most ergonomic way of implementing this is splitting push_to_datastore into two functions: one to work as a @job.asynchronous and context manager, the other to actually process the data like so:

@job.asynchronous
def push_to_datastore(task_id, input, dry_run=False):
    """Download and parse a resource push its data into CKAN's DataStore.

    An asynchronous job that gets a resource from CKAN, downloads the
    resource's data file and, if the data file has changed since last time,
    parses the data and posts it into CKAN's DataStore.

    :param dry_run: Fetch and parse the data file but don't actually post the
        data to the DataStore, instead return the data headers and rows that
        would have been posted.
    :type dry_run: boolean

    """

    # Ensure temporary files are removed after run
    with tempfile.TemporaryDirectory() as temp_dir:
        return _push_to_datastore(task_id, input, dry_run=dry_run, temp_dir=temp_dir)

def _push_to_datastore(task_id, input, dry_run=False, temp_dir=None):
    handler = util.StoringHandler(task_id, input)
    ...

Then all the temporary files can be created within that directory with sensible names like qsv_dedup.csv and omit all the os.unlink stuff. Whatever happens in the processing the with statement will remove the temporary files afterwards.

jqnatividad commented 1 year ago

Thanks @bzar for the suggestion!

Your proposed approach is certainly much better and simplifies the code.