ckan / ckanext-archiver

Archive CKAN resources
MIT License

Large File leak in tasks._save_resource #65

Open EricSoroos opened 5 years ago


Here: https://github.com/ckan/ckanext-archiver/blob/master/ckanext/archiver/tasks.py#L734

def _save_resource(resource, response, max_file_size, chunk_size=1024*16):
    """
    Write the response content to disk.
    Returns a tuple:
        (file length: int, content hash: string, saved file path: string)
    """
    resource_hash = hashlib.sha1()
    length = 0

    fd, tmp_resource_file_path = tempfile.mkstemp()

    with open(tmp_resource_file_path, 'wb') as fp:
        for chunk in response.iter_content(chunk_size=chunk_size,
                                           decode_unicode=False):
            fp.write(chunk)
            length += len(chunk)
            resource_hash.update(chunk)

            if length >= max_file_size:
                raise ChooseNotToDownload(
                    _("Content-length %s exceeds maximum allowed value %s") %
                    (length, max_file_size))

    os.close(fd)

    content_hash = unicode(resource_hash.hexdigest())
    return length, content_hash, tmp_resource_file_path

If the file is too large, the function raises ChooseNotToDownload, but the exception does not carry the temporary file's path, so the caller has no way to clean up the partially written file.

Unfortunately, this means that "too large" resources will accumulate in the /tmp directory over time.
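
One possible way to avoid the leak is to delete the temporary file inside the function whenever the download fails. The sketch below is not the project's actual fix. It is written for Python 3, so it drops the `unicode()` call and the `_()` translation wrapper. It uses a stand-in `ChooseNotToDownload` class, and the names `save_resource_with_cleanup` and `FakeResponse` are hypothetical and exist only for illustration. It also wraps the descriptor returned by `mkstemp` with `os.fdopen`, so the descriptor is closed on the error path too. In the original, `os.close(fd)` is skipped when the exception is raised.

```python
import hashlib
import os
import tempfile


class ChooseNotToDownload(Exception):
    """Stand-in for the archiver's exception of the same name (assumption)."""


def save_resource_with_cleanup(response, max_file_size, chunk_size=1024 * 16):
    """Write the response content to a temp file, deleting it on failure.

    Returns a tuple: (file length, content hash, saved file path).
    """
    resource_hash = hashlib.sha1()
    length = 0

    fd, tmp_path = tempfile.mkstemp()
    try:
        # os.fdopen takes ownership of fd, so the descriptor is closed when
        # the with-block exits, whether normally or through an exception.
        with os.fdopen(fd, 'wb') as fp:
            for chunk in response.iter_content(chunk_size=chunk_size,
                                               decode_unicode=False):
                fp.write(chunk)
                length += len(chunk)
                resource_hash.update(chunk)

                if length >= max_file_size:
                    raise ChooseNotToDownload(
                        "Content-length %s exceeds maximum allowed value %s"
                        % (length, max_file_size))
    except Exception:
        # The file is already closed here; remove the partial download so it
        # does not accumulate in the temp directory, then re-raise.
        os.remove(tmp_path)
        raise

    return length, resource_hash.hexdigest(), tmp_path


class FakeResponse:
    """Hypothetical stub that mimics requests' Response.iter_content."""

    def __init__(self, data):
        self._data = data

    def iter_content(self, chunk_size, decode_unicode=False):
        for start in range(0, len(self._data), chunk_size):
            yield self._data[start:start + chunk_size]
```

With this shape the caller does not need the path on the failure path at all. An alternative would be to attach the path to the exception and leave cleanup to the caller, but that relies on every caller remembering to do it.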