cthoyt / zenodo-client

A tool for automated uploading and version management of scientific data to Zenodo
MIT License

Add progress bar for uploads #15

Open sgbaird opened 1 year ago

sgbaird commented 1 year ago

Could also just be a tqdm over the different files, though that's obviously a bit less informative.
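
For example, something like this rough sketch, which just loops over the files with tqdm and uses the generic bucket-upload pattern from the Zenodo docs (the helper name and bucket handling here are made up for illustration, not zenodo-client's internals):

from pathlib import Path

import requests
from tqdm import tqdm


def upload_with_file_progress(bucket: str, access_token: str, paths) -> None:
    """Hypothetical helper: upload each file to a deposition bucket with a per-file progress bar."""
    for path in tqdm([Path(p) for p in paths], unit="file", desc="Uploading"):
        with path.open("rb") as f:
            res = requests.put(
                f"{bucket}/{path.name}",
                data=f,
                params={"access_token": access_token},
            )
        res.raise_for_status()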

cthoyt commented 1 year ago

Here's the code that does uploads; if you can figure out how to make this work with tqdm, I would be happy to accept a PR:

https://github.com/cthoyt/zenodo-client/blob/a97de70673d459095e2b0bfd8569ffa009a5d236/src/zenodo_client/api.py#L200-L214
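
For reference, the usual tqdm trick for requests uploads is to wrap the opened file's read() so the bar advances as requests streams the body. A standalone sketch of that technique (not the linked code itself, just the pattern a PR would need to graft onto it):

import os

import requests
from tqdm import tqdm


def put_with_progress(url: str, path: str, params: dict) -> requests.Response:
    """Hypothetical helper: PUT a file while tqdm reports how many bytes requests has read."""
    total = os.path.getsize(path)
    with open(path, "rb") as file:
        with tqdm.wrapattr(
            file, "read", total=total, unit="B", unit_scale=True, unit_divisor=1024
        ) as wrapped:
            return requests.put(url, data=wrapped, params=params)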

sgbaird commented 1 year ago

Thank you! I may come back to this with a PR. As a rough data point, uploading https://doi.org/10.5281/zenodo.7693716 (6.3 GB) took me 47 minutes with ~40 Mbps upload speed (per Google's internet speed test).

40 Mbps ≈ 5 MB/s
6.3 GB ÷ 5 MB/s = 1260 s ≈ 21 minutes

So the upload took about 2.2× the estimated time. One option would be to check the file size, measure the upload speed with the speedtest package, and apply some safety factor to the estimate at each iteration.
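
A back-of-the-envelope version of that estimate (the 2.2× factor is just the ratio observed above, not anything measured):

def estimate_upload_minutes(size_bytes: float, upload_mbps: float, safety_factor: float = 2.2) -> float:
    """Rough upload-time estimate: size / speed, padded by an empirical safety factor."""
    bytes_per_second = upload_mbps * 1_000_000 / 8  # Mbps -> bytes per second
    return safety_factor * size_bytes / bytes_per_second / 60


# 6.3 GB at 40 Mbps: ~21 min raw, ~46 min with the 2.2x padding
print(round(estimate_upload_minutes(6.3e9, 40)))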

cthoyt commented 1 year ago

Please do not make a PR that includes additional dependencies, especially ones on external web services.

Reasons:

  1. More things can go wrong with external dependencies
  2. People who rely on Zenodo Client probably don't want to get a bunch of extras (it's pretty lean now)
  3. External web services can go down or impose limitations (I suspect speedtest somehow wants to make money, and I could imagine that negatively impacting users of this package)

daviddavo commented 3 months ago

A while ago I wrote the following code in a Jupyter notebook, which might be relevant for this issue:

import io
from pathlib import Path
from typing import Iterable, List, Optional

import requests
from tqdm.auto import tqdm

from zenodo_client import Zenodo


# https://stackoverflow.com/a/64423275/4505998
class UploadChunksIterator(Iterable):
    """
    An interface between python requests and tqdm:
    make the tqdm wrapper look like IOBase for the requests lib.
    """

    def __init__(
        self, file: io.BufferedReader, total_size: int, chunk_size: int = 16 * 1024
    ):  # 16 KiB chunks
        self.file = file
        self.chunk_size = chunk_size
        self.total_size = total_size

    def __iter__(self):
        return self

    def __next__(self):
        data = self.file.read(self.chunk_size)
        if not data:
            raise StopIteration
        return data

    # We don't take the length from io.BufferedReader because tqdm's
    # CallbackIOWrapper only exposes read(); reporting the size here lets
    # requests set a Content-Length header for the upload.
    def __len__(self):
        return self.total_size

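# The snippet below calls big_hash(), which isn't shown in the notebook. A minimal
# sketch of such a helper, assuming Zenodo's deposition file listing reports a plain
# MD5 hex digest in its "checksum" field:
import hashlib


def big_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file in 1 MiB chunks, without loading it all into memory."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
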
class ZenodoPlus(Zenodo):
    def _upload_big_files(
        self, *, bucket: str, paths, done: Optional[List[dict]] = None
    ) -> List[requests.Response]:
        _paths = [paths] if isinstance(paths, (str, Path)) else paths
        _paths = [Path(p) for p in _paths]
        _done_dict = {f["filename"]: f for f in (done or [])}

        rv = []
        # see https://developers.zenodo.org/#quickstart-upload
        for path in _paths:
            total_size = path.stat().st_size

            if path.name in _done_dict:
                _info = _done_dict[path.name]
                md5sum = big_hash(path)

                if _info["filesize"] == total_size and _info["checksum"] == md5sum:
                    print("Skipping already uploaded file", path, "with hash", md5sum)
                    continue

                raise ValueError(
                    f"{path.name} is already on the deposition but with a different size or checksum"
                )

            with open(path, "rb") as file:
                wrapped_file = tqdm.wrapattr(
                    file,
                    "read",
                    miniters=1,
                    total=total_size,
                    unit="B",
                    unit_scale=True,
                    unit_divisor=1024,
                )

                with wrapped_file as f:
                    res = requests.put(
                        f"{bucket}/{path.name}",
                        data=UploadChunksIterator(f, total_size),
                        params={"access_token": self.access_token},
                    )

            res.raise_for_status()
            rv.append(res)
        return rv

    def upload_to_record(self, deposition_id: str, paths):
        url = f"{self.depositions_base}/{deposition_id}"
        res = requests.get(url, params={"access_token": self.access_token})
        res.raise_for_status()

        deposition_data = res.json()

        bucket = deposition_data['links']['bucket']
        self._upload_big_files(bucket=bucket, paths=paths, done=deposition_data['files'])

Z = ZenodoPlus(sandbox=ZENODO_SANDBOX)
Z.upload_to_record(ZENODO_NOMODELS_ID, ["ray_results_nomodels.tar.xz"])
print(f"Remember to publish {Z.depositions_base}/{ZENODO_NOMODELS_ID}")

I don't have the time to make a PR, but if someone wants to use this, feel free; it's under the MIT license.