audeering / audbackend

Manage file storage on different backends
https://audeering.github.io/audbackend/
Other
3 stars 0 forks source link

Avoid using glob() in Artifactory backend #132

Open hagenw opened 1 year ago

hagenw commented 1 year ago

For the implementtion of audbackend.Artifactory.ls() we switched to use glob('*/**') in the dev branch, which is very slow. In main we avoided using glob() in ls(), which we should also do in dev.

frankenjoe commented 1 year ago

Yes, I hope there is a good way to do so. Cause in the old implementation we were only returning the files in the folder, but since we decided that the concept of folders must not exist on a backend, we now return all files that have a backend path that starts with the requested sub-path.

frankenjoe commented 1 year ago

Ok, seems we should simply do:

backend = audbackend.access(
    'artifactory',
    'https://audeering.jfrog.io/artifactory',
    'data-public',
)

for p in backend._repo:
    print(p)

It returns all files in the repository and is really fast.

frankenjoe commented 1 year ago

Unfortunately, it seems the solution I proposed is not working with anonymous access at the moment. They might change it in the future, though:

https://jfrog.atlassian.net/browse/RTFACT-7384

For the record, this is the commit where I replaced glob() with simply listing everything in the repository:

https://github.com/audeering/audbackend/pull/130/commits/5f2ccc55bb7414be11eb47cbb61d7bf8898e7a01

Under the hood it does the following:

import audbackend

backend = audbackend.access(
    'artifactory',
    'https://audeering.jfrog.io/artifactory',
    'data-public',
)
backend._use_legacy_file_structure()

query = [
    'items.find',
    {
        'repo': {'$eq': 'data-public'},
        'type': 'file',
    },
    '.include',
    ['name'],
]
backend._repo.aql(*query)
[{'name': '005d2b91-5317-0c80-d602-6d55f0323f8c-1.1.0.zip'}, {'name': '014f82d8-3491-fd00-7397-c3b2ac3b2875-1.1.0.zip'}, {'name': '0241da63-dc03-ccb3-623c-80020b913b0d-1.1.0.zip'}, {'name': '02ae4ee2-063a-b720-897e-12400ea54f89-1.1.0.zip'}, {'name': '03ac9ce7-461e-9991-b925-89c2697d7f38-1.1.0.zip'}, {'name': '040df0a2-bdbf-6590-f86a-947f94ae7e12-1.1.0.zip'}, {'name': '0415d83a-9d8e-2cc6-86b0-2b0649c4e67d-1.1.0.zip'},`...]

But returns 403 for an anonymous user (even if the repository allows anonymous access).

hagenw commented 1 year ago

As can be seen in https://github.com/audeering/audb/commit/1e5a0cfb83613d4bebf31cf5bc33913905d61baf the following works:

backend = audbackend.access(
    'artifactory',
    'https://audeering.jfrog.io/artifactory',
    'data-public',
)

for p in backend._repo.path:
    print(p)
frankenjoe commented 1 year ago

It's not recursive.

hagenw commented 1 year ago

You can do this recursive by looping as long as it finds paths. But maybe that is as slow as glob() then?

hagenw commented 10 months ago

It's also mentioned at https://github.com/devopshq/artifactory/issues/423 that glob() is slow, especially compared to the underlying API call. So maybe we could also fix it by fixing the original code inside dohq-artifactory.