box / box-python-sdk

Box SDK for Python
http://opensource.box.com/box-python-sdk/
Apache License 2.0

Folder partitioning #755

Closed GavinReynolds closed 2 years ago

GavinReynolds commented 2 years ago

Hi all,

I am implementing Box as a file repository into a task application I have developed for a large organisation. The application handles around 50,000 requests per annum, with at least 2 files per task saved in Box.

Each file has a retention period of 6 years, so I do not think the ideal solution would be to store every file in the root directory, especially as there is a user requirement to visit any task and query Box for the files associated with that task.

Therefore, I am looking to partition the files by year and month i.e. ROOT > 2022 > August > task folder

However, I do not want to keep hammering Box with API requests using the get_items() method.

I would have to use the get_items() method twice before I could check whether a folder named after a task reference exists for the current month: once on the root, which contains a folder for each year, and once on the year folder, which contains a folder for each month. Then, if a folder named after the task reference does not exist in the current month, the application would use the create_subfolder() method to create one.
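For readers following along, the walk described above can be sketched roughly as below. This is only an illustration, not code from the thread: the helper names (`find_subfolder`, `get_or_create_subfolder`, `ensure_task_folder`) are hypothetical, and `root` is assumed to be a Box SDK folder object (for example `client.folder('0')`) whose items expose `type` and `name`. Each level costs one get_items() listing, which is the overhead the question is about.

```python
def find_subfolder(parent, name):
    """Return the child folder of `parent` called `name`, or None."""
    for item in parent.get_items():
        if item.type == "folder" and item.name == name:
            return item
    return None


def get_or_create_subfolder(parent, name):
    """Reuse an existing child folder, creating it only when it is missing."""
    existing = find_subfolder(parent, name)
    if existing is not None:
        return existing
    return parent.create_subfolder(name)


def ensure_task_folder(root, year, month, task_ref):
    """Walk ROOT > year > month > task folder, creating any missing level."""
    folder = root
    for name in (str(year), month, task_ref):
        folder = get_or_create_subfolder(folder, name)
    return folder
```

Calling `ensure_task_folder(root, 2022, "August", "TASK-1")` twice should return the same task folder the second time, at the price of three listings per call.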

I know there is a filter using the search() method but it seems that I would require the folder id associated to all folders named after year and folders named after month which, other than the root folder id, the application would not know (only the folder name).

Is there a solution to making queries for files/folders with this kind of structure without knowing the folder id?

Or, I may be wrong in thinking I could be creating an overhead by having many thousands of folders created at root.

I hope the above makes sense.

Thank you all in advance.

Gavin.

lukaszsocha2 commented 2 years ago

Hi @GavinReynolds, currently Box doesn't provide an API for getting a file/folder id by its path, so it is necessary to use get_items() the way you mentioned. However, I see a few possible solutions for this:

  1. Keep everything in the root folder. This is the worst solution from the user's perspective, as it is hard to find anything there. Also, when you want to get any folder or file using get_items() it will take more time, as there will be more items to iterate through.
  2. Create the structure you proposed (ROOT > 2022 > August > task folder) and cache a dictionary with the ids of the folders. You can easily store it using the shelve library. So you can create a dictionary with the structure: {2022: {'August': '612516526', 'July': '265362536'}, 2021: ...}. Then you will be able to get the id of any folder easily.
  3. Create the structure you mentioned, but include the year in the month folder name, e.g. ROOT > 2022 > August-2022 > task folder. Then you can use search() to find a folder with the given name, August-2022.
  4. Create the structure you proposed (ROOT > 2022 > August > task folder) and use search() to find a month and then check parent folder name similar to the example below (some additional checks may be required):
    # Search for the month name, then keep only the hit whose parent is the wanted year.
    for item in user_client.search().query(query='July'):
        if item.parent.name == '2022':
            print(item.object_id)

    If the retention is 6 years then it should iterate through 6 folders in the worst case scenario.
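As a rough illustration of option 2, the nested dictionary can be persisted with the standard-library shelve module along the lines below. The function names and the `cache_path` argument are hypothetical and only for this sketch; the folder id is the example value from the list above. Two details of shelve matter here: keys must be strings, and a nested dictionary has to be reassigned to its key after being changed (unless the shelf is opened with writeback=True), otherwise the change is not saved.

```python
import shelve


def save_folder_id(cache_path, year, month, folder_id):
    """Record the Box folder id for a year/month pair in a shelve file."""
    with shelve.open(cache_path) as cache:
        months = cache.get(str(year), {})  # shelve keys must be strings
        months[month] = folder_id
        cache[str(year)] = months  # reassign so the nested change is persisted


def load_folder_id(cache_path, year, month):
    """Return the cached folder id, or None if it has not been stored yet."""
    with shelve.open(cache_path) as cache:
        return cache.get(str(year), {}).get(month)
```

For example, after `save_folder_id("box_folder_ids", 2022, "August", "612516526")`, a later `load_folder_id("box_folder_ids", 2022, "August")` in another process should give back the same id without any call to Box.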

Hope one of the above solutions will work for you. If you have any additional questions don't hesitate to ask. Best, @lukaszsocha2

GavinReynolds commented 2 years ago

@lukaszsocha2 Thank you very much for your prompt response yesterday along with your solutions. I have included the shelve library and it works great, especially because I now only need to call create_subfolder() once per year and month level (when they do not appear in the dictionary) and then once for every task folder created. And of course get_items() is only called if the user requests files from the task folder. Thanks again, Gavin.
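The flow described in this last comment might look something like the sketch below. It is a guess at the shape of the solution, not Gavin's actual code: `cached_subfolder`, `create_task_folder` and the "year/month" cache keys are made up for illustration, `cache` is assumed to be any dict-like object (an open shelve file would fit), and `client` is assumed to be a Box SDK client, with "0" as the root folder id.

```python
def cached_subfolder(client, cache, parent_id, cache_key, name):
    """Return the id of a year or month folder, creating it only on a cache miss."""
    if cache_key not in cache:
        created = client.folder(parent_id).create_subfolder(name)
        cache[cache_key] = created.object_id
    return cache[cache_key]


def create_task_folder(client, cache, year, month, task_ref, root_id="0"):
    """Create ROOT > year > month > task folder using cached ids for year and month."""
    year_id = cached_subfolder(client, cache, root_id, str(year), str(year))
    month_id = cached_subfolder(client, cache, year_id, f"{year}/{month}", month)
    # The task folder itself is new each time, so it is always created.
    return client.folder(month_id).create_subfolder(task_ref)
```

With a warm cache this makes a single create_subfolder() call per task and no get_items() calls. One caveat: if the cache file is lost while the year and month folders still exist in Box, creating them again would be expected to fail with a name conflict, so a fallback to listing or search() would be needed in that case.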