Filebrowser API does not scale

FNNDSC / ChRIS_ultron_backEnd

Backend for ChRIS

https://fnndsc.github.io/ChRIS_ultron_backEnd

MIT License


Filebrowser API does not scale #538

Closed jennydaman closed 4 weeks ago

jennydaman commented 6 months ago

Our internal deployment of CUBE currently has 317,246 PACSFiles. Trying to get the path SERVICES/PACS is very slow, even though it only has 3 subdirectories.

xh -a jennings.zhang:REDACTED rc-live.tch.harvard.edu:32000/api/v1/filebrowser/search/ accept:application/json path==SERVICES/PACS
HTTP/1.1 200 OK
Allow: GET
Connection: Keep-Alive
Content-Length: 311
Content-Type: application/json
Cross-Origin-Opener-Policy: same-origin
Date: Wed, 06 Mar 2024 18:08:51 GMT
Keep-Alive: timeout=2, max=100
Referrer-Policy: same-origin
Server: Apache
Vary: Accept,Origin
X-Content-Type-Options: nosniff
X-Frame-Options: DENY

{
    "count": 1,
    "next": null,
    "previous": null,
    "results": [
        {
            "path": "SERVICES/PACS",
            "subfolders": "[\"PACSDCM\", \"orthanc-galena\", \"orthanc-pangea\"]",
            "url": "http://rc-live.tch.harvard.edu:32000/api/v1/filebrowser/SERVICES/PACS/",
            "files": "http://rc-live.tch.harvard.edu:32000/api/v1/filebrowser-files/SERVICES/PACS/"
        }
    ]
}

Response takes 30-120 seconds.
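
Note that `subfolders` in the response above is a JSON-encoded string rather than a JSON array, so a client has to decode it a second time. A minimal Python sketch, using an abbreviated copy of the response body shown above (the variable names are only illustrative):

```python
import json

# Abbreviated copy of the response body shown above.
body = r'''
{
    "count": 1,
    "results": [
        {
            "path": "SERVICES/PACS",
            "subfolders": "[\"PACSDCM\", \"orthanc-galena\", \"orthanc-pangea\"]"
        }
    ]
}
'''

result = json.loads(body)["results"][0]
# "subfolders" arrives as a string containing JSON, so it is decoded again.
subfolders = json.loads(result["subfolders"])
print(subfolders)
```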

jennydaman commented 6 months ago

This behavior is inconsistent with how the filebrowser works for other paths, e.g.

xh -a jennings.zhang:REDACTED rc-live.tch.harvard.edu:32000/api/v1/filebrowser/search/ accept:application/json path==SERVICES/PACS/PACSDCM/A_SPECIFIC_PATIENT

This response is <1 second.

rudolphpienaar commented 6 months ago

I'm not sure this is inconsistent. As I understand it, constructing the directory "structure" for a given path requires collecting and indexing all files under that path, merely to build the tree. So SERVICES/PACS needs to collect all PACS files, while SERVICES/PACS/PACSDCM/A_SPECIFIC_PATIENT only needs to collect and index a much smaller subset.
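
To illustrate the explanation above, here is a minimal Python sketch of deriving subfolder names from a flat list of file paths. It is not the actual CUBE code: the function name `list_subfolders` and the sample file names are hypothetical. The point is that every path under the prefix has to be visited, so the cost grows with the number of files rather than with the number of subfolders.

```python
def list_subfolders(all_paths, folder):
    """Derive the immediate subfolder names of `folder` from flat file paths."""
    prefix = folder.rstrip("/") + "/"
    subfolders = set()
    for path in all_paths:  # every stored path is inspected
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        head, sep, _ = rest.partition("/")
        if sep:  # a "/" remains, so `head` is a directory, not a file
            subfolders.add(head)
    return sorted(subfolders)


# Hypothetical sample data; a real deployment would have hundreds of
# thousands of such paths.
paths = [
    "SERVICES/PACS/PACSDCM/A_SPECIFIC_PATIENT/study/0001.dcm",
    "SERVICES/PACS/PACSDCM/A_SPECIFIC_PATIENT/study/0002.dcm",
    "SERVICES/PACS/orthanc-galena/another/0001.dcm",
    "SERVICES/PACS/orthanc-pangea/another/0001.dcm",
]
print(list_subfolders(paths, "SERVICES/PACS"))
```

With this approach, listing SERVICES/PACS walks all four sample paths to report three names, while listing a single patient directory only keeps the two paths below it.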

jennydaman commented 6 months ago

That is a good explanation of how things are working behind the scenes. Nonetheless, it's still a performance scaling issue.

As a client, I would be surprised that it takes more than a minute to list 3 folders.

More info: in Grafana, we can see a 6 GiB spike in memory usage at the time I made that request.

rudolphpienaar commented 6 months ago

Yes, unfortunately (again as I understand it) this is a consequence of the current folder-less internal data organization. One cannot be sure of the implicit dir structure until every filename under a path has been retrieved.

jbernal0019 commented 6 months ago

Correct. Performance will improve a lot with the explicit modeling of folders in the new CUBE that is coming soon.
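
As a rough illustration of why explicit folders help, the sketch below records each folder under its parent when a file is registered, then answers listing requests from that index. This is a hypothetical in-memory model, not the schema used in the new CUBE: `FolderIndex`, `add_file`, and `subfolders` are made-up names, and the file paths are sample data.

```python
from collections import defaultdict


class FolderIndex:
    """Hypothetical explicit folder model: maps a folder to its child folders."""

    def __init__(self):
        self._children = defaultdict(set)

    def add_file(self, path):
        # Register every ancestor folder once, at write time.
        parts = path.split("/")[:-1]  # drop the file name
        parent = ""
        for name in parts:
            self._children[parent].add(name)
            parent = f"{parent}/{name}" if parent else name

    def subfolders(self, folder):
        # Touches only the entries of `folder`, not the files beneath it.
        return sorted(self._children.get(folder, ()))


index = FolderIndex()
index.add_file("SERVICES/PACS/PACSDCM/A_SPECIFIC_PATIENT/study/0001.dcm")
index.add_file("SERVICES/PACS/orthanc-galena/another/0001.dcm")
index.add_file("SERVICES/PACS/orthanc-pangea/another/0001.dcm")
print(index.subfolders("SERVICES/PACS"))
```

The design trade-off is that a little extra work is done when each file is added, in exchange for listings whose cost depends on the number of direct children only.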

jennydaman commented 4 weeks ago

Resolved by https://github.com/FNNDSC/ChRIS_ultron_backEnd/pull/545