HydroShare GIS app - more efficient way to access cumulative size of resource files

aphelionz commented 6 years ago

From @shawncrawley on June 21, 2016 21:18

I brought this up during a HydroShare call a few weeks ago. There is a button allowing users to add a resource file from HydroShare to their current GIS project, so the app must generate a list of all resources available to them. It is beneficial to include the cumulative size of each resource's files so users have an idea of the wait time when adding a file. The problem is that the way the file sizes are currently obtained takes a lot of compute time. The resource list returns in a few seconds if I skip the file sizes; if I include them, it takes nearly a minute.

I am currently doing this using the hs_restclient python module with the following code:

from hs_restclient import HydroShare, HydroShareNotAuthorized

hs = HydroShare()
res_list = []
for res in hs.getResourceList():
    res_id = res['resource_id']
    res_size = 0
    try:
        # One extra request per resource: sum the sizes of its files
        for res_file in hs.getResourceFileList(res_id):
            res_size += res_file['size']
    except HydroShareNotAuthorized:
        # Skip resources whose file list we are not allowed to read
        continue
    except Exception as e:
        # Any other failure: report it and keep the partial size
        print(e)

    res_list.append({
        'title': res['resource_title'],
        'type': res['resource_type'],
        'id': res_id,
        'size': res_size,
        'owner': res['creator']
    })

Is there a better way to go about this to cut down compute time?
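
One possible direction, offered only as a sketch: each getResourceFileList call is an independent request, so the per-resource sums could be issued concurrently from a thread pool. The helper name sum_resource_sizes and its list_files parameter are hypothetical and not part of hs_restclient; in the app, list_files would presumably be hs.getResourceFileList. This overlaps the waiting time but does not reduce the number of requests the server has to answer.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_resource_sizes(resource_ids, list_files, max_workers=8):
    """Return {resource_id: total size}, with None where listing failed.

    list_files is any callable that maps a resource id to an iterable of
    dicts with a 'size' key (a hypothetical stand-in for
    hs.getResourceFileList).
    """
    def size_of(res_id):
        try:
            return sum(f['size'] for f in list_files(res_id))
        except Exception:
            # e.g. not authorized; the caller decides how to treat None
            return None

    ids = list(resource_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each id with its result
        return dict(zip(ids, pool.map(size_of, ids)))
```

Usage would look like sum_resource_sizes([r['resource_id'] for r in hs.getResourceList()], hs.getResourceFileList). On Python 2, concurrent.futures requires the futures backport package.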

Copied from original issue: hydroshare/hydroshare#1294

aphelionz commented 6 years ago

From @shawncrawley on June 21, 2016 23:20

@alvacouch @pkdash @hyi Do any of you have any leads on this?

aphelionz commented 6 years ago

From @pkdash on June 22, 2016 4:43

I suspect that, to obtain the size of a file that's on iRODS, Django might be copying the file, which is an expensive operation. Once we implement file-level metadata this won't be a problem.

aphelionz commented 6 years ago

From @mjstealey on June 22, 2016 11:19

> to obtain the size of a file that's on iRODS

This can be done with the iCommands ils command using the -l or -L flag. Reference: https://docs.irods.org/4.1.8/icommands/user/#ils
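
As a rough illustration of that approach, the sketch below runs ils -l once on a whole collection and reads every file size from the single listing, instead of calling ils per file. The function names are hypothetical, and the parser assumes each data object line has the fields owner, replica number, resource, size, timestamp, an optional "&" flag, and name; that layout should be checked against the iRODS version in use.

```python
import subprocess


def parse_ils_long(output):
    """Map data object name -> size in bytes from `ils -l` text output.

    Assumed line layout (verify for your iRODS version):
    owner, replica number, resource, size, timestamp, optional '&', name.
    """
    sizes = {}
    for line in output.splitlines():
        fields = line.split(None, 5)
        # Skip the collection header ("/zone/path:") and sub-collections ("C- ...")
        if len(fields) < 6 or not fields[3].isdigit():
            continue
        rest = fields[5].strip()
        # Drop the '&' status flag when it precedes the name
        name = rest[2:] if rest.startswith('& ') else rest
        # With several replicas the last listed replica wins
        sizes[name] = int(fields[3])
    return sizes


def ils_sizes(collection):
    """Run one `ils -l` for the collection and parse all file sizes."""
    out = subprocess.check_output(['ils', '-l', collection])
    return parse_ils_long(out.decode('utf-8'))
```

sum(ils_sizes(path).values()) would then give the cumulative size of the files directly under path.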

aphelionz commented 6 years ago

From @hyi on June 22, 2016 11:56

Yes, we currently use the ils iCommand to obtain the size of a file stored in iRODS in our code, and it looks like the method being called by the REST API in HS is doing the right thing: https://github.com/hydroshare/hydroshare/blob/develop/hs_core/views/resource_rest_api.py#L62. However, I am not sure whether the REST Python client inadvertently downloads the file as well when getting the resource file list, as shown here: https://github.com/hydroshare/hs_restclient/blob/9ca9ed608638a5c1012233f0dfffcd8bbdbe0131/hs_restclient/__init__.py#L249 and https://github.com/hydroshare/hs_restclient/blob/9ca9ed608638a5c1012233f0dfffcd8bbdbe0131/hs_restclient/__init__.py#L268 in the generator. @alvacouch @pkdash Any idea?

aphelionz commented 6 years ago

From @zhiyuli on July 8, 2016 15:48

@shawncrawley @hyi @pkdash @mjstealey I created two generic res to test this issue. One has only one res file, the other has 50 res files. On my end it takes less than 1 sec to open the first landing page, but about 9 seconds to open the second.

If I understand it correctly, when the landing page is about to show up, the backend makes a separate query against iRODS for each res file to retrieve its file size. So for the second res the backend has to ping iRODS 50 times before rendering its landing page.

Will storing the file size value in django db instead of querying irods make it faster?
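
To make the idea concrete, here is a minimal sketch in which a plain dict stands in for a Django database column. The class CachedSizes and the query_size callable are hypothetical names, not HydroShare code. Sizes are recorded when a file is added, so rendering a 50-file page needs no storage queries; the trade-off is that a stored value goes stale if a file is changed in iRODS without going through the code that updates it.

```python
class CachedSizes(object):
    """Remember file sizes so repeated page loads skip the storage query."""

    def __init__(self, query_size):
        # query_size: callable doing one (slow) storage lookup for a path
        self._query_size = query_size
        # Stand-in for a database column holding the size of each file
        self._sizes = {}

    def record(self, path, size):
        """Store the size at the moment the file is added or replaced."""
        self._sizes[path] = size

    def size(self, path):
        """Return the stored size, querying storage only on a miss."""
        if path not in self._sizes:
            self._sizes[path] = self._query_size(path)
        return self._sizes[path]

    def total(self, paths):
        """Cumulative size of the given files."""
        return sum(self.size(p) for p in paths)
```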

The two test res are here: (I didn't make these test res public as I know some public presentations/demos are upcoming next week. I just shared them with the HydroShare Developers Group.)

landing page load test -- 1 file: https://www.hydroshare.org/resource/4e1cf9a4706c4ba5a39d053d23122eb1/
landing page load test -- 50 files: https://www.hydroshare.org/resource/fab56c38def24f01b42d40bd389dcf8d/

aphelionz commented 6 years ago

From @pkdash on July 8, 2016 17:31

@zhiyuli To get the file size, the iCommand (ils) is run once for each file. Not sure how expensive that command is. @hyi may know that. I would assume storing the file size in the db would improve performance.

aphelionz commented 6 years ago

From @hyi on July 11, 2016 0:51

@zhiyuli @pkdash Yes, the file size is dynamically generated using the iRODS ils command, which is very fast, but for a 50-file resource, the cumulative time of the iRODS size queries for all resource files is not negligible. Agreed that storing the file size in the db would improve performance in this case. Since the discussion of file-level metadata is still ongoing, this issue can be taken into account in the design of file-level metadata such as file type, file size, etc.

aphelionz commented 6 years ago

From @hyi on August 1, 2016 15:26

@alvacouch I am assigning this to you initially since it is related to file-level metadata (especially file size), which you are leading. Feel free to reassign as needed.

alvacouch commented 6 years ago

@hyi Caching the size of the file in Django will lead to distributed consistency problems. However, there are more efficient ways than individual icommands to measure this. I will look into this.
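
The comment above does not say which method is meant. One candidate, shown purely as an assumption, is a single catalog query through the iquest iCommand that asks iRODS to sum DATA_SIZE for a collection, so the cumulative size comes back in one round trip and nothing is cached outside iRODS. The function names are hypothetical, the query covers only files directly in the given collection, and the sum may count each replica of a file separately.

```python
import subprocess


def build_total_size_command(collection):
    """Build an iquest command that sums DATA_SIZE for one collection."""
    if "'" in collection:
        # The path is embedded in a quoted query string, so refuse quotes
        raise ValueError("collection path must not contain a single quote")
    query = "select sum(DATA_SIZE) where COLL_NAME = '{0}'".format(collection)
    # '%s' is the output format: print only the summed value
    return ['iquest', '%s', query]


def parse_total_size(output):
    """Return the sum as an int, or 0 if iquest reported no matching rows."""
    text = output.strip()
    return int(text) if text.isdigit() else 0


def total_size(collection):
    """Run the query; a non-zero exit (e.g. no rows found) is treated as 0."""
    try:
        out = subprocess.check_output(build_total_size_command(collection))
    except subprocess.CalledProcessError:
        return 0
    return parse_total_size(out.decode('utf-8'))
```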

CUAHSI-APPS / tethysapp-hydroshare_gis #19