Open aphelionz opened 6 years ago
From @shawncrawley on June 21, 2016 23:20
@alvacouch @pkdash @hyi Do any of you have any leads on this?
From @pkdash on June 22, 2016 4:43
I suspect that, to obtain the size of a file that's on iRODS, Django might be copying the file, which is an expensive operation. Once we implement file-level metadata this won't be a problem.
From @mjstealey on June 22, 2016 11:19
"to obtain the size of a file that's on iRODS"
This can be done with the iCommands ils command using the -l or -L flag. Reference: https://docs.irods.org/4.1.8/icommands/user/#ils
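As a rough illustration of reading sizes out of a long listing, here is a minimal sketch that parses text shaped like `ils -l` output. The column layout (owner, replica number, resource, size, timestamp, `&`, name) is my assumption about the 4.1-era format, and the zone path and file names in `SAMPLE` are made up; the function name `parse_ils_long` is hypothetical.

```python
from typing import Dict


def parse_ils_long(listing: str) -> Dict[str, int]:
    """Map data-object name -> size in bytes from `ils -l`-style text."""
    sizes: Dict[str, int] = {}
    for line in listing.splitlines():
        # Assumed layout: owner replica resource size timestamp & name.
        # The collection header and "C- ..." sub-collection lines carry
        # no " & " separator, so they are skipped here.
        meta, sep, name = line.strip().partition(" & ")
        fields = meta.split()
        if not sep or len(fields) < 5 or not fields[3].isdigit():
            continue
        sizes[name] = int(fields[3])
    return sizes


# Made-up sample text; not captured from a real server.
SAMPLE = """\
/exampleZone/home/demo/abc123/data/contents:
  demo              0 demoResc         1024 2016-06-21.10:15 & gauge.csv
  demo              0 demoResc       524288 2016-06-21.10:16 & dem tile.tif
  C- /exampleZone/home/demo/abc123/data/contents/sub
"""

if __name__ == "__main__":
    print(parse_ils_long(SAMPLE))
```

One listing of a collection covers every data object in it, so a parser like this would need only a single `ils` call per collection rather than one per file.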
From @hyi on June 22, 2016 11:56
Yes, we currently use the iCommand ils in our code to obtain the size of a file stored in iRODS, and it looks like the method called by the REST API in HS is doing the right thing: https://github.com/hydroshare/hydroshare/blob/develop/hs_core/views/resource_rest_api.py#L62. However, I am not sure whether the REST Python client inadvertently downloads the files as well when getting the resource file list, as shown here: https://github.com/hydroshare/hs_restclient/blob/9ca9ed608638a5c1012233f0dfffcd8bbdbe0131/hs_restclient/__init__.py#L249 and https://github.com/hydroshare/hs_restclient/blob/9ca9ed608638a5c1012233f0dfffcd8bbdbe0131/hs_restclient/__init__.py#L268 in the generator. @alvacouch @pkdash Any ideas?
From @zhiyuli on July 8, 2016 15:48
@shawncrawley @hyi @pkdash @mjstealey I created two generic resources to test this issue. One has a single resource file; the other has 50. On my end it takes less than 1 second to open the first landing page, but about 9 seconds to open the second.
If I understand correctly, when the landing page is about to render, the backend makes a separate query against iRODS for each resource file to retrieve its size. So for the second resource, the backend has to ping iRODS 50 times before rendering the landing page.
Would storing the file size in the Django DB instead of querying iRODS make it faster?
The two test resources are here. (I didn't make them public because some public presentations/demos are coming up next week; I just shared them with the HydroShare Developers Group.)
landing page load test -- 1 file: https://www.hydroshare.org/resource/4e1cf9a4706c4ba5a39d053d23122eb1/
landing page load test -- 50 files: https://www.hydroshare.org/resource/fab56c38def24f01b42d40bd389dcf8d/
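For a back-of-the-envelope reading of those two timings, the sketch below fits a simple linear model, load time = base + n * per-file cost. It assumes the first page takes a full 1 second (the observation was "less than 1 sec") and that cost grows linearly with file count, neither of which has been verified.

```python
def per_file_cost(n_small: int, t_small: float,
                  n_large: int, t_large: float) -> float:
    """Slope of the linear model t = base + n * per_file, in seconds."""
    return (t_large - t_small) / (n_large - n_small)


# Rough figures from the two test resources: 1 file ~1 s, 50 files ~9 s.
cost = per_file_cost(1, 1.0, 50, 9.0)

if __name__ == "__main__":
    print(f"~{cost:.2f} s per file")  # prints "~0.16 s per file"
```

Under those assumptions each additional file adds roughly a sixth of a second, which is consistent with one extra size query per file.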
From @pkdash on July 8, 2016 17:31
@zhiyuli To get the file size, the icommand (ils) is run for each file. I'm not sure how expensive that command is; @hyi may know. I would assume storing the file size in the DB would improve performance.
From @hyi on July 11, 2016 0:51
@zhiyuli @pkdash Yes, the file size is generated dynamically using the iRODS ils command, which is very fast, but for a 50-file resource the cumulative time of the iRODS size queries for each resource file is not negligible. Agreed, storing the file size in the DB would improve performance in this case. Since the discussion of file-level metadata is still ongoing, this issue can be taken into account in the design of file-level metadata such as file type, file size, etc.
From @hyi on August 1, 2016 15:26
@alvacouch I am assigning this to you initially since it is related to file-level metadata (especially file size), which you are leading. Feel free to reassign as needed.
@hyi Caching the size of the file in Django will lead to distributed consistency problems. However, there are more efficient ways than individual icommands to measure this. I will look into this.
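One possibility (not necessarily the approach meant above) is to replace the per-file queries with a single collection-level listing. The sketch below only counts round trips against a stand-in object; `FakeStore`, `size_of`, and `list_sizes` are hypothetical names, not part of any iRODS client.

```python
from typing import Dict, Iterable


class FakeStore:
    """Stand-in for iRODS that counts round trips; not a real client."""

    def __init__(self, sizes: Dict[str, int]) -> None:
        self._sizes = dict(sizes)
        self.round_trips = 0

    def size_of(self, name: str) -> int:
        # Models one size query for one file.
        self.round_trips += 1
        return self._sizes[name]

    def list_sizes(self) -> Dict[str, int]:
        # Models one listing of the whole collection.
        self.round_trips += 1
        return dict(self._sizes)


def sizes_per_file(store: FakeStore, names: Iterable[str]) -> Dict[str, int]:
    """One round trip per file."""
    return {name: store.size_of(name) for name in names}


def sizes_batched(store: FakeStore, names: Iterable[str]) -> Dict[str, int]:
    """One round trip for the whole collection, filtered locally."""
    listing = store.list_sizes()
    return {name: listing[name] for name in names}


if __name__ == "__main__":
    files = {f"file_{i}.txt": 100 * i for i in range(50)}
    a, b = FakeStore(files), FakeStore(files)
    assert sizes_per_file(a, files) == sizes_batched(b, files)
    print(a.round_trips, b.round_trips)  # prints "50 1"
```

Because the sizes are still read from the store on every request, this avoids the consistency concern of a cached copy while cutting the number of queries.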
From @shawncrawley on June 21, 2016 21:18
I brought this up during a HydroShare call a few weeks ago. There is a button allowing users to add a resource file from HydroShare to their current GIS project. Thus, the app must generate a list of all resources available to them. It is beneficial to include the cumulative size of each resource's files so users have an idea of the wait time when adding a file. The problem is that the way the resource file sizes are currently obtained takes a lot of compute time. The resource list returns in a few seconds if I don't request the file sizes; if I do, it takes nearly a minute.
I am currently doing this using the hs_restclient python module with the following code:
Is there a better way to go about this to cut down compute time?
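For readers following along, here is a minimal sketch of the general pattern described: walk the resource list and sum the file sizes per resource. It is not the original snippet. The method names `getResourceList` and `getResourceFileList` and the `resource_id` / `size` keys are what I believe hs_restclient exposes, but treat them as assumptions; `FakeClient` is a made-up stand-in so the sketch runs without a server.

```python
from typing import Dict, Iterator, List


def resource_sizes(client) -> Dict[str, int]:
    """Total file size per resource; one file-list request per resource."""
    totals: Dict[str, int] = {}
    for resource in client.getResourceList():
        rid = resource["resource_id"]
        totals[rid] = sum(f["size"] for f in client.getResourceFileList(rid))
    return totals


class FakeClient:
    """Hypothetical stand-in mimicking the assumed client interface."""

    def __init__(self, files: Dict[str, List[int]]) -> None:
        self._files = files
        self.requests = 0

    def getResourceList(self) -> Iterator[dict]:
        self.requests += 1
        for rid in self._files:
            yield {"resource_id": rid}

    def getResourceFileList(self, rid: str) -> Iterator[dict]:
        self.requests += 1
        for size in self._files[rid]:
            yield {"size": size}


if __name__ == "__main__":
    client = FakeClient({"res-a": [10, 20], "res-b": [5]})
    print(resource_sizes(client))  # {'res-a': 30, 'res-b': 5}
    print(client.requests)         # 3: one list call plus one per resource
```

The request count grows with the number of resources, and on the server side each file list may in turn trigger a size lookup per file, which would explain why the total time climbs so quickly.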
Copied from original issue: hydroshare/hydroshare#1294