atlab / scanreader

Python TIFF Stack Reader for ScanImage 5 scans (including multiROI).
MIT License

Improve speed when opening tiff files over the network #5

Open ecobost opened 5 years ago

ecobost commented 5 years ago

After opening a file, if a user tries to access the num_frames of a scan, tifffile will iterate over every page to find its offset (see step 2 in the Details of data loading section in the readme). This turns out to be very slow when done over the network (almost 200x slower than when the file is local):

In [13]: f2 = tifffile.TiffFile('/mnt/scratch06/Two-Photon/taliah/2019-04-03_12-41-44/21067_10_00003_00001.tif')   # over the network

In [14]: cProfile.run('n2 = len(f2.pages)')
         240111 function calls (240109 primitive calls) in 28.641 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   28.641   28.641 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.287    0.287   28.641   28.641 tifffile.py:3375(_seek)
        1    0.000    0.000   28.641   28.641 tifffile.py:3567(__len__)
    40000    0.053    0.000   28.080    0.001 tifffile.py:5570(read)
    40001    0.065    0.000    0.209    0.000 tifffile.py:5662(seek)
    19999    0.010    0.000    0.010    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.049    0.000    0.049    0.000 {built-in method _struct.unpack}
        1    0.000    0.000   28.641   28.641 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000   28.641   28.641 {built-in method builtins.len}
    19999    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000   28.027    0.001   28.027    0.001 {method 'read' of '_io.BufferedReader' objects}
    40001    0.144    0.000    0.144    0.000 {method 'seek' of '_io.BufferedReader' objects}

In [18]: f3 = tifffile.TiffFile('/data/pipeline/21067_10_00003_00001.tif')   # local

In [19]: cProfile.run('n2 = len(f3.pages)')
         240111 function calls (240109 primitive calls) in 0.154 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.154    0.154 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 tifffile.py:2035(filehandle)
        1    0.046    0.046    0.154    0.154 tifffile.py:3375(_seek)
        1    0.000    0.000    0.154    0.154 tifffile.py:3567(__len__)
    40000    0.011    0.000    0.062    0.000 tifffile.py:5570(read)
    40001    0.014    0.000    0.036    0.000 tifffile.py:5662(seek)
    19999    0.003    0.000    0.003    0.000 tifffile.py:5704(size)
        1    0.000    0.000    0.000    0.000 tifffile.py:5708(closed)
    40000    0.006    0.000    0.006    0.000 {built-in method _struct.unpack}
        1    0.000    0.000    0.154    0.154 {built-in method builtins.exec}
      101    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
      3/1    0.000    0.000    0.154    0.154 {built-in method builtins.len}
    19999    0.002    0.000    0.002    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    40000    0.051    0.000    0.051    0.000 {method 'read' of '_io.BufferedReader' objects}
    40001    0.023    0.000    0.023    0.000 {method 'seek' of '_io.BufferedReader' objects}

The chain of operations goes scan.num_frames -> len(TiffFile.pages) -> TiffFile.TiffPages.seek(-1). Starting from the first page, which has already been read, seek(-1) moves page by page, reading each page's offset and saving it in an index. Per page, it performs two seeks and two reads on the tiff file handle (which is an io.BufferedReader object); these reads take most of the time.
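The page-by-page walk described above can be sketched with the standard library alone. This is not tifffile's code: it is a minimal illustration that assumes a little-endian BigTIFF layout (8-byte tag count, 20-byte tag entries, 8-byte next-page offset), and `build_fake_bigtiff` and `walk_page_offsets` are hypothetical helpers that operate on a synthetic in-memory file.

```python
import io
import struct

TAG_SIZE = 20  # bytes per tag entry in a BigTIFF page header (IFD)


def build_fake_bigtiff(num_pages, num_tags=3, data_bytes=64):
    """Build a synthetic BigTIFF-like byte string (hypothetical helper).

    Tag entries and pixel data are zero-filled; only the header and the
    chain of next-page offsets are meaningful.
    """
    header_size = 16
    stride = 8 + num_tags * TAG_SIZE + 8 + data_bytes
    # Header: byte order, version 43, offset size 8, padding, first IFD offset.
    buf = bytearray(struct.pack('<2sHHHQ', b'II', 43, 8, 0, header_size))
    for i in range(num_pages):
        is_last = i == num_pages - 1
        next_offset = 0 if is_last else header_size + (i + 1) * stride
        buf += struct.pack('<Q', num_tags)     # number of tags
        buf += bytes(num_tags * TAG_SIZE)      # tag entries (zeroed)
        buf += struct.pack('<Q', next_offset)  # offset of the next page
        buf += bytes(data_bytes)               # stand-in for pixel data
    return bytes(buf)


def walk_page_offsets(fh):
    """Follow the chain of page offsets: two seeks and two reads per page."""
    fh.seek(8)
    offset = struct.unpack('<Q', fh.read(8))[0]
    offsets, reads = [], 0
    while offset:
        offsets.append(offset)
        fh.seek(offset)
        tagno = struct.unpack('<Q', fh.read(8))[0]   # read number of tags
        fh.seek(offset + 8 + tagno * TAG_SIZE)
        offset = struct.unpack('<Q', fh.read(8))[0]  # read next page offset
        reads += 2
    return offsets, reads


offsets, reads = walk_page_offsets(io.BytesIO(build_fake_bigtiff(5)))
print(offsets)  # [16, 156, 296, 436, 576]
print(reads)    # 10
```

Every page costs two tiny reads that depend on the previous one, so the walk cannot be parallelised or batched without knowing the offsets in advance.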

However, they only read 8 bytes each (fh.read(tagnosize) reads the number of tags and fh.read(offsetsize) reads the actual offset), which doesn't amount to enough data to be a bottleneck (even assuming each 8-byte read is sent as a 96-byte TCP packet, that is only around 4 MB, which would not take 28 seconds). My guess is that it is the sheer number of packets that is causing the problem.
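The numbers from the two profiles can be worked through as a quick check. The read counts and timings below are copied from the cProfile output above; the 96-byte packet size is the same assumption made in the text, not a measurement.

```python
num_reads = 40000             # BufferedReader.read calls in both profiles
network_read_seconds = 28.027  # tottime of read, file over the network
local_read_seconds = 0.051     # tottime of read, local file
assumed_packet_bytes = 96      # assumption: one small packet per 8-byte read

# Total data moved if every read became one packet of the assumed size.
total_mb = num_reads * assumed_packet_bytes / 1e6

# Average cost of a single read in each setting.
per_read_ms_network = network_read_seconds / num_reads * 1e3
per_read_us_local = local_read_seconds / num_reads * 1e6

print(f"total transferred: {total_mb:.2f} MB")
print(f"per read (network): {per_read_ms_network:.2f} ms")
print(f"per read (local): {per_read_us_local:.2f} us")
```

Roughly 0.7 ms per read over the network against about 1.3 microseconds locally is consistent with the cost being dominated by per-request latency rather than bandwidth, although the profile alone does not prove it.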

In any case, because all pages in ScanImage's tiff files are the same size on file, the offset from page to page will be exactly the same, so we only need to compute one offset overall (or maybe one per file, to be safe and avoid read errors if two files come from different scans). This will require changing the seek function in tifffile.TiffPages to only compute the offset once and fill out the rest of the page offsets with it.
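A minimal sketch of the constant-stride idea follows. It is not a patch to tifffile: `extrapolate_page_offsets` is a hypothetical helper, and it assumes that every page occupies the same number of bytes and that pages run contiguously to the end of the file with nothing trailing them.

```python
def extrapolate_page_offsets(first_offset, second_offset, file_size):
    """Predict all page offsets from the first two (hypothetical helper).

    Assumes a constant page-to-page stride and no trailing bytes after
    the last page, so the page count follows from the file size.
    """
    stride = second_offset - first_offset
    if stride <= 0:
        raise ValueError("second page must come after the first page")
    num_pages = (file_size - first_offset) // stride
    return [first_offset + i * stride for i in range(num_pages)]


# Made-up example: 16-byte header, 140 bytes per page, 5 pages.
offsets = extrapolate_page_offsets(16, 156, 16 + 5 * 140)
print(offsets)  # [16, 156, 296, 436, 576]
```

Only the first two offsets need to be read from the file, so the number of network round trips no longer grows with the number of pages.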

cgohlke commented 3 years ago

> because all of ScanImage's tiff files' pages are the same size on file, the offset from page to page will be exactly the same so we only need to compute one offset overall

FWIW, this is not true for ScanImage > 2015 BigTIFF files, where the ImageDescription tag value varies. See also https://github.com/cgohlke/tifffile/issues/29.

ecobost commented 3 years ago

Hi @cgohlke. Tags changing size will be annoying; I would have to check when that is the case. I remember checking offsets for some test cases and they were the same, but maybe it changes for some configs. At least it will be patently obvious if the offsets are wrong (all kinds of stuff should break). Thanks for letting us know :+1:

PS: Not sure why that would have been a problem in the referenced issue; I thought tifffile explicitly reads the offsets page by page (even if is_scanimage is True), which is what this issue was supposed to be about.
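One way to check whether a given file has evenly spaced pages, as discussed above, is to compare the differences between consecutive page offsets. The sketch below uses a hypothetical helper, `stride_summary`, and made-up offset lists; how the true offsets are obtained from a real file is left out.

```python
def stride_summary(offsets):
    """Summarise the spacing between consecutive page offsets (hypothetical helper)."""
    strides = [b - a for a, b in zip(offsets, offsets[1:])]
    if not strides:
        # Fewer than two pages: there is no spacing to compare.
        return {"uniform": True, "min": None, "max": None}
    return {
        "uniform": len(set(strides)) == 1,
        "min": min(strides),
        "max": max(strides),
    }


# Made-up offsets: evenly spaced pages versus pages whose headers vary in size.
print(stride_summary([16, 156, 296, 436]))
print(stride_summary([16, 156, 300, 436]))
```

If the spacing is not uniform for a file, extrapolating offsets from a single stride would point into the middle of pages, so the page-by-page walk would still be needed for that file.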