CellProfiler / python-bioformats

Read and write life sciences file formats
Other
131 stars 45 forks source link

Inefficient retrieval of WellSample attribute values #44

Open hackermd opened 8 years ago

hackermd commented 8 years ago

Getting attribute values of a WellSample object is very inefficient. This becomes a major bottleneck when working with a large number of images, for example in case of a 384 well plate.

Here is small test program that demonstrates the problems and a potential solution:

#!/usr/bin/env python
import bioformats
import itertools
import logging
import argparse
import time

FORMAT = '%(asctime)-15s | %(message)s'
logging.basicConfig(format=FORMAT)

logger = logging.getLogger('test')

logging_level_verbosity = {
    0: logging.CRITICAL,
    1: logging.INFO,
    2: logging.DEBUG,
}

if __name__ == '__main__':

    parser = argparse.ArgumentParser('bioformats test program')
    parser.add_argument('-v', dest='verbosity', action='count', default=0,
                        help='increase logging verbosity')
    parser.add_argument('--hashing', action='store_true',
                        help='store reference image IDs in dictionary')

    args = parser.parse_args()
    logger.setLevel(logging_level_verbosity[args.verbosity])

    start_time_1 = time.time()

    metadata = bioformats.OMEXML()

    logger.info('prepare first plate')

    plate = metadata.PlatesDucktype(metadata.root_node).newPlate(name='default')
    plate.RowNamingConvention = 'letter'
    plate.ColumnNamingConvention = 'number'
    plate.Rows = 16
    plate.Columns = 24
    n_samples = 48

    logger.info('process first plate')
    well_count = 0
    sample_count = 0
    lookup = dict()
    for r, c in itertools.product(range(plate.Rows), range(plate.Columns)):
        logger.info('process well # %d: %s%.2d', well_count+1, chr(r+1+64), c+1)
        well = metadata.WellsDucktype(plate).new(row=r, column=c)
        samples = metadata.WellSampleDucktype(well.node)
        for s in xrange(n_samples):
            logger.debug('process sample # %d', s)
            samples.new(index=s)
            samples[s].ImageRef = 'Image:%d' % sample_count
            # Store the attribute value in a dictionary for subsequent hashing
            lookup[(r, c, s)] = samples[s].ImageRef
            sample_count += 1
        well_count += 1

    end_time_1 = time.time() - start_time_1
    logger.info('processing first plate took %s seconds', end_time_1)

    start_time_2 = time.time()

    logger.info('prepare second plate')

    new_metadata = bioformats.OMEXML()

    new_plate = new_metadata.PlatesDucktype(
                        new_metadata.root_node).newPlate(name='default')
    new_plate.RowNamingConvention = 'letter'
    new_plate.ColumnNamingConvention = 'number'
    new_plate.Rows = plate.Rows
    new_plate.Columns = plate.Columns
    n_samples = len(plate.Well[0].Sample)

    logger.info('process second plate')
    for w, well_id in enumerate(plate.Well):
        logger.info('process well # %d: %s', w+1, well_id)
        well = new_metadata.WellsDucktype(plate).new(
                        row=plate.Well[w].Row, column=plate.Well[w].Column)
        samples = new_metadata.WellSampleDucktype(well.node)
        for s in xrange(n_samples):
            logger.debug('process well sample # %d', s)
            samples.new(index=s)
            if args.hashing:
                samples[s].ImageRef = lookup[(well.Row, well.Column, s)]
            else:
                samples[s].ImageRef = plate.Well[w].Sample[s].ImageRef

    end_time_2 = time.time() - start_time_2
    logger.info('processing second plate took %s seconds', end_time_2)

Running the program with --hashing argument

test_bioformats.py -v --hashing

takes less than 5 minutes

2015-12-02 11:04:13,044 | processing second plate took 206.617892981 seconds

while running it in default mode

test_bioformats.py -v

takes almost 1 hour

2015-12-02 12:17:57,098 | processing second plate took 3459.09807611 seconds

I assume the problem lies in the implementation of the __getitem__ method of the WellDucttype and WellSampleDucktype classes.

hackermd commented 8 years ago

The major bottleneck is the __getitem__ method of the WellDucttype class.

Changing the code in the above example to:

    logger.info('process second plate')
    for w, well_id in enumerate(plate.Well):
        logger.info('process well # %d: %s', w+1, well_id)
        well = new_metadata.WellsDucktype(plate).new(
                        row=plate.Well[w].Row, column=plate.Well[w].Column)
        samples = new_metadata.WellSampleDucktype(well.node)
        currently_processed_well = plate.Well[w]  # store the well object
        for s in xrange(n_samples):
            logger.debug('process well sample # %d', s)
            samples.new(index=s)
            if args.hashing:
                samples[s].ImageRef = lookup[(well.Row, well.Column, s)]
            else:
                samples[s].ImageRef = currently_processed_well.Sample[s].ImageRef

boosts the performance of

test_bioformats.py -v

to

2015-12-02 12:39:40,283 | processing second plate took 284.545585871 seconds

The for loops in __getitem__ should probably be replaced by hashing.

LeeKamentsky commented 8 years ago

Thanks, Markus, You're right about the caching speedup. I was aiming for simplicity and a low memory footprint there. The problem with a cache here is that the underlying XML document could conceivably be changed and the cache would not reflect the changes. The other problem is that iter returns well names and in your case, the problem would have been solved if it returned wells instead. What do you think about adding a function to WellsDucktype that would return an iterator over Well instances (WellsDucktype.get_wells() and WellSampleDucktype.get_samples())? The code would then look like this:

for w, src_well in enumerate(plate.Well.get_wells()):
    logger.info('process well # %d: %s', w+1, src_well.ID)
    well = new_metadata.WellsDucktype(plate).new(
                    row=src_well.Row, column=src_well.Column)
    samples = new_metadata.WellSampleDucktype(well.node)
    for s, src_sample in enumerate(src_well.get_samples()):
        logger.debug('process well sample # %d', s)
        sample = samples.new(index=s) # needs WellSampleDucktype.new() to return sample
        sample.ImageRef = src_sample.ImageRef

Realistically, I may have some trouble finding time to do this, but I'd take a pull request for the changes needed.

hackermd commented 8 years ago

Thank you for your immediate feedback!

I would welcome an __iter__ method that returns well objects.

Since WellsDucktype inherits from dict a dict-like iterator would be the most intuitive in my opinion:

for well_id, src_well in plate.Well.iteritems():
     well = new_metadata.WellsDucktype(plate).new(
                    row=src_well.Row, column=src_well.Column)
    ...
hackermd commented 8 years ago

This should also be compatible with the current implementation, since

for well_id in plate.Well:
    src_well = plate.Well[well_id]
    ...

would still work.

hackermd commented 8 years ago

The coupling with the underlying XML document is a nice feature, in particular because of the to_xml(). However, it also brings some problems with it. Do you think one could uncouple it and separate XML parsing/dumping from the Python object structure? I would be happy to help.