biocore / biom-format

The Biological Observation Matrix (BIOM) Format Project
http://biom-format.org

loading all metadata during Table.from_hdf5 is very slow #529

Closed adamrp closed 9 years ago

adamrp commented 10 years ago

Not immediately sure how best to improve this. While working on an issue with @ElDeveloper, @wasade, and @antgonza, I noticed that loading all metadata (basically these lines) takes a long time. In this specific case (where the only metadata was taxonomy), load_table took 15.5 seconds on my laptop. When I do not load the metadata, load time drops to about half a second. We might want to add private functions for loading metadata (that can be called by Table.metadata), where the private methods' implementations depend on whether the file is HDF5 -- if the table is HDF5, we might make gains by memory mapping the operation.

ElDeveloper commented 10 years ago

:+1:


wasade commented 10 years ago

@josenavas, did you notice any performance differences between 2.0 and 2.1?

@adamrp, what size table? I don't understand how private functions would change anything there as those are already local functions.

It would of course be terrible if JSON parsing were faster than loading an HDF5 dataset. It looks like that code is loading metadata per ID instead of all of the metadata at once. That is going to void any potential memory-mapping benefit, I think, so it could just be that we should load in full first and then subset (if needed) afterward, on the assumption that metadata will always be a minor memory footprint and quick to bulk load.
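The per-ID versus bulk access patterns under discussion can be sketched with h5py; the dataset path and contents here are illustrative stand-ins, not biom-format's actual on-disk layout:

```python
import numpy as np
import h5py

# Build a small in-memory HDF5 file (nothing written to disk) holding a
# taxonomy-like string dataset. The path name is a placeholder.
n_obs = 19000
taxa = np.array([b"k__Bacteria; p__Firmicutes"] * n_obs)
f = h5py.File("example.biom", "w", driver="core", backing_store=False)
ds = f.create_dataset("observation/metadata/taxonomy", data=taxa)

# Slow pattern: one HDF5 read per ID (a Python-level loop over the dataset).
per_id = [ds[i] for i in range(n_obs)]

# Fast pattern: a single bulk read into memory, subsetting afterward if needed.
bulk = ds[:]

assert per_id[0] == bulk[0]
f.close()
```

The bulk read issues one HDF5 request and lets the library stream the dataset contiguously, whereas the per-ID loop pays the indexing overhead once per element.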


adamrp commented 10 years ago

@wasade, I didn't mean to imply that the intended scope of the function was the important part, but rather that the Table.metadata function would need to behave differently (i.e., returning the requested metadata would have to call different functions) depending on whether the on-disk representation was HDF5.

For example, the current implementation (see this line) assumes all the metadata is in memory. If we were going to speed up load_table for HDF5 biom tables by not loading all of the metadata upon load_table, the Table object would have to know whether the on-disk representation was HDF5 or "other" (so we'd probably want a new attribute -- or attributes -- in the Table class keeping track of the on-disk format and, in the HDF5 case, the h5py.File object). If "other" (JSON/TSV), it would behave as it currently does; but if HDF5, it could load the metadata from disk as needed, possibly memoizing it so that it doesn't load from disk more than once and so that it can keep track of any changes made during current processing.

so it could just be that we should load in full first and then subset (if needed) after on the assumption that metadata will always be a minor memory footprint and quick to bulk load.

I could be wrong, but it looks to me like it actually is loading all the metadata at once (which is reinforced by the comment on line 2986), and in this case that is the most time-consuming step.
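The lazy, backend-aware metadata access described above could look roughly like the following minimal sketch; the class and attribute names are hypothetical and not biom-format's actual API:

```python
class Table:
    """Sketch of lazy, memoized metadata access keyed on the on-disk
    format. Names are hypothetical, not biom-format's real interface."""

    def __init__(self, backend, h5grp=None, inline_metadata=None):
        self._backend = backend          # 'hdf5' or 'other' (JSON/TSV)
        self._h5grp = h5grp              # h5py.Group when backend == 'hdf5'
        self._metadata = inline_metadata # parsed into memory for JSON/TSV
        self._cache = {}                 # memoized per-category loads

    def metadata(self, category):
        if self._backend != 'hdf5':
            # JSON/TSV: everything was parsed into memory at load time.
            return [m[category] for m in self._metadata]
        if category not in self._cache:
            # HDF5: bulk-read the whole category from disk once, then reuse.
            self._cache[category] = list(self._h5grp[category][:])
        return self._cache[category]


# Usage with the in-memory ("other") backend:
t = Table('other', inline_metadata=[{'taxonomy': 'k__Bacteria'},
                                    {'taxonomy': 'k__Archaea'}])
print(t.metadata('taxonomy'))  # ['k__Bacteria', 'k__Archaea']
```

Memoizing per category means a task that only needs taxonomy never pays for the other metadata categories, which matches the use cases mentioned later in the thread.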

adamrp commented 10 years ago

Oh, and the table was about 900 samples and 19,000 OTUs.

wasade commented 10 years ago

The for loop is per ID and indexes into each metadata dataset, with one dataset per metadata type. It is loading everything, just not in full or with fancy indexing. Just to verify, this was profiled, right?

We currently do not do lazy eval for metadata or data. If we load in bulk or with fancy indexing, then I don't think there will be a big performance drain, as biom 2.0 was fast. If it remains slow, then lazy eval and the associated complexity may be worth it.
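To answer the "was this profiled?" question, a run can be profiled with the standard library's cProfile; the workload function here is a stand-in, and in practice you would profile the real biom.load_table call on a slow table:

```python
import cProfile
import io
import pstats

def load_table_like(n):
    # Placeholder workload; substitute the actual load_table('table.biom')
    # call when profiling for real.
    return [str(i) for i in range(n)]

prof = cProfile.Profile()
prof.enable()
load_table_like(100000)
prof.disable()

# Report the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

Sorting by cumulative time makes it easy to see whether the time is spent in the metadata loop or elsewhere in the loader.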


wasade commented 10 years ago

I realize, BTW, that the mention of mmap may have confused this. What I meant was HDF5's mmap of datasets, which is voided when you do single indexing over multiple datasets (e.g., for id: for dataset:), which is the current approach. BIOM itself does not do anything with mmap or have support for it.


adamrp commented 10 years ago

That makes sense. I don't know how to improve the loading as it's currently implemented, so instead I was suggesting that we might want to do lazy metadata loading (only for HDF5 biom files), since it can apparently be quite slow with the current implementation. Some tasks do not require loading any metadata at all, while others rely on only one metadata category (e.g., summarizing the table by a particular category).

My vote is for whichever is easier!

josenavas commented 10 years ago

@wasade I did not run any performance tests during the development of 2.1... my bad.

I will do some tests and look deeper into the code to determine the fastest way to load the metadata. I'll work on this by the end of the week or the beginning of next week.

ElDeveloper commented 10 years ago

Could this also be slowing down the performance of subsample? Subsampling a 900 samples x 16,000 observations table takes around 16 seconds, which seems too slow to me, especially because this method specifically has an underlying Cython implementation.

josenavas commented 10 years ago

I don't think this has anything to do with subsample, as subsample should operate in memory.

wasade commented 10 years ago

@ElDeveloper, that is much slower than I'd expect. Can you open a new issue and, if you have a second, profile it please? I'd be very interested to know what is going on.

ElDeveloper commented 10 years ago

Will do later today and will provide an example table.


wasade commented 10 years ago

Thanks!

ElDeveloper commented 10 years ago

Just opened #531

wasade commented 9 years ago

@josenavas, did you ever get around to doing any testing on improving load speeds?