Documenting merge functionality

mortonjt commented 2 hours ago

I ran into an interesting edge case that I didn't consider before.

Say that you have two tables with the same sample ids and some of the same features, how does merge handle this?

See the following code and the resulting behavior

import biom
import pandas as pd
df1 = pd.DataFrame({
    'a': [0, 0, 1, 1],
    'b': [0, 0, 1, 1],
    'c': [0, 0, 1, 1],
})

df2 = pd.DataFrame({
    'b': [1, 1, 1, 1],
    'c': [0, 0, 1, 1],
    'd': [0, 0, 1, 1],
})

t1 = biom.Table(df1.values, df1.index, df1.columns)
t2 = biom.Table(df2.values, df2.index, df2.columns)
t1.merge(t2).to_dataframe().to_dict()
{'a': {0: 0, 1: 0, 2: 1.0, 3: 1.0},
 'b': {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0},
 'c': {0: 0, 1: 0, 2: 2.0, 3: 2.0},
 'd': {0: 0, 1: 0, 2: 1.0, 3: 1.0}}

Merge will basically add together the counts between the 2 tables where there are overlapping features. I'm not sure if this is always the appropriate behavior (it works well for my use case), but it may warrant updates to the documentation at the very least.

wasade commented 2 hours ago

That's exactly what merge is designed for? The non overlapping case can be handled more efficiently in concat

mortonjt commented 2 hours ago

When I'm doing merge, most of my use cases resolve around merging samples across different studies -- so I don't normally encounter this use case.

However, I did encounter this when was trying to merge node abundances with tip abundances using from skbio.diversity._util._vectorize_counts_and_tree, where there could be some overlap between features across the same samples. It wasn't clear to me how merge was handling common features (i.e average, sum, pick first, ...). Which is why I thought raising this issue to at least provide "some" context on how merge handles this could be useful.

wasade commented 2 hours ago

The example shown here appears to be consistent with the example contained in merge. I would be happy to accept a PR with revisions to the docstring, but from my perspective, it seems the case here is already documented.

An early version of merge allowed for controlling the exact operation performed when combining values, but it relied on the operator module on a per-element basis as was quite slow. I don't immediately see where in the changelog we removed that, but I think it was quite a ways back, and in part as we were unaware of its use beyond summing.

For the general use noted, I advise concat as it is explicitly designed for axes which are mutually exclusive, and can benefit from faster codepaths.

biocore / biom-format

Documenting merge functionality #977