Requesting `process_dataframe()`

maxschmitt commented 3 months ago

Problem

Currently, it is not possible to apply the processing function of audinterface.Process or audinterface.Feature to a DataFrame object. Such a method would be useful, because process_index() cannot be used efficiently when a Segment object is passed as the segment argument: the labels need to be carried over to the resulting dataframe, and the index after segmentation typically has additional rows.

Solution

A new method process_dataframe() could solve this.

- If no segmentation is required, the behaviour is very similar to process_index(), but all labels are kept and attached to the output.
- If a segmentation is required, for each row in the input, the labels are duplicated and attached to all corresponding rows at the output.
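The intended behaviour can be illustrated with plain pandas. This is only a sketch, not the audinterface implementation: `carry_labels` and its `segments_per_row` argument are hypothetical names, and the segments are written by hand instead of coming from a Segment object.

```python
import numpy as np
import pandas as pd


def carry_labels(df, segments_per_row):
    """Hypothetical helper: repeat each row's labels for all of its segments.

    ``segments_per_row`` holds, for every row of ``df``,
    a list of ``(start, end)`` tuples in seconds.
    """
    counts = [len(segments) for segments in segments_per_row]
    # Row positions, each repeated once per new segment
    positions = np.repeat(np.arange(len(df)), counts)
    files = df.index.get_level_values("file")[positions]
    starts = [start for segments in segments_per_row for start, _ in segments]
    ends = [end for segments in segments_per_row for _, end in segments]
    new_index = pd.MultiIndex.from_arrays(
        [
            files,
            pd.to_timedelta(starts, unit="s"),
            pd.to_timedelta(ends, unit="s"),
        ],
        names=["file", "start", "end"],
    )
    return df.iloc[positions].set_axis(new_index, axis=0)


# Toy filewise table with two label columns (made-up values)
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"], "speaker": [3, 16]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)
# a.wav yields one segment, b.wav yields two
result = carry_labels(df, [[(0.0, 1.0)], [(0.0, 0.7), (0.7, 1.4)]])
print(result)
```

Both label columns of `b.wav` appear twice in the output, once per segment, which is the duplication described in the second case above.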

@hagenw

hagenw commented 3 months ago

Is the idea behind it that the processing function also has access to the labels in the dataframe, or just that you can afterwards use the newly segmented dataframe with the column of original labels as ground truth? The latter could most likely also be solved by a function that takes the original dataframe and the new index as input, and then assigns the original labels to the new segments.
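Such an external function could be sketched as follows for a filewise dataframe. The name `assign_labels` is hypothetical, and the sketch assumes that every file occurs only once in the original dataframe.

```python
import pandas as pd


def assign_labels(df, new_index):
    """Hypothetical helper: copy filewise labels onto a segmented index."""
    files = new_index.get_level_values("file")
    # One vectorised lookup; files missing from ``df`` would yield NaN
    return df.reindex(files).set_axis(new_index, axis=0)


# Toy filewise labels (made-up values)
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)
# Index as it might come back from a segmentation step
new_index = pd.MultiIndex.from_arrays(
    [
        ["a.wav", "b.wav", "b.wav"],
        pd.to_timedelta([0.1, 0.0, 1.5], unit="s"),
        pd.to_timedelta([1.7, 1.4, 2.3], unit="s"),
    ],
    names=["file", "start", "end"],
)
segmented = assign_labels(df, new_index)
print(segmented)
```

The lookup only uses the file level, so it cannot tell apart several original segments of the same file.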

The first part has so far been handled by the special processing arguments idx, file, and root, see https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments. We introduced these based on an earlier issue that proposed adding process_table(): https://github.com/audeering/audinterface/issues/25.

But I agree that it might be more elegant to just add something like process_dataframe() or process_table().

maxschmitt commented 3 months ago

Processing the original labels was not something I had in mind so far, so the "external" function might be sufficient. However, doing this is relatively time-consuming for a large table, so the "elegant" solution would be preferred.

hagenw commented 3 months ago

Having access to the labels is also not that easy, as we usually provide a processing function that also works for process_file(), which means we cannot assume that it has access to the labels. For that, I would stick to the solution introduced with https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.

This means the new process_dataframe() can be restricted to updating the index and assigning the labels accordingly. One challenge we might face here is a naming clash between the original labels of the dataframe and the new ones added by process_func. Maybe it would be better if process_dataframe() then returned two dataframes: one with the original labels and a second with the new labels? In the case of audinterface.Process(), the second object should be a series, not a dataframe. We might also want to support providing a series instead of a dataframe, so I guess we also need a better name for the method. Maybe process_table()?
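The naming clash can be made concrete with a small pandas sketch. The two dataframes below are made-up stand-ins: `labels` for the original table, `predictions` for what a process_func might return.

```python
import pandas as pd

index = pd.Index(["a.wav", "b.wav"], name="file")
# Original labels of the table
labels = pd.DataFrame({"emotion": ["happiness", "anger"]}, index=index)
# Stand-in for the output of a process_func that also uses "emotion"
predictions = pd.DataFrame({"emotion": ["neutral", "anger"]}, index=index)

# Column names present in both objects
clashing = labels.columns.intersection(predictions.columns)
print(list(clashing))

# Joining both into one dataframe fails due to the shared column name
try:
    joined = labels.join(predictions)
except ValueError as error:
    print(error)
    joined = None

# Returning the two objects side by side avoids the clash
result = (labels, predictions)
```

Returning a tuple keeps both columns untouched, at the price of a method whose return type differs from the other process_*() methods.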

maxschmitt commented 3 months ago

process_table() sounds good

hagenw commented 3 months ago

Great, @maxschmitt would you be able to try to work on it?

maxschmitt commented 3 months ago

> able to try

sounds reasonable ;) challenge accepted

maxschmitt commented 2 months ago

@hagenw

Thinking about it, would it actually be necessary to have process_table() also in Process and Feature? I mean, we need it in Process as the segmentation is handled there, but I'm not sure whether it should be an API function.

My idea would be to have process_table() only in Segment.

For Process, it does not make sense imho, and it should be aligned with Feature. If we added it, we would end up with the "dirty" solution of returning two objects. Moreover, there are probably few cases where we want to segment, keep labels, and compute new features at the same time.

maxschmitt commented 2 months ago

I am not sure if I got it correct, but:

The solution from https://github.com/audeering/audinterface/issues/25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions). Moreover, Segment.process_index() cannot return a Series or DataFrame object.

Only Process.process_func has access to the labels, but there we no longer have access to the original index.

The idea of having process_table() somehow breaks the whole framework, as the outputs of the processing functions are no longer consistent within each interface (Segment, Process, Feature).

hagenw commented 2 months ago

> The solution from https://github.com/audeering/audinterface/issues/25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).

Yes and no. If your starting point is a filewise index, you can use the special argument file with the current implementation:

```python
import audb
import audinterface
import auvad

# Prepare data
media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.4.1",
    media=media,
    full_path=False,
    verbose=False,
)
df = db.get("emotion")

def access_label(signal, sampling_rate, file, df, label="emotion"):
    return df.loc[file, label]

vad = auvad.Vad(max_turn_length=1)
interface = audinterface.Feature(
    "emotion",
    process_func=access_label,
    process_func_args={"df": df},
    segment=vad,
)
df_segmented = interface.process_index(db.files, root=db.root)
print(df_segmented)
```

which returns

```
                                                                 emotion
file            start                  end                              
wav/03a01Fa.wav 0 days 00:00:00.120000 0 days 00:00:01.760000  happiness
wav/03a01Nc.wav 0 days 00:00:00.060000 0 days 00:00:01.390000    neutral
wav/16b10Wb.wav 0 days 00:00:00.040000 0 days 00:00:01.450000      anger
                0 days 00:00:01.540000 0 days 00:00:02.380000      anger
```

But you are right: if the starting dataframe already contains a segmented index, we cannot handle it with the current solution. There, you would first need to run the VAD, then create a new dataframe from the index returned by the VAD and assign the labels accordingly. Afterwards, you can use that dataframe together with idx in audinterface.
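For a dataframe that already has a segmented index, this manual step could look roughly like the sketch below. It assumes that every new segment lies completely inside one original segment of the same file; `assign_labels_segmented` and `segmented_index` are hypothetical helper names, and the VAD output is replaced by a hand-written index.

```python
import pandas as pd


def assign_labels_segmented(df, new_index):
    """Hypothetical helper: label new segments by the original segment around them."""
    original = df.reset_index()
    new = new_index.to_frame(index=False)
    # Pair every new segment with all original segments of the same file.
    # This can get large for files with many segments.
    merged = new.merge(original, on="file", suffixes=("", "_orig"))
    inside = (merged["start"] >= merged["start_orig"]) & (
        merged["end"] <= merged["end_orig"]
    )
    # New segments outside of every original segment are dropped here
    merged = merged[inside].drop(columns=["start_orig", "end_orig"])
    return merged.set_index(["file", "start", "end"])


def segmented_index(files, starts, ends):
    """Build a (file, start, end) index from values in seconds."""
    return pd.MultiIndex.from_arrays(
        [files, pd.to_timedelta(starts, unit="s"), pd.to_timedelta(ends, unit="s")],
        names=["file", "start", "end"],
    )


# Original table: one file with two labelled segments (made-up values)
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=segmented_index(["a.wav", "a.wav"], [0, 2], [2, 4]),
)
# Stand-in for the index returned by the VAD
vad_index = segmented_index(["a.wav"] * 3, [0.1, 1.0, 2.2], [0.9, 1.8, 3.5])
result = assign_labels_segmented(df, vad_index)
print(result)
```

The merge on the file level is simple but compares each new segment with all original segments of that file, which relates to the efficiency concern raised for large tables.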

At the moment, I'm not sure how easy/complicated it will be to change audinterface to support this out-of-the-box.

hagenw commented 2 months ago

One straightforward fix for also supporting segmented indices would be to introduce start and end as special arguments as well. Then you would have access to file, start, and end, and could access the original segment in the same way as I do in the above example, where only file is used for a filewise index.

maxschmitt commented 2 months ago

But wouldn't the access to the labels be very inefficient, especially for large tables, as we need to get the labels for each row separately?

My (current) idea is to add a method Segment.process_table() that differs from process_index() only in the loop where the new segments are generated, by also attaching the labels there: https://github.com/audeering/audinterface/blob/bdb078c0b7fd01b99609e730804a01968f71942a/audinterface/core/segment.py#L501

If I am not completely wrong, this would result in only a minor change (the new Segment.process_table()) without affecting any existing code.

The drawback of this method is, of course, that we do not have this new method for the Process and Feature interfaces (not sure if it would be also straightforward to integrate them) but as I said before, I am not sure whether it makes sense to support this at all (given the "multiple-columns" issue).

hagenw commented 2 months ago

I also see the point in adding a process_table() method, since using the special arguments is always complicated to understand anyway. And if an extra process_table() method is also more efficient, all the better.

But I'm not so sure if we could add it only to Segment. Then you would first have to run a Segment object on your dataframe, and afterwards run Feature.process_index() on its index. I would prefer to instantiate Feature with the Segment object provided via the segment argument, so that Feature.process_table() automatically calls Segment.process_table() under the hood. But maybe I also misunderstand your suggestion. If you like, you could create a pull request showing how you would solve the issue.

maxschmitt commented 2 months ago

To be honest, I usually do not feel too comfortable when mixing two independent (segmentation, feature extraction) steps into a single function/method, because it makes the package more complex and less transparent. Is there any disadvantage other than having an additional line of code?

Generally, when doing segmentation and feature extraction, there are two cases:

1. segmentation -> features
2. features -> segmentation

At the moment, only 1. is supported but it might also be relevant to have 2, which requires using/calling audinterface twice, anyway. Just as a thought, I don't want to "ruin" the concept of audinterface, of course.

I implemented a first version of Segment.process_table() here: https://github.com/audeering/audinterface/commit/fd35a8307b2e14972f1846c952f0de0701dd9dc3

Please check and we can see if it makes sense and if we should also have it in Process and Feature.

Test:

```python
import audb
import audinterface
import numpy as np
import os
import pandas as pd


def rms(signal, sampling_rate):
    return 20 * np.log10(np.sqrt(np.mean(signal ** 2)))


def segment(signal, sampling_rate):
    duration = signal.shape[-1] / sampling_rate
    chunk_len = 0.7
    chunks = []
    for i in range(int(duration // chunk_len) + 1):
        chunks.append((i * chunk_len, np.min([(i+1) * chunk_len, duration])))
    index = pd.MultiIndex.from_tuples(
        [
            (
                pd.Timedelta(start, unit="s"),
                pd.Timedelta(end, unit="s"),
            )
            for start, end in chunks
        ],
        names=["start", "end"],
    )
    return index


media = [
    "wav/03a01Fa.wav",
    "wav/03a01Nc.wav",
    "wav/16b10Wb.wav",
]
db = audb.load(
    "emodb",
    version="1.3.0",
    media=media,
    verbose=False,
)
files = list(db.files)
folder = os.path.dirname(files[0])
index = db["emotion"].index

# Compute RMS
interface = audinterface.Process(process_func=rms)
table_series = interface.process_index(index)
print(table_series)

# Segmentation with Series
seg_interface = audinterface.Segment(process_func=segment)
print(seg_interface.process_table(table_series))

# Segmentation with Dataframe
table_df = pd.DataFrame(
    np.concatenate(
        (
            table_series.values.reshape(-1, 1),
            table_series.values.reshape(-1, 1) * 2,
        ),
        axis=-1,
    ),
    table_series.index,
    columns=["RMS", "RMSx2"],
)
print(seg_interface.process_table(table_df))
```

audeering / audinterface

Requesting `process_dataframe()` #167
