Open maxschmitt opened 3 months ago
Is the idea behind it that the processing function also has access to the labels in the dataframe, or just that you can afterwards use the newly segmented dataframe, with the column of original labels, as ground truth? The latter could most likely also be solved by a function that takes the original dataframe and the new index as input and then assigns the original labels to the new segments.
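As a rough illustration of such an "external" function, here is a minimal sketch. It assumes a filewise dataframe (index named file) and a new segmented index with the levels file, start, end. The name assign_labels is hypothetical and not part of audinterface; only pandas is used.

```python
import pandas as pd


def assign_labels(df, new_index):
    """Copy the labels of each file in df to all of its new segments."""
    files = new_index.get_level_values("file")
    labels = df.loc[files]  # one row per new segment, repeated per file
    labels.index = new_index
    return labels


# Toy data: two files, the second one is split into two segments
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.Index(["a.wav", "b.wav"], name="file"),
)
new_index = pd.MultiIndex.from_arrays(
    [
        ["a.wav", "b.wav", "b.wav"],
        pd.to_timedelta([0.1, 0.0, 1.5], unit="s"),
        pd.to_timedelta([1.0, 1.4, 2.3], unit="s"),
    ],
    names=["file", "start", "end"],
)
result = assign_labels(df, new_index)
```

This only covers a filewise starting point; a dataframe that already has a segmented index needs a lookup by segment instead of by file.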
The first part has so far been handled by the special processing function arguments idx, file, and root, see https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.
We introduced this based on an earlier issue that proposed adding process_table(), see https://github.com/audeering/audinterface/issues/25.
But I agree that it might be more elegant to just add something like process_dataframe() or process_table().
Processing the original labels was not something I had in mind so far, so the "external" function might be sufficient. However, doing this is relatively time-consuming for a large table, so the "elegant" solution would be preferred.
Having access to the labels is also not that easy, as we usually provide a processing function that also works for process_file(), which means we cannot assume that it has access to the labels. For that I would stick to the solution introduced with https://audeering.github.io/audinterface/usage.html#special-processing-function-arguments.
This means the new process_dataframe() can be restricted to updating the index and assigning the labels accordingly.
One challenge we might face here is a naming clash between the original labels of the dataframe and the new ones added by process_func. Maybe it would be better if process_dataframe() returned two dataframes: one with the original labels and a second with the new labels? In the case of audinterface.Process() it should also return a series, not a dataframe, as the second object. And we might also want to support providing a series instead of a dataframe. So I guess we also need a better name for the method. Maybe process_table()?
process_table() sounds good.
Great, @maxschmitt would you be able to try to work on it?
"able to try" sounds reasonable ;) challenge accepted
@hagenw
Thinking about it, would it actually be necessary to have process_table() also in Process and Feature?
I mean, we need it in Process as the segmentation is handled there, but I'm not sure whether it should be an API function.
My idea would be to have process_table() only in Segment.
For Process, it does not make sense imho, and it should be aligned with Feature. If we added it, we would end up with the "dirty" solution of returning two objects. Moreover, there are probably only rare cases where we want to segment, keep the labels, and compute new features at the same time.
I am not sure if I got it correct, but:
The solution from https://github.com/audeering/audinterface/issues/25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).
Moreover, Segment.process_index() cannot return a Series or DataFrame object.
Only Process.process_func has access to the labels, but there we no longer have access to the original index.
The idea of having process_table() somehow breaks the whole framework, as the outputs of the processing functions are no longer consistent within each interface (Segment, Process, Feature).
The solution from https://github.com/audeering/audinterface/issues/25 will not work as we would need the idx in the segmentation function, which won't be able to handle idx or labels (consider the case where we have "external" segmentation functions).
Yes and no. If your starting point is a filewise index, you can use the special argument file with the current implementation:

    import audb
    import audinterface
    import auvad


    # Prepare data
    media = [
        "wav/03a01Fa.wav",
        "wav/03a01Nc.wav",
        "wav/16b10Wb.wav",
    ]
    db = audb.load(
        "emodb",
        version="1.4.1",
        media=media,
        full_path=False,
        verbose=False,
    )
    df = db.get("emotion")


    def access_label(signal, sampling_rate, file, df, label="emotion"):
        # Look up the label of the current file in the original dataframe
        return df.loc[file, label]


    # Segment each file with the VAD and attach the file's label to every segment
    vad = auvad.Vad(max_turn_length=1)
    interface = audinterface.Feature(
        "emotion",
        process_func=access_label,
        process_func_args={"df": df},
        segment=vad,
    )
    df_segmented = interface.process_index(db.files, root=db.root)
    print(df_segmented)

which returns

                                                                         emotion
    file            start                  end
    wav/03a01Fa.wav 0 days 00:00:00.120000 0 days 00:00:01.760000  happiness
    wav/03a01Nc.wav 0 days 00:00:00.060000 0 days 00:00:01.390000    neutral
    wav/16b10Wb.wav 0 days 00:00:00.040000 0 days 00:00:01.450000      anger
                    0 days 00:00:01.540000 0 days 00:00:02.380000      anger

But you are right: if the starting dataframe already contains a segmented index, we cannot handle it with the current solution. In that case you would first need to run the VAD, then create a new dataframe from the index returned by the VAD and assign the labels accordingly. Afterwards, you can use that dataframe together with idx in audinterface.
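For the segmented case, the label assignment step could look roughly like the following sketch. It assumes that the new segments use start and end times relative to the file, and that each new segment lies inside exactly one original segment. The name carry_labels is hypothetical; only pandas is used.

```python
import pandas as pd


def carry_labels(df, new_index):
    """Assign to each new segment the labels of the original segment containing it."""
    records = []
    for file, start, end in new_index:
        original = df.loc[file]  # original segments of this file
        starts = original.index.get_level_values("start")
        ends = original.index.get_level_values("end")
        inside = (starts <= start) & (ends >= end)
        # Raises IndexError if no original segment contains the new one
        records.append(original[inside].iloc[0].to_dict())
    return pd.DataFrame(records, index=new_index)


names = ["file", "start", "end"]
df = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.MultiIndex.from_arrays(
        [
            ["a.wav", "a.wav"],
            pd.to_timedelta([0, 2], unit="s"),
            pd.to_timedelta([2, 4], unit="s"),
        ],
        names=names,
    ),
)
new_index = pd.MultiIndex.from_arrays(
    [
        ["a.wav", "a.wav", "a.wav"],
        pd.to_timedelta([0.1, 2.5, 3.2], unit="s"),
        pd.to_timedelta([1.0, 3.0, 3.9], unit="s"),
    ],
    names=names,
)
result = carry_labels(df, new_index)
```

The lookup runs once per new segment, so it will be slow for large tables.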
At the moment, I'm not sure how easy or complicated it will be to change audinterface to support this out of the box.
One straightforward fix to also support segmented indices would be to introduce start and end as special arguments as well. Then you would have access to file, start, and end, and could access the original segment in the same way as I do in the above example, where I use only file for a filewise index.
But wouldn't the access to the labels be very inefficient, especially for large tables, as we need to get the labels for each row separately?
My (current) idea is to add a method Segment.process_table() that differs from process_index() in the loop where the new segments are generated, by also attaching the labels:
https://github.com/audeering/audinterface/blob/bdb078c0b7fd01b99609e730804a01968f71942a/audinterface/core/segment.py#L501
If I am not completely wrong, this would result in only a minor change (the new Segment.process_table()) without affecting any existing code.
The drawback of this method is, of course, that we do not have the new method for the Process and Feature interfaces (not sure if it would also be straightforward to integrate it there), but as I said before, I am not sure whether it makes sense to support this at all (given the "multiple-columns" issue).
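To make the idea concrete, here is a minimal sketch of such a loop. It is not the audinterface implementation: process_table_sketch and the toy segment_func are hypothetical, and the segmentation function is assumed to take (file, start, end) and return a list of (start, end) tuples, whereas a real one would work on the audio signal.

```python
import pandas as pd


def process_table_sketch(table, segment_func):
    """Segment every row of a table and carry its labels to the new segments."""
    new_segments = []
    positions = []  # position of the original row for each new segment
    for pos, (file, start, end) in enumerate(table.index):
        for seg_start, seg_end in segment_func(file, start, end):
            new_segments.append((file, seg_start, seg_end))
            positions.append(pos)
    index = pd.MultiIndex.from_tuples(
        new_segments,
        names=["file", "start", "end"],
    )
    result = table.iloc[positions].copy()  # repeat labels in one step
    result.index = index
    return result


def split_in_half(file, start, end):
    """Toy segmentation: split each segment into two halves."""
    middle = start + (end - start) / 2
    return [(start, middle), (middle, end)]


table = pd.DataFrame(
    {"emotion": ["happiness", "anger"]},
    index=pd.MultiIndex.from_arrays(
        [
            ["a.wav", "b.wav"],
            pd.to_timedelta([0, 0], unit="s"),
            pd.to_timedelta([2, 4], unit="s"),
        ],
        names=["file", "start", "end"],
    ),
)
result = process_table_sketch(table, split_in_half)
```

The labels are copied with a single positional lookup after the loop instead of one label lookup per row, which is the efficiency argument made above.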
I also see the point in adding a process_table() method, since the special arguments are always complicated to understand anyway. And if an extra process_table() method is also more efficient, all the better.
But I'm not so sure if we could add it only to Segment. Then you would first have to run a Segment object on your dataframe, and afterwards run Feature.process_index() on its index. I would prefer to instantiate Feature with the Segment object provided via the segment argument, so that running Feature.process_table() automatically calls Segment.process_table() under the hood.
But maybe, I also misunderstand your suggestion. If you like, you could create a pull request showing how you would solve the issue.
To be honest, I usually do not feel too comfortable when mixing two independent (segmentation, feature extraction) steps into a single function/method, because it makes the package more complex and less transparent. Is there any disadvantage other than having an additional line of code?
Generally, when doing segmentation and feature extraction, there are two cases:
At the moment, only 1. is supported, but it might also be relevant to have 2., which requires calling audinterface twice anyway.
Just as a thought, I don't want to "ruin" the concept of audinterface, of course.
I implemented a first version of Segment.process_table() here:
https://github.com/audeering/audinterface/commit/fd35a8307b2e14972f1846c952f0de0701dd9dc3
Please check, and we can see if it makes sense and if we should also have it in Process and Feature.
Test:
Problem
Currently, it is not possible to apply a processing function of audinterface.Process or audinterface.Feature to a DataFrame object. Such a method would be meaningful, as process_index() cannot be used efficiently when a Segment object is passed as the segment argument, because labels need to be carried over to the resulting dataframe (the resulting index after segmentation typically has additional rows).
Solution
A new method process_dataframe() could solve this. It would work like process_index(), but all labels are kept and attached to the output.
@hagenw