Closed LoannPeurey closed 2 years ago
Additional corrections to make :
New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?
Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?
New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?
Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?
Something like this?
@metricFunction(args={}, cols={"lena_ctc"})
def lena_ctc(annotations: pd.DataFrame, max_distance_ms: int = 1, **kwargs):
<pretty much a copy of the current ctc code>
Then "lena_ctc" should be supplied in the input dataframe.
The main difference is that we don't have a "lena_ctc" column extracted from its files. We store the columns:
So CTC should be computed from some of this columns I think. You can see what converted its files are from here for example. And the description of the column content here
I see that you have added the max_distance_ms
argument, foes lena allow to change that? If we can include it, it should be as a kwarg, the duration arg however should alway be here, even if not used because it is passed to all metrics.
So closer to:
@metricFunction(args={}, cols={"lena_conv_status","lena_conv_turn_type" }) #put relevant columns
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
max_distance_ms = kwargs["max_distance_ms"]
<pretty much a copy of the current ctc code>
Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns
argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1
argument. 🤦🤦🤦
As for the actual question, I would extract one more column from its by adding the following to converters.py
class ItsConverter(AnnotationConverter):
...
lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
df = pd.DataFrame(segments)
# lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)
return df
After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc)
:
@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
"""Conversational turn count (ctc) as calculated by LENA."""
return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Since we already merged this branch and went on to new additions, I suggest you check out the #394 PR, @kalenkovich I'll add you as a reviewer there, so check the changes if you want (and your review will probably be useful). Just know that this will be merged and hopefully released at the beginning of next week regardless.
Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the
columns
argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, themax_distance_ms: int = 1
argument. 🤦🤦🤦As for the actual question, I would extract one more column from its by adding the following to
converters.py
class ItsConverter(AnnotationConverter): ... lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
df = pd.DataFrame(segments) # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0) return df
After that, for any interval we would take
max(lena_cumulative_ctc) - min(lena_cumulative_ctc)
:@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0) def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs): """Conversational turn count (ctc) as calculated by LENA.""" return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets
Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the
columns
argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, themax_distance_ms: int = 1
argument. 🤦🤦🤦 As for the actual question, I would extract one more column from its by adding the following toconverters.py
class ItsConverter(AnnotationConverter): ... lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
df = pd.DataFrame(segments) # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0) return df
After that, for any interval we would take
max(lena_cumulative_ctc) - min(lena_cumulative_ctc)
:@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0) def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs): """Conversational turn count (ctc) as calculated by LENA.""" return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets
Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.
Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the
columns
argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, themax_distance_ms: int = 1
argument. 🤦🤦🤦 As for the actual question, I would extract one more column from its by adding the following toconverters.py
class ItsConverter(AnnotationConverter): ... lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
df = pd.DataFrame(segments) # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0) return df
After that, for any interval we would take
max(lena_cumulative_ctc) - min(lena_cumulative_ctc)
:@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0) def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs): """Conversational turn count (ctc) as calculated by LENA.""" return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets
Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.
I am not sure if that has occured in the past. I am far from being an expert on lena's file so I am not sure how exhaustive our importation is. But in most other file types, I think the importation aims at getting all the info converted.
The metrics pipeline needs more flexibility, the goal is to have the possibility to include whatever metric we want in the resulting csv file (from a list of available metrics or from a custom made function taking annotations and duration and outputting a value to include in the result) Old pipelines should be kept, but will work by simply giving the list of what was done in the past to the new system. The period pipeline is going to disappear as
--period
will become an option available for every extraction and not restricted to that specific subcommand. In order to give all the metrics wanted as parameters, a file will be used (yml ? / csv ?, maybe both). At the end of the extraction, all the parameters used will be saved in a file. This way of handling things will come with a performance hit (as every metric is computed separately, the previous optimizations taking advantage of dependencies or computing in batch the values will not be possible anymore) and this needs to be evaluatedperformance comparisonmoved as an issue for laterTurns count implementationlater implementation