Rework metrics - Githubissues

LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.

https://childproject.readthedocs.io

MIT License

13 stars 5 forks source link

Rework metrics #376

Closed LoannPeurey closed 2 years ago

LoannPeurey commented 2 years ago

The metrics pipeline needs more flexibility, the goal is to have the possibility to include whatever metric we want in the resulting csv file (from a list of available metrics or from a custom made function taking annotations and duration and outputting a value to include in the result) Old pipelines should be kept, but will work by simply giving the list of what was done in the past to the new system. The period pipeline is going to disappear as --period will become an option available for every extraction and not restricted to that specific subcommand. In order to give all the metrics wanted as parameters, a file will be used (yml ? / csv ?, maybe both). At the end of the extraction, all the parameters used will be saved in a file. This way of handling things will come with a performance hit (as every metric is computed separately, the previous optimizations taking advantage of dependencies or computing in batch the values will not be possible anymore) and this needs to be evaluated

[x] core of the Metrics class
[x] list of available metrics with their function (metricsFunctions.py)
[x] LenaMetrics and AclewMetrics backward compatibility
[x] Usage of custom functions passed to the Metrics class to allow for custom metrics
[x] Define the format and content of parameters file
[x] outputs a file containing the parameters used
[ ] ~~performance comparison~~ moved as an issue for later
[x] Documentation
[x] Tests
[ ] ~~Turns count implementation~~ later implementation

LoannPeurey commented 2 years ago

Additional corrections to make :

[X] Manage conflicts between cli arguments and parameters in the yml file (probably raise error if conflict). => outdated, yml file pipeline does not allow for additional cli arguments
- [x] Manage conflicts of name (ie multiple metrics would have the same name (raise error or warn ? )
- [x] drop support for python 3.6
- [x] check for correct metrics values returned between 0 and NaN

LoannPeurey commented 2 years ago

[x] check that recordings option works

LoannPeurey commented 2 years ago

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?

kalenkovich commented 2 years ago

New code and tests make sense to me. @LoannPeurey, would you like me to check anything specific in more detail?

Next step being the addition of the lena CTC CVC to our supported metrics list, is that clear for you how we do it?

Something like this?

@metricFunction(args={}, cols={"lena_ctc"})
def lena_ctc(annotations: pd.DataFrame, max_distance_ms: int = 1, **kwargs):
   <pretty much a copy of the current ctc code>

Then "lena_ctc" should be supplied in the input dataframe.

LoannPeurey commented 2 years ago

The main difference is that we don't have a "lena_ctc" column extracted from its files. We store the columns:

lena_speaker
lena_block_number
lena_block_type
lena_conv_status
lena_response_count
lena_conv_turn_type
lena_conv_floor_type
utterances_count
utterances_length

So CTC should be computed from some of this columns I think. You can see what converted its files are from here for example. And the description of the column content here

I see that you have added the max_distance_ms argument, foes lena allow to change that? If we can include it, it should be as a kwarg, the duration arg however should alway be here, even if not used because it is passed to all metrics.

So closer to:

@metricFunction(args={}, cols={"lena_conv_status","lena_conv_turn_type" }) #put relevant columns
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
   max_distance_ms = kwargs["max_distance_ms"]
   <pretty much a copy of the current ctc code>

kalenkovich commented 2 years ago

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦

As for the actual question, I would extract one more column from its by adding the following to converters.py

class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]

That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like

        df = pd.DataFrame(segments)

        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df

After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):

@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

LoannPeurey commented 2 years ago

Since we already merged this branch and went on to new additions, I suggest you check out the #394 PR, @kalenkovich I'll add you as a reviewer there, so check the changes if you want (and your review will probably be useful). Just know that this will be merged and hopefully released at the beginning of next week regardless.

LoannPeurey commented 2 years ago

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦

As for the actual question, I would extract one more column from its by adding the following to converters.py
class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
        df = pd.DataFrame(segments)

        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df
After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):
@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()

Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets

kalenkovich commented 2 years ago

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦 As for the actual question, I would extract one more column from its by adding the following to converters.py
class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
        df = pd.DataFrame(segments)

        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df
After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):
@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets

Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.

LoannPeurey commented 2 years ago

Hi, @LoannPeurey! I totally misunderstood your question, to be honest. I thought you were checking if I understood the new metrics logic 🤦 And obviously I misread the purpose of the columns argument of the decorator function 🤦 🤦 Also I forgot that we were trying to get ctc as calculated by LENA, not calculate ctc from its files the same way it is calculated from the VTC files. Hence, the max_distance_ms: int = 1 argument. 🤦🤦🤦 As for the actual question, I would extract one more column from its by adding the following to converters.py
class ItsConverter(AnnotationConverter):
    ...
                lena_cumulative_ctc = conversation_info[2]
That field is empty except for the segments which are the last part of a given conversational turn, therefore, it has to be forward-filled and then zero-filled in the beginning. I would probably do it at the conversion stage with something like
        df = pd.DataFrame(segments)

        # lena_cumulative_ctc is NA for any segment where ctc hasn't changed, let's fill the holes
        df['lena_cumulative_ctc'] = df.lena_cumulative_ctc.ffill().fillna(0)

        return df
After that, for any interval we would take max(lena_cumulative_ctc) - min(lena_cumulative_ctc):
@metricFunction(columns={"lena_cumulative_ctc"}, emptyValue = 0)
def lena_ctc(annotations: pd.DataFrame, duration: int, **kwargs):
    """Conversational turn count (ctc) as calculated by LENA."""
    return annotations.lena_cumulative_ctc.max() - annotations.lena_cumulative_ctc.min()
Understood, if we have to import a new column from the its, it means that this will break backward compatibility and we will need to reimport most of our datasets
Isn't that a common situation where you need some extra information from the raw files? Just curious, because in this case, it is probably unnecessary since your calculation should yield the same numbers as mine.

I am not sure if that has occured in the past. I am far from being an expert on lena's file so I am not sure how exhaustive our importation is. But in most other file types, I think the importation aims at getting all the info converted.