LAAC-LSCP / datasets

DataLad superdataset including all the datasets currently managed by the LAAC/LSCP team
https://laac-lscp.github.io/ChildRecordsData/

Solomon work #18

Open lucasgautheron opened 3 years ago

lucasgautheron commented 3 years ago

VTC is running for solomon on oberon, and it should take about a week.

lucasgautheron commented 3 years ago

Release held jobs once tsimane2018 started:

scontrol release 120615,120614,120612,120613,120611,120610,120609,120608,120607,120606,120605,120604,120603,120602,120601,120600,120599,120598,120597,120596,120595
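For future reference, a small sketch of how such a command can be built programmatically (the job ids above happen to be contiguous; the range below is hypothetical):

import subprocess

# release a contiguous range of held SLURM jobs in a single scontrol call
job_ids = range(120595, 120616)
subprocess.run(['scontrol', 'release', ','.join(map(str, job_ids))], check = True)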
lucasgautheron commented 3 years ago

Current result:

annotations.zip

Generated with the following script:

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
import os
import pandas as pd
import glob
import pympi

project = ChildProject('.')
am = AnnotationManager(project)

# list the solis .eaf files, keeping paths relative to raw_annotations/
files = pd.DataFrame([
    {'raw_filename': os.path.join('solis', os.path.basename(f))} for f in glob.glob('raw_annotations/solis/*.eaf')
])

# parse the recording stem, part number and annotator initials from each filename
extract = files['raw_filename'].str.extract(r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_([0-9])?(?:.*)(?:\-|_)([A-Z]{2})")
files['recording_filename'] = extract[0] + '.WAV'
files['part'] = extract[1]
files['author'] = extract[2]
files = files.merge(project.recordings, how = 'inner', left_on = 'recording_filename', right_on = 'filename')

# files without an explicit part number are part 1; each part covers 15 hours
files['part'] = files['part'].fillna(1)
files['time_seek'] = (files['part'].astype(int)-1)*15*3600
files['set'] = 'solis_eaf_' + files['author']

# read the annotated portions from the 'code_periodic' tier of each eaf
ranges = []
for f in files.to_dict(orient = 'records'):
    eaf = pympi.Elan.Eaf(os.path.join('raw_annotations', f['raw_filename']))
    portions = eaf.tiers['code_periodic'][0]

    for pid in portions:
        (start_ts, end_ts, value, svg_ref) = portions[pid]
        (start_t, end_t) = (eaf.timeslots[start_ts], eaf.timeslots[end_ts])

        # convert pympi's millisecond timeslots to seconds
        ranges.append({'raw_filename': f['raw_filename'], 'range_onset': int(start_t/1000), 'range_offset': int(end_t/1000)})

ranges = pd.DataFrame(ranges)

# .copy() avoids pandas' SettingWithCopyWarning on the next assignment
inputs = files[['recording_filename', 'raw_filename', 'time_seek', 'set']].copy()
inputs['format'] = 'eaf'

ranges = ranges.merge(inputs, how = 'left', on = 'raw_filename')
am.import_annotations(ranges)

Based on https://github.com/LAAC-LSCP/ChildRecordsData/pull/49/commits/54cee9b38df99a802af819a43520a751b5c87340
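For reference, a quick check of what the filename pattern extracts (the filename below is made up, for illustration only):

import pandas as pd

files = pd.Series(['solis/1234_CW01_CH02_AJ03_FB04_200101_1-LM.eaf'])
print(files.str.extract(
    r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_([0-9])?(?:.*)(?:\-|_)([A-Z]{2})"
))
# column 0: recording stem, 1: part number ('1'), 2: annotator initials ('LM')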

lucasgautheron commented 3 years ago

Last result: annotations.zip

import_eaf.py

(identical to the script posted in the previous comment)
lucasgautheron commented 3 years ago

_analyses_only.pdf

So the first step for solomon would be "Get all stats at the level of the recording".

The derived metrics that this analysis outputs are explained in the "Derived metrics" section below.
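A minimal sketch of step one, using the ChildProject helpers that appear later in this thread (get_segments, get_vc_stats):

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

project = ChildProject('.')
project.read()
am = AnnotationManager(project)

# one stats table per VTC annotation file
vtc = am.annotations[am.annotations['set'] == 'vtc']
for af in vtc['annotation_filename'].unique():
    segments = am.get_segments(vtc[vtc['annotation_filename'] == af])
    stats = am.get_vc_stats(segments)  # per-speaker counts, durations, turns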

How vocalization events are "integrated":

- vocalization counts (i.e. how many vocalization events)
- vocalization duration (i.e. total time vocalized)

Sources of vocalizations:

- key child (CHI)
- female adults (FEM)
- male adults (MAL)
- all adults (ADU, summing FEM and MAL)
- other children (OCH)

Context of vocalizations:

- unmarked (i.e. all vocalizations regardless of the context)
- child-directed speech or "cds", for non-CHI speakers (only vocalizations that occur in temporal proximity of the key child's vocalizations)
- "conversations", for the child's vocalizations that occur in temporal proximity of someone else's vocalizations

Finally, some authors have popularized the notion of "turns", counting the number of times an adult and a child vocalize in close temporal proximity. These measures tend to have terrible machine-human reliability, but for comparability with previous work (and since our automatic algorithm is much better than the one used in previous work), we have also extracted turns.

This is the full current list of derived metrics:

- counts (unmarked): vc_chi, vc_fem, vc_mal, vc_adu, vc_och
- duration (unmarked): voc_dur_chi, voc_dur_fem, voc_dur_mal, voc_dur_adu, voc_dur_och
- cds: voc_dur_och_cds, voc_dur_fem_cds, voc_dur_mal_cds, voc_dur_adu_cds
- conv(ersations): voc_dur_chi_conv
- turns: turns_chi, turns_och, turns_fem, turns_mal, turns_adu
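As a toy illustration (not the pipeline itself), counts and durations reduce to a groupby over a segments table with speaker_type and onset/offset columns:

import pandas as pd

# toy segments table; real ones come from get_segments()
segments = pd.DataFrame({
    'speaker_type': ['CHI', 'FEM', 'CHI', 'MAL'],
    'segment_onset': [0.0, 2.0, 5.0, 9.0],
    'segment_offset': [1.5, 4.0, 6.0, 10.0],
})
segments['duration'] = segments['segment_offset'] - segments['segment_onset']
print(segments.groupby('speaker_type').size())              # vc_chi, vc_fem, ...
print(segments.groupby('speaker_type')['duration'].sum())   # voc_dur_chi, ...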

step two: basic checks

once we have all those metrics for each recording, check the correlation and the error rate across the two recordings associated with each child

we expect vc_chi from recording 1 and vc_chi from recording 2 to be identical
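A sketch of that check, assuming a per-recording metrics table indexed by child and date (toy values):

import pandas as pd

# toy table: two recordings per (child_id, date_iso)
stats = pd.DataFrame({
    'child_id': [1, 1, 2, 2],
    'date_iso': ['2019-01-01', '2019-01-01', '2019-01-02', '2019-01-02'],
    'recording': [1, 2, 1, 2],
    'vc_chi': [120.0, 118.0, 80.0, 95.0],
})
wide = stats.pivot_table(index = ['child_id', 'date_iso'], columns = 'recording', values = 'vc_chi')
print(wide[1].corr(wide[2]))  # correlation between paired recordings
print(((wide[1] - wide[2]).abs()/wide.mean(axis = 1)).describe())  # relative error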

step three: descriptive statistics

boxplot+individual points of all the derived metrics, to check for outliers or other problems
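Something along these lines (toy data):

import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
metrics = {'vc_chi': rng.poisson(100, 30), 'vc_fem': rng.poisson(60, 30)}  # toy values

fig, ax = plt.subplots()
ax.boxplot(list(metrics.values()), labels = list(metrics.keys()))
# overlay jittered individual points to spot outliers
for i, values in enumerate(metrics.values(), start = 1):
    ax.scatter(i + rng.uniform(-0.08, 0.08, len(values)), values, s = 10, alpha = 0.5)
plt.show()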

lucasgautheron commented 3 years ago

For the record:

# excerpt from the test code; `project`, `am`, `raw_rttm` and `turntakingthresh`
# are defined in the surrounding scope
import os
import subprocess
import pandas as pd

proc = subprocess.Popen(
    [
        'Rscript', 'tests/truth/extract_quantities.R',
        os.path.join(project.path, 'raw_annotations', raw_rttm),
        str(turntakingthresh),
        'tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh)
    ],
    stdout = subprocess.PIPE,
    stderr = subprocess.PIPE
)
stdout, stderr = proc.communicate()

# reshape the R output (one 'metric.speaker' value per row) into a speaker x metric table
truth_vc = pd.read_csv('tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh), names = ['col', 'value'])
truth_vc.dropna(inplace = True)
truth_vc['metric'] = truth_vc['col'].str.split('.', expand = True)[0]
truth_vc['speaker_type'] = truth_vc['col'].str.split('.', expand = True)[1]
truth_vc['speaker_type'] = truth_vc['speaker_type'].map(am.VTC_SPEAKER_TYPE_TRANSLATION)
truth_vc.drop(columns = ['col'], inplace = True)
truth_vc = truth_vc.pivot(index = 'speaker_type', columns = 'metric').droplevel(0, axis = 1)
truth_vc.columns.name = None
truth_vc.to_csv('tests/truth/vc_truth_{:.1f}.csv'.format(turntakingthresh))

extract_quantities.R:
#!/usr/bin/env Rscript

#' read rttm files
#'
#' read and prettify rttm files
#'
#' @param x character, path to rttm file
#' @param from,to numeric, specify subset along the time axis (by default both
#' are \code{NULL}, i.e. the entire file is considered)
#' @details The function adds an end time column and assigns meaningful column
#' names for the most relevant columns. The actual content of the eighth column
#' (\code{tier}) depends on the rttm file. For example, 'yunitator' rttm files
#' have levels \code{CHI}, \code{FEM} and \code{MAL}, while SAD outputs contain
#' \code{speech} (or in the case of the full noisemes module other labels like
#' \code{background}, \code{noise} etc).
#'
#' If the rttm file is empty (i.e. contains no annotations), the result is a
#' data frame with no rows.
#'
#' If either \code{from} or \code{to} are specified, segments that overlap are
#' cut accordingly. For example, if \code{from = 8} and there is a segment that
#' starts at 7.5 and ends 8.7, this segment will be included as starting at 8
#' (i.e. the \code{from} value) and ending at 8.7.
#'
#' @return a data.frame
#' @export
#'
#' @examples
#' rttm <- system.file("noisemesSad_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("noisemesFull_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("yunitator_english_synthetic_speech_overlap.rttm",
#'                     package = "avutils")
#' read_rttm(rttm)
#' rttm <- system.file("tocomboSad_synthetic_speech_overlap.rttm",
#'                      package = "avutils")
#' read_rttm(rttm)
#' # time subsetting
#' read_rttm(rttm, from = 8, to = 11)
#' # empty result because no annotations before 5 seconds
#' read_rttm(rttm, to = 5)

read_rttm <- function(x,
                      from = NULL,
                      to = NULL) {
  if (length(readLines(x)) > 0) {
    res <- read.table(x, header = FALSE)
    colnames(res)[c(4, 5, 8)] <- c("start", "duration", "tier")
    res$end <- res$start + res$duration
  } else {
    res <- matrix(0, nrow = 0, ncol = 10)
    # res[1, ] <- NA
    colnames(res) <- paste0("V", 1:ncol(res))
    colnames(res)[c(4, 5, 8, 10)] <- c("start", "duration", "tier", "end")
    res <- data.frame(res)
  }

  if (!is.null(from)) {
    res <- res[res$end > from, ]
    if (nrow(res) > 0) {
      if (res$start[1] < from) {
        res$start[1] <- from
        res$duration[1] <- res$end[1] - res$start[1]
      }
    }
  }

  if (!is.null(to)) {
    res <- res[res$start < to, ]
    if (nrow(res) > 0) {
      if (res$end[nrow(res)] > to) {
        res$end[nrow(res)] <- to
        res$duration[nrow(res)] <- res$end[nrow(res)] - res$start[nrow(res)]
      }
    }
  }

  attributes(res)$filename <- basename(x)
  rownames(res) <- NULL
  res
}

#' extract child speech counts, duration and conversational turns
#'
#' @param path character, path to file with results from yunitator or Elan
#' @param turntakingthresh numeric, the threshold for the time between onsets of
#'        two subsequent segments to be considered a turn. Default is \code{1}
#'        second.
#' @param tiernames a list with two elements. By default set to \code{NULL},
#'        where tier names are guessed from the data type. If supplied, element
#'        1 (\code{targetchild}) should be a character of length one, and the
#'        second element (\code{adults}) a character vector of any length. See
#'        details.
#' @param ... additional arguments (currently \code{from} and \code{to} for
#'        \code{\link{read_rttm}})
#'
#' @details if the supplied file is an .rttm file, the following tier names are
#' assumed: \code{list(targetchild = "KCHI", adults = c("FEM", "MAL","CHI"))}. If
#' supplying an .eaf file, tier names are assumed as follows:
#' \code{list(targetchild = "CHI", adults = c("FA1", "MA1"))}.
#'
#' @return a list with three elements.
#' \itemize{
#' \item \code{cum_dur} the cumulative duration for each speaker
#' \item \code{voc_count} the number of speech segments for each speaker
#' \item \code{turns} the number of turns
#' }
#' @export

extract_quantities <- function(path,
                               turntakingthresh = 1,
                               tiernames = NULL,
                               ...) {
  # determine data type
  type <- rev(unlist(strsplit(basename(path), ".", fixed = TRUE)))[1]

  # set tier names depending on file type (matching the defaults documented above)
  if (is.null(tiernames)) {
    if (type == "rttm") {
      targetchild <- "KCHI"
      adults <- c("FEM", "MAL", "CHI")
    }
    if (type == "eaf") {
      targetchild <- "CHI"
      adults <- c("FA1", "MA1")
    }
  } else {
    targetchild <- tiernames$targetchild
    adults <- tiernames$adults
  }

  if (type == "rttm") {
    xdata <- read_rttm(x = path, ...)
  }

  if (type == "eaf") {
    adults <- c("FA1", "MA1")
    xdata <- read_elan(x = path)
    xdata <- xdata[xdata$tier %in% c(targetchild, adults), ]
    xdata <- droplevels(xdata[xdata$content %in% c("s", "s?"), ])
  }

  # remove the SPEECH tier if it's there
  xdata <- xdata[xdata$tier != "SPEECH", ]

  # handle empty rttm files correctly
  if (nrow(xdata) > 0) {
    # count vocalizations and total duration of vocalizations
    res1 <- tapply(X = xdata$duration, INDEX = xdata$tier, sum)
    res2 <- tapply(X = xdata$duration, INDEX = xdata$tier, length)
    # turn taking
    xdata$tier <- as.character(xdata$tier)
    # calc inter turn interval
    xdata$iti <- xdata[, "start"] - c(NA, xdata[-nrow(xdata), "end"])
    # get previous seg type
    xdata$prev_label <- c(NA, xdata[-nrow(xdata),"tier"])
    # mark turns
    xdata$turn <- ifelse((xdata$tier %in% targetchild &
                            xdata$prev_label %in% adults &
                            xdata$iti < turntakingthresh) |
                           (xdata$tier %in% adults &
                              xdata$prev_label %in% targetchild &
                              xdata$iti < turntakingthresh), 1, 0)
    # count turns
    res3 <- tapply(X = xdata$turn, INDEX = xdata$tier, sum)

    # get post ITI
    xdata$post_iti <-  c(xdata[-1, "start"],NA) - xdata[, "end"]
    # get next seg type
    xdata$next_label <- c(xdata[-1,"tier"], NA)
    # mark cds
    xdata$cds <- ifelse((xdata$tier %in% targetchild &
                            xdata$prev_label %in% adults &
                            xdata$iti < turntakingthresh) |
                           (xdata$tier %in% adults &
                              xdata$prev_label %in% targetchild &
                              xdata$iti < turntakingthresh)|
                          (xdata$tier %in% targetchild &
                             xdata$next_label %in% adults &
                             xdata$post_iti < turntakingthresh) |
                          (xdata$tier %in% adults &
                             xdata$next_label %in% targetchild &
                             xdata$post_iti < turntakingthresh)
                          , 1, 0)
    # sum durations for CDS
    res4 <- tapply(X = xdata$duration[xdata$cds==1], INDEX = xdata$tier[xdata$cds==1], sum)

    return (c(cum_dur = res1,
                voc_count = res2,
                turns = res3,
                cds_dur = res4))
  } else {
    return(c(cum_dur = NA,
                voc_count = NA,
                turns = NA,
                cds_dur = NA))
  }
}

args <- commandArgs(trailingOnly = TRUE)
# args: 1 = input rttm/eaf, 2 = turn-taking threshold (seconds), 3 = output csv
L <- extract_quantities(args[1], as.numeric(args[2]))
write.csv(as.data.frame(L), args[3])
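For intuition, a toy Python re-implementation of the turn rule above (hypothetical segments; a segment counts as a turn when it follows a segment by the other party within the threshold):

import pandas as pd

thresh = 1.0  # turn-taking threshold in seconds
df = pd.DataFrame({  # hypothetical segments, ordered by onset
    'tier':  ['KCHI', 'FEM', 'KCHI', 'MAL'],
    'start': [0.0, 1.2, 3.0, 10.0],
    'end':   [1.0, 2.5, 4.0, 11.0],
})
df['iti'] = df['start'] - df['end'].shift()   # inter-turn interval
df['prev'] = df['tier'].shift()               # previous speaker
child, adults = 'KCHI', ['FEM', 'MAL', 'CHI']
df['turn'] = (
    ((df['tier'] == child) & df['prev'].isin(adults) & (df['iti'] < thresh)) |
    (df['tier'].isin(adults) & (df['prev'] == child) & (df['iti'] < thresh))
).astype(int)
print(df)  # the FEM segment at 1.2s and the KCHI segment at 3.0s count as turns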
lucasgautheron commented 3 years ago

Evaluate the stability of the procedure on pairs of recordings:

import os
import pandas as pd
import sys
import multiprocessing as mp
import numpy as np
from functools import partial
import sox

from matplotlib import pyplot as plt

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

project = ChildProject(sys.argv[1])
project.read()

def get_audio_duration(filename):
    if not os.path.exists(filename):
        return 0

    duration = 0
    try:
        duration = sox.file_info.duration(filename)
    except Exception:
        # treat unreadable files as zero-length
        pass

    return duration

project.recordings['duration'] = project.recordings['filename'].map(lambda f:
    get_audio_duration(os.path.join(project.path, 'recordings', f))
)

min_durations = project.recordings.groupby(['child_id', 'date_iso'])['duration'].min().reset_index().rename(columns = {'duration': 'min_duration'})
recordings = project.recordings.merge(min_durations, left_on = ['child_id', 'date_iso'], right_on = ['child_id', 'date_iso'])

am = AnnotationManager(project)
am.annotations = am.annotations.merge(recordings[['filename', 'min_duration']], left_on = 'recording_filename', right_on = 'filename')

vtc = am.annotations[am.annotations['set'] == 'vtc']
vtc = vtc[vtc['error'].isnull()]

def get_stats(am, af):
    try:
        annotation = am.annotations[am.annotations['annotation_filename'] == af]
        segments = am.get_segments(annotation)
        # clip each recording to the duration of the shortest recording of its pair
        segments = am.clip_segments(segments, 0, segments['min_duration'].min())
        print(af, segments.shape[0])
        df = am.get_vc_stats(segments).reset_index().assign(annotation_filename = af)
    except Exception:
        # skip annotations that fail to load
        df = pd.DataFrame()

    return df

# compute the stats in parallel once, then cache them to all_stats.csv
if not os.path.exists('all_stats.csv'):
    pool = mp.Pool()
    all_stats = pool.map(
        partial(get_stats, am),
        vtc['annotation_filename'].tolist()
    )

    all_stats = pd.concat(all_stats)
    all_stats = all_stats.merge(vtc[['annotation_filename', 'recording_filename']], how = 'left', left_on = 'annotation_filename', right_on = 'annotation_filename')
    all_stats = all_stats.merge(recordings[['filename', 'child_id', 'date_iso', 'min_duration']], how = 'left', left_on = 'recording_filename', right_on = 'filename')
    all_stats.to_csv('all_stats.csv', index = False)

all_stats = pd.read_csv('all_stats.csv')

# exclude recordings that result from merged files (listed in metadata/merged.csv)
undergone_merge = pd.read_csv('metadata/merged.csv')
undergone_merge['recording'] = undergone_merge['filename'].str.extract(r"([0-9]+_(?:CW[0-9]+|NA)_CH[0-9]+_(?:AJ|FB|LM)[0-9]+_(?:AJ|FB|LM)[0-9]+_[0-9]{6})_") + '.WAV'
undergone_merge.dropna(inplace = True)
all_stats = all_stats[~all_stats['filename'].isin(undergone_merge['recording'].unique())]

# compare the two recordings of each (child, date) pair
grouped = all_stats.groupby(['child_id', 'date_iso', 'speaker_type'])
df = grouped.agg({
    'cum_dur': ['std', 'mean', 'count'],
    'voc_count': ['std', 'mean'],
    'turns': ['std', 'mean'],
    'cds_dur': ['std', 'mean'],
    'min_duration': 'min'
})
df = df[df['cum_dur']['count'] >= 2]        # keep only actual pairs
df = df[df['min_duration']['min'] >= 3600]  # require at least one hour of audio

# relative error for each metric: coefficient of variation (std/mean) across the pair
for metric in ['cum_dur', 'voc_count', 'turns', 'cds_dur']:
    df[metric + '_err'] = df[metric]['std']/df[metric]['mean']

metric = 'cds_dur'

df = df.reset_index()

print(df.groupby('speaker_type')[metric + '_err'].describe(percentiles=[0.5, 0.95]))
print(df.groupby('speaker_type')[metric + '_err'].agg([np.mean, np.median]))

# df[['speaker_type', metric + '_err']].groupby('speaker_type').boxplot(column = [metric + '_err'])
df[['speaker_type', metric + '_err']].hist(by = 'speaker_type', bins = 100, histtype = 'step')

scatter = all_stats[['speaker_type', 'child_id', 'date_iso', metric]].groupby(['speaker_type', 'child_id', 'date_iso'])[metric].apply(lambda x: x.tolist())
scatter = pd.DataFrame(scatter.tolist(), index=scatter.index)\
       .rename(columns=lambda x: x + 1)\
       .add_prefix('err')\
       .reset_index()

print(scatter)

fig, ax = plt.subplots(figsize=(8,6))
u, scatter['label_num'] = np.unique(scatter['speaker_type'], return_inverse=True)
sc = ax.scatter(x = 'err1', y = 'err2', c = 'label_num', data = scatter, marker = '+')

x = np.linspace(0, 100000, 2)
y = x
ax.plot(x, y, color = 'black', linestyle = 'dashed')

ax.legend(sc.legend_elements()[0], u, title = 'speaker_type')
ax.set_xlim(0, max(scatter['err1'].max(), scatter['err2'].max()))
ax.set_ylim(0, max(scatter['err1'].max(), scatter['err2'].max()))
ax.set_xlabel(metric + ' (recording 1)')
ax.set_ylabel(metric + ' (recording 2)')

plt.show()