LAAC-LSCP / datasets

DataLad superdataset including all the datasets currently managed by the LAAC/LSCP team
https://laac-lscp.github.io/ChildRecordsData/

Tsimane 2017 #17

Open lucasgautheron opened 3 years ago

lucasgautheron commented 3 years ago

Author: Camilia
Ask: Camilia
Record device: USB (twice per child), USB, LENA, olympus
Raw data: /scratch1/projects/ac_lacie01/STRUCTURE/raw/tsimane2017/
Structured data: /scratch1/projects/ac_lacie01/STRUCTURE/tsimane2017_recordings/
Final output: /scratch1/data/laac_data/tsimane2017

lucasgautheron commented 3 years ago

Separate 2017 and 2018

lucasgautheron commented 3 years ago

import pandas as pd
import datetime
import re

recordings = pd.read_excel('doc/recordings_metadata.xlsx')
recordings = recordings[['date', 'half_of_day_delivery', 'device', 'backup?', 'total time', 'chi_id', 'LENA_output_name']]
recordings.rename(columns = {
    'chi_id': 'child_id',
    'date': 'date_iso',
    'backup?': 'backup',
    'device': 'recording_device_id'  
}, inplace = True)

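def date_to_iso(date):
    # First try reading the leading 10 characters as year-day-month;
    # if that fails, fall back to the dd/mm/yyyy text form.
    try:
        dt = datetime.datetime.strptime(str(date)[:10], "%Y-%d-%m")
    except ValueError:
        dt = datetime.datetime.strptime(str(date), "%d/%m/%Y")

    return dt.strftime("%Y-%m-%d")

def get_filename(row):
    # Build the recording filename from the child id and the compact date (YYYYMMDD)
    return "tsimane2017_{}_{}.wav".format(
        row['child_id'],
        datetime.datetime.strptime(row['date_iso'], "%Y-%m-%d").strftime("%Y%m%d")
    )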

recordings['date_iso'] = recordings['date_iso'].apply(date_to_iso)
recordings['start_time'] = 'NA'
recordings['filename'] = recordings.apply(get_filename, axis = 1)
recordings['recording_device_type'] = recordings['recording_device_id'].apply(lambda s: re.sub(r"[^a-zA-Z]", "", s.lower().split()[0]))

recordings['experiment'] = 'tsimane2017'
recordings.to_csv('recordings/recordings.csv')

# Second script: children metadata
import pandas as pd
import datetime

children = pd.read_excel('doc/recordings_metadata.xlsx')
children = children[['age_mo', 'exact_age', 'sex', 'child_pid', 'name', 'first_lm', 'second_lm', 'mother', 'mother_pid', 'father_pid', 'mother_ed', 'n_of_siblings', 'oder_of_birth', 'chi_id', 'mot_id', 'sib_id']]
# 'device' is not among the selected columns, so it is not renamed here
children.rename(columns = {
    'chi_id': 'child_id',
    'exact_age': 'child_dob',
    'age_mo': 'age',
    'sex': 'child_sex',
    'oder_of_birth': 'order_of_birth'
}, inplace = True)

def date_to_iso(date):
    # Values that are not in dd/mm/yyyy form are recorded as 'NA'
    try:
        dt = datetime.datetime.strptime(str(date), "%d/%m/%Y")
    except ValueError:
        return 'NA'

    return dt.strftime("%Y-%m-%d")

children['child_dob'] = children['child_dob'].apply(date_to_iso)

children['experiment'] = 'tsimane2017'
children.to_csv('children.csv')

lucasgautheron commented 3 years ago

@alecristia can you confirm that the output in /scratch1/data/laac_data/tsimane2017 is correct and is not missing any files or information?

alecristia commented 3 years ago

this is in progress with Camila!

By the way, I realized there are already VTCs for most 2017 files: in /scratch1/projects/ac_lacie01/STRUCTURE/raw/tsimane2017 you can get rttm.zip (or its unzipped version, or the RTTMs themselves).

One file is missing: tsimane2017_C27_20170713.wav

(I think it was just skipped by mistake)
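
A quick way to check for other skipped files is to compare the recording filenames against the RTTM files that are actually present. The following is only a minimal sketch: the helper `missing_rttms` is hypothetical, it assumes each RTTM shares its recording's base name, and the demo uses a made-up file (`tsimane2017_C01_20170701`) in a temporary directory rather than the real data.

```python
import os
import tempfile

def missing_rttms(recording_filenames, rttm_dir):
    """Return the recordings (sorted) that have no matching .rttm file in rttm_dir."""
    # Base names (without extension) of all RTTM files found in the directory
    available = {
        os.path.splitext(name)[0]
        for name in os.listdir(rttm_dir)
        if name.endswith(".rttm")
    }
    return sorted(
        rec for rec in recording_filenames
        if os.path.splitext(rec)[0] not in available
    )

# Demo with made-up files: only the first recording has an RTTM
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "tsimane2017_C01_20170701.rttm"), "w").close()
    demo_missing = missing_rttms(
        ["tsimane2017_C01_20170701.wav", "tsimane2017_C27_20170713.wav"], tmp
    )
    print(demo_missing)
```

On the real dataset, the list of recordings would come from the `filename` column of recordings.csv and `rttm_dir` would point at the unzipped RTTMs.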

lucasgautheron commented 3 years ago

I've imported the VTCs :)

from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
import os

project = ChildProject('.')
am = AnnotationManager(project)

# One row per recording; 'annotations' avoids shadowing the built-in input()
annotations = project.recordings[['filename']].rename(columns = {'filename': 'recording_filename'})
annotations = annotations[annotations['recording_filename'] != 'NA'].copy()
annotations['set'] = 'vtc'
annotations['time_seek'] = 0
annotations['range_onset'] = 0
annotations['range_offset'] = 0
# Each RTTM is expected under vtc/ with the same base name as its recording
annotations['raw_filename'] = annotations['recording_filename'].apply(lambda s: os.path.join('vtc', s.replace('.wav', '.rttm')))
annotations['format'] = 'vtc_rttm'

am.import_annotations(annotations)

alecristia commented 3 years ago

I just heard from Elika Bergelson that "there were 2 tsimane files where vtc gave exactly the same output: tsimane2017_C21_20170717 and tsimane2017_C23_20170719"
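
To see whether any other pairs of files share the same VTC output, the RTTMs can be compared segment by segment. This is a sketch under assumptions, not the check that was actually run: the helpers `rttm_signature` and `find_identical_rttms` are hypothetical, the files are assumed to follow the standard RTTM column layout (onset, duration and speaker in fields 4, 5 and 8), and the file-id column is ignored so that two outputs differing only by filename still match. The demo files are made up.

```python
import os
import tempfile
from collections import defaultdict

def rttm_signature(path):
    """Return the (onset, duration, speaker) tuples of an RTTM, ignoring the file id."""
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8 or fields[0] != "SPEAKER":
                continue
            segments.append((fields[3], fields[4], fields[7]))
    return tuple(segments)

def find_identical_rttms(directory):
    """Group RTTM files whose segments are identical; keep groups of two or more."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        if name.endswith(".rttm"):
            groups[rttm_signature(os.path.join(directory, name))].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Demo with made-up content: a.rttm and b.rttm differ only in the file-id column
with tempfile.TemporaryDirectory() as tmp:
    contents = {
        "a.rttm": "SPEAKER a 1 0.0 1.5 <NA> <NA> CHI <NA> <NA>\n",
        "b.rttm": "SPEAKER b 1 0.0 1.5 <NA> <NA> CHI <NA> <NA>\n",
        "c.rttm": "SPEAKER c 1 2.0 0.7 <NA> <NA> FEM <NA> <NA>\n",
    }
    for name, text in contents.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    demo_duplicates = find_identical_rttms(tmp)
    print(demo_duplicates)
```

Note that empty RTTMs all share the same (empty) signature, so they would be reported as one group.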

lucasgautheron commented 3 years ago

The only thing we can do is try to tell which child the recording belongs to (is there an intro?)
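
If the recordings do start with a spoken intro, one option is to cut the first minute of each file and listen to it. Below is a rough sketch using only the standard library; `extract_intro` is a hypothetical helper, it assumes uncompressed PCM WAV files, and the demo runs on a synthetic two-second silent file rather than a real recording.

```python
import os
import tempfile
import wave

def extract_intro(src_path, dst_path, seconds=60):
    """Copy the first `seconds` of a PCM WAV file to dst_path; return frames written."""
    with wave.open(src_path, "rb") as src:
        # Never read past the end of a recording shorter than `seconds`
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(src.getframerate())
            dst.writeframes(frames)
    return n_frames

# Demo: 2 seconds of 16-bit mono silence at 8 kHz, keep the first second
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "full.wav")
    dst_path = os.path.join(tmp, "intro.wav")
    with wave.open(src_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 16000)
    demo_frames = extract_intro(src_path, dst_path, seconds=1)
    print(demo_frames)
```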

lucasgautheron commented 3 years ago

See https://github.com/LAAC-LSCP/tsimane2017-data/issues/2 for further discussion of that issue.