LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

[FEATURE] Compute age at the time of the recording #325

Closed MarvinLvn closed 2 years ago

MarvinLvn commented 2 years ago

We currently don't have access to the age information in recordings.csv. To compute it, we need 'dob' from children.csv and 'date_iso' from recordings.csv itself. It'd be particularly useful for longitudinal studies :)

Please ignore this if the feature already exists (in which case, I'm sorry, I missed it).

For Lyon, I did it myself, using:

import pandas as pd
from pathlib import Path
import numpy as np
import datetime as dt

# Read metadata files
metadata_path = Path("/scratch2/mlavechin/Longforms/FR/lyon/metadata")
children = pd.read_csv(metadata_path / 'children.csv')
recordings = pd.read_csv(metadata_path / 'recordings.csv')

# Import the dob column from children.csv and convert dob and date_iso to datetime
recordings = recordings.merge(children[['child_id', 'dob']])
recordings.dob = pd.to_datetime(recordings.dob)
recordings.date_iso = pd.to_datetime(recordings.date_iso)

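# Compute age in whole months from the day count (average month = 365.25 / 12 days),
# since recent pandas versions reject the ambiguous 'M' timedelta unit; then remove the dob column
recordings['age'] = ((recordings.date_iso - recordings.dob).dt.days / (365.25 / 12)).astype(int)
recordings.drop('dob', inplace=True, axis=1)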

EDIT: I also realized that you may not want to go in that direction, as the number of features to implement would be virtually infinite *_*

lucasgautheron commented 2 years ago

I think it would make sense to have such a feature.

However, pd.to_datetime will fail for dates that are too far back, e.g. 01/01/1000, because they fall outside the range of pandas' nanosecond-precision timestamps (roughly 1677 to 2262). You may ask why this is relevant. Well, in EL1000 for instance, some of the dates have been "anonymised" like that. This has the merit of making it obvious that the dates are not the real ones, but were made up to keep ages consistent while protecting the privacy of the participants. Using datetime.datetime.strptime instead solves the problem.

(saying this for whoever wants to implement this)
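
To illustrate the point about strptime, here is a minimal sketch that computes an age using only the standard library, so a made-up year such as 1000 parses without trouble. The helper name age_in_months and the "%Y-%m-%d" date format are assumptions for the example, not part of ChildProject.

```python
from datetime import datetime

# Assumed date format (ISO 8601, as the column name date_iso suggests).
DATE_FORMAT = "%Y-%m-%d"


def age_in_months(dob: str, recording_date: str, fmt: str = DATE_FORMAT) -> int:
    """Hypothetical helper: age in completed calendar months at the recording date."""
    birth = datetime.strptime(dob, fmt)
    recorded = datetime.strptime(recording_date, fmt)
    months = (recorded.year - birth.year) * 12 + (recorded.month - birth.month)
    # The current month only counts once the day of birth has been reached.
    if recorded.day < birth.day:
        months -= 1
    return months


# An "anonymised" date of birth far outside the nanosecond timestamp range.
print(age_in_months("1000-01-01", "1001-03-15"))  # 14
```

Counting calendar months avoids picking an average month length, at the cost of returning whole months only.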

Regarding the implementation, maybe ChildProject.projects.ChildProject should have a compute_ages method that returns a pd.Series with the computed age.

e.g.

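from ChildProject.projects import ChildProject

project = ChildProject('.')
project.read()
# one age per recording, so the result is stored in the recordings dataframe
project.recordings['age'] = project.compute_ages()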

but the function should also accept custom dataframes maybe:

project = ChildProject('.')
project.read()
children = project.children.copy()
recordings = project.recordings.copy()
# work on the dataframes...
# ...
# one age per recording, so the result is stored in the recordings dataframe
recordings['age'] = project.compute_ages(children=children, recordings=recordings)
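
For whoever picks this up, the logic could look roughly like the sketch below. It is written as a standalone function rather than a method, takes the two dataframes explicitly, and returns a pd.Series aligned with the recordings. The column names child_id, dob and date_iso come from this thread. The "%Y-%m-%d" format, the fractional-month convention (365.25 / 12 days per month) and the handling of missing values are assumptions, not the actual ChildProject behaviour.

```python
from datetime import datetime

import pandas as pd

DAYS_PER_MONTH = 365.25 / 12  # assumed average month length


def compute_ages(recordings: pd.DataFrame, children: pd.DataFrame,
                 date_format: str = "%Y-%m-%d") -> pd.Series:
    """Sketch: age in (fractional) months of the child at each recording."""
    # A left join keeps one row per recording, in the original order;
    # validate= raises if a child_id appears twice in children.
    merged = recordings[["child_id", "date_iso"]].merge(
        children[["child_id", "dob"]],
        how="left", on="child_id", validate="many_to_one",
    )
    merged.index = recordings.index  # align the result with recordings

    def months(row):
        if pd.isna(row["dob"]) or pd.isna(row["date_iso"]):
            return float("nan")  # unknown child or missing date
        # strptime handles years that pandas timestamps cannot represent
        dob = datetime.strptime(row["dob"], date_format)
        date = datetime.strptime(row["date_iso"], date_format)
        return (date - dob).days / DAYS_PER_MONTH

    return merged.apply(months, axis=1)


children = pd.DataFrame({"child_id": ["a", "b"],
                         "dob": ["1000-01-01", "2019-06-15"]})
recordings = pd.DataFrame({"child_id": ["a", "b", "a"],
                           "date_iso": ["1001-01-01", "2020-06-15", "1000-07-01"]})
recordings["age"] = compute_ages(recordings, children)
print(recordings)
```

Taking the dataframes as arguments keeps the computation usable on modified copies, as suggested above; a method could simply default to the project's own children and recordings.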