its importation validation

lucasgautheron commented 3 years ago

Before going too far with EL1000, and before we "release" our package, we need to cross-check our .its importation routine...

We'll use https://htanderson.github.io/ITSbin/index.html as a cross-check.

lucasgautheron commented 3 years ago

Q1.

In our package, lena_block_type is equal to the conversation_type for all segments that belong to a Conversation block, even non-human/non-speech segments, whereas the R package sets convType to NaN for these segments. Which way is better, @alecristia?

      convType       blkType spkr
0          NaN         Pause  TVF
1          NaN         Pause  SIL
2          NaN         Pause  TVF
3          NaN         Pause  SIL
4          NaN         Pause  NOF
5          NaN         Pause  SIL
6          NaN         Pause  TVF
7          NaN         Pause  NOF
8          NaN         Pause  SIL
9          NaN         Pause  NOF
...
21        AICF  Conversation  FAN
22         NaN  Conversation  NON
23         NaN  Conversation  OLN
24         NaN  Conversation  SIL
25        AICF  Conversation  FAN
26         NaN  Conversation  TVF
27        AICF  Conversation  FAN
28         NaN  Conversation  TVF
29        AICF  Conversation  FAN
30         NaN  Conversation  OLF
31         NaN  Conversation  NOF

lucasgautheron commented 3 years ago

Q2.

Currently the lists of cries, utterances and Vfxs (whatever that is) are each stored in their own column, as JSON, with the following format (e.g. for cries):

[{'startCry1': 8015.63, 'endCry1': 8016.25}, {'startCry2': 8016.77, 'endCry2': 8017.07}]

Is there any good reason why we should not do this instead?

[{'start': 8015.63, 'end': 8016.25}, {'start': 8016.77, 'end': 8017.07}]
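
As a rough illustration of the proposed format, the renaming could be sketched as below. This is only a sketch, not the package's actual code: `normalize_events` is a hypothetical helper, and it assumes the utterance and Vfx lists follow the same `start<Name><index>` / `end<Name><index>` key pattern as the cries shown above.

```python
import re

# Hypothetical helper, not part of the package's API.
# Assumed key pattern: "start" or "end", then a name (e.g. "Cry"), then an index.
_NUMBERED_KEY = re.compile(r"^(start|end)[A-Za-z]+\d+$")


def normalize_events(events):
    """Rename numbered keys such as 'startCry1' to plain 'start' / 'end'."""
    normalized = []
    for event in events:
        item = {}
        for key, value in event.items():
            match = _NUMBERED_KEY.match(key)
            # Keys that do not fit the assumed pattern are kept unchanged.
            item[match.group(1) if match else key] = value
        normalized.append(item)
    return normalized


cries = [
    {'startCry1': 8015.63, 'endCry1': 8016.25},
    {'startCry2': 8016.77, 'endCry2': 8017.07},
]
print(normalize_events(cries))
```

The position of each item in the list already encodes its index, so dropping the numbered suffix loses no information and every item can be read with the same two keys.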

alecristia commented 3 years ago

Q1.

It seems more reasonable to me to keep the conversation type for all segments in a conversation block (even if they are non-speech). I understand why the other package may set it to NaN for those segments (e.g. to facilitate summing word counts in blocks), but conceptually it's more sensible to keep block identity stable within the block.
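
To make the difference between the two conventions concrete, here is a minimal sketch of how a per-segment conversation type could be spread to every segment of its block. It is not taken from either package: `propagate_conversation_type` and the `block_id` field are hypothetical, and `None` stands in for NaN.

```python
def propagate_conversation_type(segments):
    """Give every segment the conversation type declared within its block."""
    # First pass: record the conversation type found in each block, if any.
    block_types = {}
    for seg in segments:
        if seg["convType"] is not None:
            block_types.setdefault(seg["block_id"], seg["convType"])
    # Second pass: copy each segment, filling convType from its block.
    # Blocks with no conversation type (e.g. pauses) stay at None.
    return [dict(seg, convType=block_types.get(seg["block_id"])) for seg in segments]


# "block_id" is an assumed grouping key; the other fields mirror the table in Q1.
segments = [
    {"block_id": 1, "blkType": "Pause", "spkr": "SIL", "convType": None},
    {"block_id": 2, "blkType": "Conversation", "spkr": "FAN", "convType": "AICF"},
    {"block_id": 2, "blkType": "Conversation", "spkr": "NON", "convType": None},
    {"block_id": 2, "blkType": "Conversation", "spkr": "OLN", "convType": None},
]
filled = propagate_conversation_type(segments)
print([seg["convType"] for seg in filled])
```

With the type kept on every segment, anyone who wants the other behaviour can still mask non-speech rows by speaker label when summing counts.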

alecristia commented 3 years ago

Q2.

No reason I can think of; in my view there are only advantages to your proposed notation.

LAAC-LSCP / ChildProject

its importation validation #137