LAAC-LSCP / laac-lscp.github.io

Lab book for LAAC team
https://laac-lscp.github.io/

Instructions feedback #2

Status: Open · opened by LoannPeurey 2 years ago

LoannPeurey commented 2 years ago

Add comments to this issue for every piece of feedback you would like to provide regarding the instructions in the handbook.

alecristia commented 2 years ago

perhaps this applies also to the warning below https://laac-lscp.github.io/docs/create-a-new-dataset.html#create-the-metadata-from-the-its-information

alecristia commented 2 years ago

I wonder: I can see the instructions for creating metadata from LENA its, and from scratch. What if I have a dataset with both? In the tsimanem2018 case, I'm going to opt for not splitting & not using the its data for the metadata, even though it would be nice to have accurate timestamps... but it seems like it'll be a tremendous amount of hassle. This will probably happen incredibly infrequently; perhaps tsimane2017 & 2018 are the only datasets in the world with mixed recordings (and LENA recordings are infrequent within those datasets).

alecristia commented 2 years ago

update https://childproject.readthedocs.io/en/latest/processors.html#basic-audio-conversion so that the examples match our current use, namely:

alecristia commented 2 years ago

I may be missing this, but where are the actual instructions for its import? In https://laac-lscp.github.io/docs/create-a-new-dataset.html#lena-its I see how to fix the timeline if you split recs, but if I don't, then is my script basically:

"""
This script uses metadata/recordings.csv to import its annotations. It assumes recordings.csv contains columns 'its_filename' and 'duration' to run the importation
"""
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

dataset_path = "."

# load the project and annotation manager
project = ChildProject(dataset_path)
am = AnnotationManager(project)

# copy the dataset's recordings.csv; this suits us because we have one
# import per recording, as is usually the case with automated annotations
input_frame = pd.DataFrame.copy(project.recordings)
input_frame = input_frame.sort_values(['its_filename', 'recording_filename'])

# make sure the duration of each recording is set in recordings.csv;
# it should have been imported with the metadata of the its
input_frame["raw_filename"] = input_frame['its_filename']
input_frame["set"] = 'its'
input_frame["range_onset"] = 0  # from the start of the audio...
input_frame["range_offset"] = input_frame["duration"]  # ...to the end

# AC commented out the following two lines for cases in which no split was done
#for its, df in input_frame.groupby('its_filename'):
#    input_frame.loc[df.index,'time_seek'] = - df['duration'].cumsum().shift(1, fill_value = 0)

am.import_annotations(input_frame)
alecristia commented 2 years ago

I went ahead and ran that, and got this error:

Traceback (most recent call last):
  File "scripts/import_its.py", line 28, in <module>
    am.import_annotations(input_frame)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 521, in import_annotations
    assert_columns_presence("input", input, required_columns)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/tables.py", line 31, in assert_columns_presence
    raise MissingColumnsException(name, missing)
ChildProject.tables.MissingColumnsException: dataframe input misses the following required columns: time_seek

Note that my script has those two lines commented out, but it seems import_annotations requires time_seek (assert_columns_presence("input", input, required_columns)). So that means that probably somewhere in validation of the dataset, we need to tell users that it's an error if they're missing this column. And I guess we should tell people to add it somewhere.

I'm not sure where though -- annotations.csv in my case is empty & has that column...

alecristia commented 2 years ago

it would be nice to have an idea of how long processes last -- I know this is super dependent on hardware, but in my case, I was surprised that the tsimanem2018 data (which has only 350 files, from max 80 long-form recs) is taking over 20 mins to be standardized in the cluster... So perhaps we can give a couple of examples, so that people can plan? say one in which you only have 1 CPU, 1 16h rec, another with 4 CPUs 100 recs... just a thought!

LoannPeurey commented 2 years ago

I may be missing this, but where are the actual instructions for its import? In https://laac-lscp.github.io/docs/create-a-new-dataset.html#lena-its I see how to fix the timeline if you split recs, but if I don't, then is my script basically:

"""
This script uses metadata/recordings.csv to import its annotations. It assumes recordings.csv contains columns 'its_filename' and 'duration' to run the importation
"""
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager

dataset_path = "."

# load the project and annotation manager
project = ChildProject(dataset_path)
am = AnnotationManager(project)

# copy the dataset's recordings.csv; this suits us because we have one
# import per recording, as is usually the case with automated annotations
input_frame = pd.DataFrame.copy(project.recordings)
input_frame = input_frame.sort_values(['its_filename', 'recording_filename'])

# make sure the duration of each recording is set in recordings.csv;
# it should have been imported with the metadata of the its
input_frame["raw_filename"] = input_frame['its_filename']
input_frame["set"] = 'its'
input_frame["range_onset"] = 0  # from the start of the audio...
input_frame["range_offset"] = input_frame["duration"]  # ...to the end

# AC commented out the following two lines for cases in which no split was done
#for its, df in input_frame.groupby('its_filename'):
#    input_frame.loc[df.index,'time_seek'] = - df['duration'].cumsum().shift(1, fill_value = 0)

am.import_annotations(input_frame)

I think you actually should not comment out those two lines: they take care of adding time_seek when the recordings are split, but also when they are not. In that case it simply always gives a time_seek of 0, which is what we want.
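The cumsum/shift trick can be sanity-checked on a toy table. This is a hypothetical example (file names and durations invented), not data from the corpus:

```python
import pandas as pd

# Toy recordings table: one its file split into two parts, one unsplit.
input_frame = pd.DataFrame({
    'its_filename': ['a.its', 'a.its', 'b.its'],
    'recording_filename': ['a_part1.wav', 'a_part2.wav', 'b.wav'],
    'duration': [3600000, 1800000, 7200000],  # milliseconds
})

# Same computation as in the script: each part's time_seek is minus the
# cumulative duration of the earlier parts of the same its file.
for its, df in input_frame.groupby('its_filename'):
    input_frame.loc[df.index, 'time_seek'] = -df['duration'].cumsum().shift(1, fill_value=0)

print(input_frame['time_seek'].tolist())  # [0.0, -3600000.0, 0.0]
```

The unsplit recording (`b.its`) gets a time_seek of 0, so the same code covers both cases.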

alecristia commented 2 years ago

that helped! but I get a new error:

Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'format'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/import_its.py", line 27, in <module>
    am.import_annotations(input_frame)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 540, in import_annotations
    builtin = input[input["format"].isin(converters.keys())]
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'format'

LoannPeurey commented 2 years ago

The obvious answer would be to add

input_frame['format'] = 'its'

But that surprises me because some importations were running fine without it. I'll take a deeper look soon.

LoannPeurey commented 2 years ago

I wonder: I can see the instructions for creating metadata from LENA its, and from scratch. What if I have a dataset with both? In the tsimanem2018 case, I'm going to opt for not splitting & not using the its data for the metadata, even though it would be nice to have accurate timestamps... but it seems like it'll be a tremendous amount of hassle. This will probably happen incredibly infrequently; perhaps tsimane2017 & 2018 are the only datasets in the world with mixed recordings (and LENA recordings are infrequent within those datasets).

In those cases, I guess the easiest way is to create 2 versions of the metadata files, one generated from the its files and one from scratch, and then merge them. Maybe we should mention that possibility.
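A merge of the two partial metadata files could look like this. This is a sketch only, with invented rows and columns, assuming both partial tables follow the recordings.csv layout (in practice the two frames would come from pd.read_csv on the two partial files):

```python
import pandas as pd

# Invented partial metadata: one table generated from the .its files,
# one written from scratch for the non-LENA recordings.
its_meta = pd.DataFrame({
    'recording_filename': ['rec_lena_01.wav'],
    'its_filename': ['e2018_01.its'],
    'duration': [57600000],
})
manual_meta = pd.DataFrame({
    'recording_filename': ['rec_other_01.wav'],
    'duration': [14400000],
})

# Stack them into one recordings table; columns that exist on only one
# side (here its_filename) become NaN for the other side's rows.
merged = pd.concat([its_meta, manual_meta], ignore_index=True, sort=False)

# A recording should only be described once.
assert not merged['recording_filename'].duplicated().any()

# merged.to_csv('metadata/recordings.csv', index=False)
```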

alecristia commented 2 years ago
input_frame['format'] = 'its'

thanks! that got rid of that error, but a triad of new ones emerged. They look interesting:

type 1: trying to import an NA its

an error occured while processing './annotations/its/raw/NA'
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/NA': failed to load external entity "./annotations/its/raw/NA"

type 2: NA error despite there being an actual its - oh wait, type 2 and 3 refer to the same file!

an error occured while processing './annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its'
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/NA': failed to load external entity "./annotations/its/raw/NA"

Type 3: path repeated

Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its': failed to load external entity "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its"

In this last one, notice the path in ... failed to load external entity "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its"

alecristia commented 2 years ago

full error log: https://gin.g-node.org/LAAC-LSCP/tsimanem2018/src/main/extra/fullerror.txt

LoannPeurey commented 2 years ago

This happens when some of the recordings don't have an its -- so if you have both LENA and non-LENA recordings in the same corpus. Then, the lines of recordings that don't have an its_filename should be removed:

input_frame.dropna(subset=['its_filename'], inplace=True)

Alternatively, if the NA is being treated as text rather than nan, then do:

input_frame = input_frame[input_frame.its_filename != "NA"]
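Both situations can be handled in one pass; here is a small self-contained sketch with invented rows:

```python
import pandas as pd

# Toy table mixing the two cases: a real NaN and a literal "NA" string.
input_frame = pd.DataFrame({
    'recording_filename': ['r1.wav', 'r2.wav', 'r3.wav'],
    'its_filename': ['e1.its', None, 'NA'],
})

# Drop rows where its_filename is a true NaN OR the literal string "NA".
input_frame = input_frame.dropna(subset=['its_filename'])
input_frame = input_frame[input_frame['its_filename'] != 'NA']

print(input_frame['recording_filename'].tolist())  # ['r1.wav']
```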
alecristia commented 2 years ago

and the path in "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its" -- came from actually writing "annotations/raw/its/" in the its_filename column

alecristia commented 2 years ago

now skipping to ALICE instructions: missing / in "in a scratch2/username subdirectory"

alecristia commented 2 years ago

I wonder if we should reorganize the content so people understand they can either run alice alone, alice+vtc on a single file, or alice+vtc on multiple audios; after all three, the "remarks" apply to all of them, and so does "You can then submit your job to slurm."

alecristia commented 2 years ago

"check the log file of the job to check its progression and possible errore (see the troubleshooting section right below)." typo, but also let's explain that the log file will be called alice + name of the job, and that they can check their job progress by doing squeue

alecristia commented 2 years ago

can it ever happen that a job doesn't get a log? I got # 239437 but it doesn't appear in squeue, and there is no log...

LoannPeurey commented 2 years ago

I wonder if we should reorganize the content so people understand they can either run alice alone, alice+vtc on a single file, or alice+vtc on multiple audios; after all three, the "remarks" apply to all of them, and so does "You can then submit your job to slurm."

I'd love suggestions to make it more understandable, not sure how to change it

LoannPeurey commented 2 years ago

can it ever happen that a job doesn't get a log? I got # 239437 but it doesn't appear in squeue, and there is no log...

It should never happen; sometimes the log just takes a bit of time to be generated.

alecristia commented 1 year ago

could we please set the output of validate to act more like what datalad does, where it says "and XX more warnings like it"? So e.g. instead of printing 100 lines about recordings not having a standard version, it would print it for 1 file and tell you that there are 99 more lines like it. Presumably you'd then apply one fix that fixes all of them at once!
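A datalad-style summary could group warnings on a normalized template. This is a rough sketch, not ChildProject's actual validate code, and the normalization regex is an assumption (real messages may need more rules):

```python
import re
from collections import OrderedDict

# Invented validate-style warnings for illustration.
warnings = [
    "recording rec001.wav has no standard version",
    "recording rec002.wav has no standard version",
    "recording rec003.wav has no standard version",
    "column 'duration' is missing",
]

# Collapse file-specific details so identical warning kinds group together.
def template(msg):
    return re.sub(r"\S+\.(wav|its|csv)", "<file>", msg)

groups = OrderedDict()
for msg in warnings:
    groups.setdefault(template(msg), []).append(msg)

# Print one representative per group plus a count of the rest.
for msgs in groups.values():
    print(msgs[0])
    if len(msgs) > 1:
        print(f"  ... and {len(msgs) - 1} more warnings like it")
```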

alecristia commented 1 year ago

could we add in the team's handbook that we prefer people cloning from oberon? Should we spell out there what to do for the small number of people in the team (camila & I) who are dataset curators, that they should have edit access on scratch1?

alecristia commented 1 year ago

might it not be simpler to replace https://laac-lscp.github.io/docs/running-models.html#check-that-audio-files-content-is-present & the following section with the instruction to validate the dataset, saying "if you get 'warning, standard not present', then convert the audio"? Particularly because when I did ls rec/raw I got a list of folders, so then we'd have to explain how to check down the tree...

alecristia commented 1 year ago

"VTC IS CURRENTLY BEING REWORKED BY THE COML TEAM SO A NEW VERSION (AND EASIER WAY TO RUN THE MODEL) SHOULD SOON REPLACE THIS TUTORIAL." we can remove this... to my knowledge, since we're always going to be using pre-trained VTC, this is irrelevant to us

alecristia commented 1 year ago

"TO EDIT IN THE SCRIPT:" <-- add "BELOW" (or even a link to the script, if we can manage that). I can imagine certain people getting very annoyed "which script??" Also, perhaps consider linking "the path of your brand new environment" with a backlink to the section explaining how to create it. I'd remove: "profile : if had to convert your audios, put converted/standard or the name of the profile to use, otherwise keep the default raw" and set the default to be standard -- Oh wait I see! shouldn't we simplify and always tell people to convert? honestly, even if their recs are already 16k, it seems reasonable to standardize also the encoder, bit rate, etc.

alecristia commented 1 year ago

hmm sorry, back-tracking:

We can’t run the model directly on oberon as we need more ressources, so we will launch a slurm job on oberon. You can use on of these script templates as a good starting point that you can then customize to your needs. Save the file in your ALICE folder as job-alice.sh for example and modify it for how you want to run alice.

ressources -> resources
You can use on of these script templates -> You can use one of these script templates

In that text, add a link to "these script templates". Split the sentence to introduce a slurm job as something the reader may not know, and tell them that the link goes to the coml internal wiki, to which they can log in with their oberon credentials.

alecristia commented 1 year ago

sorry, a little earlier when installing ALICE: cd /scratch2/username/modules --> cd /scratch2/username/shared/modules right?

alecristia commented 1 year ago

is there any way that in this ALICE section, we can make the user choose their path, the way you did for windows/mac/linux, so that they only see the instructions for their chosen path (alice+vtc; alice alone after vtc with an rttm/file, alice alone after multi-file vtc)?

Also, my guess is that we would help by giving them the 2 errors they can get, and why they are:

alecristia commented 1 year ago

the next one is trickier. I got "cannot access XXXX/converted/standard/.wav" but this was because wavs are inside folders. Shall we make `files=($(ls ${audio_path}/.wav))` recursive? or instruct readers to fix the path? My preference is for the former, unless you can see issues arising from that?

alecristia commented 1 year ago

i suggest that the job-alice.sh is done in a shell, so people can close their window even when it's not finished running

alecristia commented 1 year ago

the next one is trickier. I got "cannot access XXXX/converted/standard/.wav" but this was because wavs are inside folders. Shall we make `files=($(ls ${audio_path}/.wav))` recursive? or instruct readers to fix the path? My preference is for the former, unless you can see issues arising from that?

specifically like so:

files=($(find "${audio_path}" -name "*.wav"))  # list wav files under audio_path, recursively
alecristia commented 1 year ago

I just installed my environment & could launch job-alice.sh to do both vtc and alice. I got an error, where the end of the traceback is:

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /scratch2/acristia/shared/modules/ALICE/conda-alice-env/lib/python3.6/site-packages/google/protobuf/pyext/_message.cpython-36m-x86_64-linux-gnu.so)
Something went wrong when applying the model
Aborting.

I tried to fix that with conda install libgcc, but it didn't work; strings /lib64/libstdc++.so.6 | grep GLIBCXX still shows me that 3.4.21 is not where the system is expecting it.

We should also change error management, because I got:

cp: cannot stat '/scratch2/acristia/shared/modules/ALICE/output_voice_type_classifier/tmp_data/all.rttm': No such file or directory
ALICE completed. Results written to /scratch2/acristia/shared/modules/ALICE/ALICE_output.txt and /scratch2/acristia/shared/modules/ALICE/diarization_output.rttm.
mv: cannot stat 'diarization_output.rttm': No such file or directory
REC001 finished, 1/440 files

The last line suggests the file was done, when in fact it wasn't... Not sure how to catch this?

LoannPeurey commented 1 year ago

ImportError: we have seen this one before; the easiest way is to install the new version of libgcc in conda (like you seem to have done) and then use export LD_LIBRARY_PATH=/MY_CONDA_PATH/lib:$LD_LIBRARY_PATH where MY_CONDA_PATH is the path to the conda environment where libgcc is installed.

The error management is not great but is very much linked to the ALICE launching script, so to edit it, we will need to dive into the ALICE repo quite substantially.

LoannPeurey commented 1 year ago

might it not be simpler to replace https://laac-lscp.github.io/docs/running-models.html#check-that-audio-files-content-is-present & the following section with the instruction to validate the dataset, saying "if you get 'warning, standard not present', then convert the audio"? Particularly because when I did ls rec/raw I got a list of folders, so then we'd have to explain how to check down the tree...

I think this was written before we had a check for the standard sampling rate in validate, so that makes sense. And indeed the command needs to be slightly edited to recurse down the tree (changing p=recording/raw/* would do it here, I think). So this instruction should be replaced.
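The difference between a flat listing and a recursive one can be illustrated with pathlib; a self-contained sketch using a throwaway directory (the layout is invented):

```python
import tempfile
from pathlib import Path

# Build a hypothetical layout: <root>/rec000.wav plus <root>/child1/rec001.wav
root = Path(tempfile.mkdtemp())
(root / "child1").mkdir()
(root / "child1" / "rec001.wav").touch()
(root / "rec000.wav").touch()

# A plain glob("*.wav") only sees the top level; rglob descends into subfolders.
top_level = sorted(p.name for p in root.glob("*.wav"))
recursive = sorted(p.name for p in root.rglob("*.wav"))
print(top_level)   # ['rec000.wav']
print(recursive)   # ['rec000.wav', 'rec001.wav']
```

This mirrors why an `ls recordings/raw/*.wav` style check misses audios nested in per-child subfolders.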

LoannPeurey commented 1 year ago

in this section, we're missing the actual command: bash job-alice.sh

i suggest that the job-alice.sh is done in a shell, so people can close their window even when it's not finished running

It should be sbatch job-alice.sh, and submitting the job that way allows them to close the window even when it is not finished. If you used bash and are not on a compute node (puck1/4/5), it should fail when finding no gpu.

LoannPeurey commented 1 year ago

Thank you for all the comments!! I will put in place all of the ones that improve clarity. For a lot of the comments, I think they shed light on the limitations we have:

  • Bash scripts are not super flexible; they limit our ability to use childproject's built-in functions to select the audios, put the output in the right place right away, etc.
  • Using childproject functions to compensate for that would help, BUT it would render this guide completely useless for running vtc/alice/vcm on data not formatted into childproject.

I am not sure what approach we should take. Bear in mind that the shell scripts can still be improved by, for example, accepting runtime arguments instead of having to be modified. But recursing into directories and listing all wavs, for example, would lead to unindexed and discarded files (not considered by CP) still being analyzed by the model, which is suboptimal.

alecristia commented 1 year ago

Thank you for all the comments!! I will put in place all of the ones that improve clarity. For a lot of the comments, I think they shed light on the limitations we have:

  • Bash scripts are not super flexible; they limit our ability to use childproject's built-in functions to select the audios, put the output in the right place right away, etc.
  • Using childproject functions to compensate for that would help, BUT it would render this guide completely useless for running vtc/alice/vcm on data not formatted into childproject.

I am not sure what approach we should take. Bear in mind that the shell scripts can still be improved by, for example, accepting runtime arguments instead of having to be modified. But recursing into directories and listing all wavs, for example, would lead to unindexed and discarded files (not considered by CP) still being analyzed by the model, which is suboptimal.

Anything that forces us to use childproject well may feel annoying, but it's ultimately for our own good. I'll also add that I like having a script rather than accepting arguments, because the script at least leaves a record of what you did.

Not sure these reactions help in your decision-making!

alecristia commented 1 year ago

"find in your output_path your alice annotations" --> "find in your output_path (the dataset_path you modified in the script followed by annotations/alice your alice annotations"

alecristia commented 1 year ago

maybe this suggestion is rather for the childproject installation instructions, but perhaps the section that creates the env should end by deleting env.yml?

alecristia commented 1 year ago

same: "If you which to continue using directly Windows, you must do the following:" --> "If you wish to continue using directly Windows, you must do the following:" (but the wording is a bit mean -- so perhaps: If you nonetheless want to try using ChildProject within Windows, please do the following)

alecristia commented 1 year ago

I'm trying to install the childproject env on one of the 3 computers I work with.

when trying to create the childproject environment I got:

gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/alejandrinacristia/miniconda3/envs/childproject/include -arch x86_64 -I/Users/alejandrinacristia/miniconda3/envs/childproject/include -arch x86_64 -Ilibsoxr/src -Isrc/soxr -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/Users/alejandrinacristia/miniconda3/envs/childproject/include/python3.7m -c libsoxr/src/cr.c -o build/temp.macosx-10.9-x86_64-cpython-37/libsoxr/src/cr.o -DSOXR_LIB
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for soxr
ERROR: Could not build wheels for soxr, which is required to install pyproject.toml-based projects

My guess is that this has to do with my Xcode being outdated, so I'll try updating it, then.

alecristia commented 1 year ago

the EL1000 package is mentioned here, which is great. Consider also adding it when discussing data importation, since that's the context I was hoping to use it in.

Then the one line in the readme of the el1000 package should have the address changed: EL1000 --> LAAC-LSCP

LoannPeurey commented 1 year ago

Scripts are included as text in the page => it is not always clear that they need to be copied and launched; clarify how to use them.