Open LoannPeurey opened 2 years ago
- cosmetic : https://laac-lscp.github.io/docs/create-a-new-dataset.html#audio-files : I suspect the text following the word WARNING was intended to be put inside the same yellow box
perhaps this also applies to the warning below https://laac-lscp.github.io/docs/create-a-new-dataset.html#create-the-metadata-from-the-its-information
I wonder: I can see the instructions for creating metadata from LENA its, and from scratch. What if I have a dataset with both? In the tsimane2018 case, I'm going to opt for not splitting & not using the its data for the metadata, even though it would be nice to have accurate timestamps... but it seems like it'll be a tremendous amount of hassle. This will probably happen incredibly infrequently - perhaps tsimane2017 & 2018 are the only datasets in the world with mixed recordings (and LENA recordings are infrequent within those datasets)
update https://childproject.readthedocs.io/en/latest/processors.html#basic-audio-conversion so that the examples match our current use, namely:
I may be missing this, but where are the actual instructions for its import? In https://laac-lscp.github.io/docs/create-a-new-dataset.html#lena-its I see how to fix the timeline if you split recs, but if I don't, then is my script basically:
"""
This script uses metadata/recordings.csv to import its annotations. It assumes recordings.csv contains columns 'its_filename' and 'duration' to run the importation
"""
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
dataset_path = "."
#load the project and annotation manager
project = ChildProject(dataset_path)
am = AnnotationManager(project)
# we take a copy of the recordings.csv file of the dataset, that suits us because we have one importation per recording, as is usually the case with automated annotations
input_frame = pd.DataFrame.copy(project.recordings)
input_frame = input_frame.sort_values(['its_filename', 'recording_filename'])
#make sure that the duration for the recordings is set in recordings.csv, it should be imported with the metadata of the its
input_frame["raw_filename"]= input_frame['its_filename']
input_frame["set"] = 'its'
input_frame["range_onset"] = "0" #from the start of the audio...
input_frame["range_offset"]= input_frame["duration"] # ...to the end
#AC commented the following two lines for cases in which no split has been done
#for its, df in input_frame.groupby('its_filename'):
# input_frame.loc[df.index,'time_seek'] = - df['duration'].cumsum().shift(1, fill_value = 0)
am.import_annotations(input_frame)
I went ahead and ran that, and got this error:
```
Traceback (most recent call last):
  File "scripts/import_its.py", line 28, in <module>
    am.import_annotations(input_frame)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 521, in import_annotations
    assert_columns_presence("input", input, required_columns)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/tables.py", line 31, in assert_columns_presence
    raise MissingColumnsException(name, missing)
ChildProject.tables.MissingColumnsException: dataframe input misses the following required columns: time_seek
```
Note that my script has those two lines commented out, but it seems import_annotations requires time_seek (`assert_columns_presence("input", input, required_columns)`). So that means that probably somewhere in the validation of the dataset, we need to tell users that it's an error if they're missing this column. And I guess we should tell people to add it somewhere.
I'm not sure where though -- annotations.csv in my case is empty & has that column...
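For the unsplit case, a minimal workaround (just a sketch; it only assumes `import_annotations` wants the column present, as the assertion in the traceback suggests) would be to set the column to a constant 0, since every annotation then starts at the beginning of its own audio:

```python
import pandas as pd

# Toy stand-in for the copied recordings.csv (file names here are made up)
input_frame = pd.DataFrame({
    "recording_filename": ["REC001.wav"],
    "its_filename": ["e20180716_170035_013097.its"],
    "duration": [3600000],
})

# Unsplit recordings: annotations align with the start of the audio,
# so a constant time_seek of 0 satisfies the required-columns check
input_frame["time_seek"] = 0
```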
it would be nice to have an idea of how long processes last -- I know this is super dependent on hardware, but in my case, I was surprised that the tsimane2018 data (which has only 350 files, from at most 80 long-form recs) is taking over 20 mins to be standardized on the cluster... So perhaps we can give a couple of examples so that people can plan? Say one in which you only have 1 CPU and 1 16h rec, another with 4 CPUs and 100 recs... just a thought!
> I may be missing this, but where are the actual instructions for its import? In https://laac-lscp.github.io/docs/create-a-new-dataset.html#lena-its I see how to fix the timeline if you split recs, but if I don't, then is my script basically: [script quoted above]
I think you actually should not comment those two lines; they take care of adding time_seek when the recordings are split, but also when they are not: it will simply always give a time_seek of 0, which is what we want.
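A quick way to convince oneself (toy frame, made-up file names and durations):

```python
import pandas as pd

# Toy frame: 'a.its' was split into two recordings, 'b.its' was not
df = pd.DataFrame({
    "its_filename": ["a.its", "a.its", "b.its"],
    "duration": [3600000, 1800000, 7200000],
})

# Same two lines as in the script above: each recording is shifted back by
# the cumulative duration of the recordings preceding it in the same its file
for its, g in df.groupby("its_filename"):
    df.loc[g.index, "time_seek"] = -g["duration"].cumsum().shift(1, fill_value=0)

print(df["time_seek"].tolist())  # [0.0, -3600000.0, 0.0]
```

The unsplit recording (`b.its`) gets a time_seek of 0, exactly as in the unsplit-only case.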
that helped! but I get a new error:
```
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'format'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/import_its.py", line 27, in <module>
    am.import_annotations(input_frame)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 540, in import_annotations
    builtin = input[input["format"].isin(converters.keys())]
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'format'
```
the obvious answer will be to add

```python
input_frame['format'] = 'its'
```

But that surprises me because some importations were running fine without it. I'll take a deeper look soon
> I wonder: I can see the instructions for creating metadata from LENA its, and from scratch. What if I have a dataset with both? In the tsimane2018 case, I'm going to opt for not splitting & not using the its data for the metadata, even though it would be nice to have accurate timestamps... but it seems like it'll be a tremendous amount of hassle. This will probably happen incredibly infrequently - perhaps tsimane2017 & 2018 are the only datasets in the world with mixed recordings (and LENA recordings are infrequent within those datasets)
In those cases, I guess the easiest way is to create 2 versions of the metadata files, one generated from the its files and one from scratch, and then merge them. Maybe we should mention that possibility
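That merge could be as simple as concatenating the two frames (a sketch; the column names are the ones recordings.csv already uses, the file names are made up):

```python
import pandas as pd

# One metadata frame generated from the its files...
its_meta = pd.DataFrame({
    "recording_filename": ["lena1.wav"],
    "its_filename": ["lena1.its"],
    "duration": [3600000],
})
# ...and one written from scratch for the non-LENA recordings
manual_meta = pd.DataFrame({
    "recording_filename": ["olympus1.wav"],
    "its_filename": [None],
    "duration": [7200000],
})

# Stack them into a single recordings table; non-LENA rows simply
# carry an empty its_filename
recordings = pd.concat([its_meta, manual_meta], ignore_index=True)
```

One would then review the result and save it as metadata/recordings.csv.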
> `input_frame['format'] = 'its'`
thanks! that got rid of that error, but a triad of new ones emerged. They look interesting:
an error occured while processing './annotations/its/raw/NA'

```
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/NA': failed to load external entity "./annotations/its/raw/NA"
```

an error occured while processing './annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its'

```
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/NA': failed to load external entity "./annotations/its/raw/NA"
```

```
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/annotations.py", line 437, in _import_annotation
    df = converter.convert(path, filter, **params)
  File "/scratch1/home/acristia/ChildProjectVenv/lib/python3.7/site-packages/ChildProject/converters.py", line 265, in convert
    xml = etree.parse(filename)
  File "src/lxml/etree.pyx", line 3538, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1902, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 652, in lxml.etree._raiseParseError
OSError: Error reading file './annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its': failed to load external entity "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its"
```
In this last one, notice the path in ... failed to load external entity "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its"
This happens when some of the recordings don't have an its -- so if you have both LENA and non-LENA recordings in the same corpus. Then, the lines of recordings that don't have an its_filename should be removed:
```python
input_frame.dropna(subset=['its_filename'], inplace=True)
```
Alternatively, if the NA is being treated as text rather than nan, then do:
```python
input_frame = input_frame[input_frame.its_filename != "NA"]
```
and the path in "./annotations/its/raw/annotations/raw/its/e20180716_170035_013097.its" -- came from actually writing "annotations/raw/its/" in the its_filename column
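Both variants can be folded into a single filter that covers real NaN as well as the literal string "NA" (a sketch on a toy frame; the file names are made up):

```python
import pandas as pd

input_frame = pd.DataFrame({
    "recording_filename": ["r1.wav", "r2.wav", "r3.wav"],
    "its_filename": ["e20180716_170035_013097.its", None, "NA"],
})

# Keep only recordings that actually have an its file, whether the missing
# values are real NaN or the literal string "NA"
has_its = input_frame["its_filename"].notna() & (input_frame["its_filename"] != "NA")
input_frame = input_frame[has_its]

print(input_frame["recording_filename"].tolist())  # ['r1.wav']
```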
now skipping to the ALICE instructions: missing "/" in "in a scratch2/username subdirectory" (it should read /scratch2/username)
I wonder if we should reorganize the content so people understand they can either run alice alone, alice+vtc on a single file, or alice+vtc on multiple audios; after all three, the "remarks" apply to all of them, and so does "You can then submit your job to slurm."
"check the log file of the job to check its progression and possible errore (see the troubleshooting section right below)." typo, but also let's explain that the log file will be called alice + name of the job, and that they can check their job progress by doing squeue
can it ever happen that a job doesn't get a log? I got # 239437 but it doesn't appear in squeue, and there is no log...
> I wonder if we should reorganize the content so people understand they can either run alice alone, alice+vtc on a single file, or alice+vtc on multiple audios; after all three, the "remarks" apply to all of them, and so does "You can then submit your job to slurm."
I'd love suggestions to make it more understandable, not sure how to change it
> can it ever happen that a job doesn't get a log? I got # 239437 but it doesn't appear in squeue, and there is no log...
It should never happen; sometimes the log just takes a bit of time to be generated
could we please set the output of validate to act more like what datalad does, where it says "and XX more warnings like it"? So e.g. instead of printing out 100 lines about recordings not having a standard version, it would print it for 1 file and tell you that there are 99 more lines like it. Presumably then you'll do a fix that fixes all of them at once!
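Something like this could do it (a standalone sketch, not ChildProject code; the `summarize` helper and the file-name regex are made up):

```python
import re
from collections import Counter

def summarize(messages, pattern=r"\S+\.(?:wav|its)"):
    """Collapse repeated messages that differ only by file name."""
    # Group messages by their template, with file names masked out
    groups = Counter(re.sub(pattern, "<file>", m) for m in messages)
    out = []
    for template, n in groups.items():
        out.append(template)
        if n > 1:
            out.append(f"... and {n - 1} more warnings like it")
    return out

warnings = [
    "recording a.wav has no standard version",
    "recording b.wav has no standard version",
    "recording c.wav has no standard version",
]
print(summarize(warnings))
# ['recording <file> has no standard version', '... and 2 more warnings like it']
```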
could we add in the team's handbook that we prefer people cloning from oberon? Should we spell out there what to do for the small number of people in the team (Camila & me) who are dataset curators, namely that they should have edit access on scratch1?
might it not be simpler to replace https://laac-lscp.github.io/docs/running-models.html#check-that-audio-files-content-is-present & the following section with the instruction to validate the dataset, and saying "if you get "warning, standard not present" then convert the audio".
Particularly bc when I did `ls rec/raw` I got a list of folders, so then we'd have to explain how to check down the tree...
"VTC IS CURRENTLY BEING REWORKED BY THE COML TEAM SO A NEW VERSION (AND EASIER WAY TO RUN THE MODEL) SHOULD SOON REPLACE THIS TUTORIAL." we can remove this... to my knowledge, since we're always going to be using pre-trained VTC, this is irrelevant to us
"TO EDIT IN THE SCRIPT:" <-- add "BELOW" (or even a link to the script, if we can manage that). I can imagine certain people getting very annoyed "which script??" Also, perhaps consider linking "the path of your brand new environment" with a backlink to the section explaining how to create it. I'd remove: "profile : if had to convert your audios, put converted/standard or the name of the profile to use, otherwise keep the default raw" and set the default to be standard -- Oh wait I see! shouldn't we simplify and always tell people to convert? honestly, even if their recs are already 16k, it seems reasonable to standardize also the encoder, bit rate, etc.
hmm sorry, back-tracking:
> We can’t run the model directly on oberon as we need more ressources, so we will launch a slurm job on oberon. You can use on of these script templates as a good starting point that you can then customize to your needs. Save the file in your ALICE folder as job-alice.sh for example and modify it for how you want to run alice.
- ressources -> resources
- "You can use on of these script templates" -> "You can use one of these script templates"
in that text, add a link to "these script templates". Split the sentence to introduce a slurm job as something the reader may not know, and tell them that the link goes to the CoML internal wiki, to which they can log in with their oberon creds.
sorry, a little earlier when installing ALICE: `cd /scratch2/username/modules` --> `cd /scratch2/username/shared/modules`, right?
is there any way that in this ALICE section, we can make the user choose their path, the way you did for windows/mac/linux, so that they only see the instructions for their chosen path (alice+vtc; alice alone after vtc with an rttm/file, alice alone after multi-file vtc)?
Also, my guess is that we would help by giving them the 2 errors they can get, and explaining why they arise:
the next one is trickier. I got "cannot access XXXX/converted/standard/*.wav" but this was because the wavs are inside folders. Shall we make `files=($(ls ${audio_path}/*.wav))` recursive? or instruct readers to fix the path? My preference is for the former, unless you can see issues arising from that?
I suggest that the job-alice.sh is done in a shell, so people can close their window even when it's not finished running
> the next one is trickier. I got "cannot access XXXX/converted/standard/*.wav" but this was because the wavs are inside folders. Shall we make `files=($(ls ${audio_path}/*.wav))` recursive? or instruct readers to fix the path? My preference is for the former, unless you can see issues arising from that?
specifically like so:
```shell
files=($(find ${audio_path} -name "*.wav"))  # list wav files in the audio_path
```
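To see the difference with the ls-based version, here is the find-based line run on a throwaway tree (the demo directories and file names are invented just so the snippet runs on its own; only the `files=` line is the actual change):

```shell
# Build a small demo tree with one top-level wav and one nested wav
audio_path=$(mktemp -d)
mkdir -p "${audio_path}/REC001"
touch "${audio_path}/top.wav" "${audio_path}/REC001/nested.wav"

# Recursive listing: picks up wavs inside subfolders too
files=($(find ${audio_path} -name "*.wav"))
echo "${#files[@]}"  # 2
```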
I just installed my environment & could launch job-alice.sh to do both vtc and alice. I got an error, where the end of the traceback is:
```
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /scratch2/acristia/shared/modules/ALICE/conda-alice-env/lib/python3.6/site-packages/google/protobuf/pyext/_message.cpython-36m-x86_64-linux-gnu.so)
Something went wrong when applying the model
Aborting.
```
I tried to fix that with `conda install libgcc`, but it didn't work: `strings /lib64/libstdc++.so.6 | grep GLIBCXX` still shows me that 3.4.21 is not where the system is expecting it.
We should also change the error management, because I got:

```
cp: cannot stat '/scratch2/acristia/shared/modules/ALICE/output_voice_type_classifier/tmp_data/all.rttm': No such file or directory
ALICE completed. Results written to /scratch2/acristia/shared/modules/ALICE/ALICE_output.txt and /scratch2/acristia/shared/modules/ALICE/diarization_output.rttm.
mv: cannot stat 'diarization_output.rttm': No such file or directory
REC001 finished, 1/440 files
```

The last line suggests the file was done, when in fact it wasn't... Not sure how to catch this?
ImportError: we have seen this one before; the easiest way is to install the new version of libgcc in conda (like you seem to have done) and then use this line: `export LD_LIBRARY_PATH=/MY_CONDA_PATH/lib:$LD_LIBRARY_PATH` where MY_CONDA_PATH is the path to the conda environment where libgcc is installed.
The error management is not great but is very much linked to the ALICE launching script, so to edit it, we will need to dive into the ALICE repo quite substantially.
> might it not be simpler to replace https://laac-lscp.github.io/docs/running-models.html#check-that-audio-files-content-is-present & the following section with the instruction to validate the dataset, and saying "if you get "warning, standard not present" then convert the audio". Particularly bc when I did `ls rec/raw` I got a list of folders, so then we'd have to explain how to check down the tree...
I think this was written before we had a check for the standard sampling rate in validate, so it makes sense. And indeed the command needs to be slightly edited to recurse down the tree (changing `p=recording/raw/*` would do it here, I think). So this instruction should be replaced.
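Until the docs are updated, the recursive check can be sketched in plain Python (the demo tree and paths here are invented just to make the snippet runnable):

```python
from pathlib import Path
import tempfile

def list_wavs(root):
    # Recursively collect wav files under root (e.g. recordings/raw),
    # instead of a single-level ls that only shows the top folders
    return sorted(str(p) for p in Path(root).rglob("*.wav"))

# Throwaway tree: one nested wav and one top-level wav
root = Path(tempfile.mkdtemp())
(root / "child1").mkdir()
(root / "child1" / "rec1.wav").touch()
(root / "rec0.wav").touch()

print(len(list_wavs(root)))  # 2
```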
in this section, we're missing the actual command:

```shell
bash job-alice.sh
```
> I suggest that the job-alice.sh is done in a shell, so people can close their window even when it's not finished running
It should be `sbatch job-alice.sh`; submitting the job is what allows them to close the window even when it's not finished. If you ran it with `bash` and are not on a compute node (puck1/4/5), it should fail when finding no gpu.
Thank you for all the comments !! I will put all of the ones to improve clarity in place. For a lot of comments, I think it is putting light on the limitations we have:
- Bash scripts are not super flexible; they limit our ability to use childproject's built-in functions to select the audios, put the output in the right place right away, etc...
- Using childproject functions to compensate for that would help, BUT it would render this guide completely useless for running vtc/alice/vcm on data not formatted into childproject.
I am not sure what approach we should take. Bear in mind that the shell scripts can still be improved by, for example, not having to modify them and accepting runtime arguments instead. But, for example, recursing into directories and listing all wavs would lead to files that are unindexed or discarded by CP still being analyzed by the model, which is suboptimal.
anything that forces us to use childproject well may feel annoying but it's ultimately for our own good. I'll also add that I like having a script rather than accepting arguments, because this at least allows a record of what you did.
Not sure these reactions help in your decision-making!
"find in your output_path your alice annotations" --> "find in your output_path (the dataset_path you modified in the script followed by annotations/alice
your alice annotations"
maybe this suggestion is rather for the childproject installation instructions, but perhaps the section that creates the env should end by deleting env.yml?
same: "If you which to continue using directly Windows, you must do the following:" --> "If you wish to continue using directly Windows, you must do the following:" (but the wording is a bit mean -- so perhaps: If you nonetheless want to try using ChildProject within Windows, please do the following)
I'm trying to install the childproject env on one of the 3 computers I work with. When trying to create the childproject environment I got:
```
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/alejandrinacristia/miniconda3/envs/childproject/include -arch x86_64 -I/Users/alejandrinacristia/miniconda3/envs/childproject/include -arch x86_64 -Ilibsoxr/src -Isrc/soxr -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/private/var/folders/zl/8rc2cx7s2hx6w6c0br6x8swh0000gn/T/pip-build-env-8hvf9ktd/overlay/lib/python3.7/site-packages/numpy/core/include -I/Users/alejandrinacristia/miniconda3/envs/childproject/include/python3.7m -c libsoxr/src/cr.c -o build/temp.macosx-10.9-x86_64-cpython-37/libsoxr/src/cr.o -DSOXR_LIB
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for soxr
ERROR: Could not build wheels for soxr, which is required to install pyproject.toml-based projects
```
my guess is that this has to do with my Xcode being outdated, so I'll try installing it, then
scripts are included as text in the page => it's not always clear that they need to be copied and launched; clarify how to use them
Add comments to this issue for every piece of feedback you would like to provide in regard to the instructions in the handbook