Hi, I'm trying to create a cheatsheet for myself for extracting & pushing clips onto zooniverse. I'll always do this on oberon, so I'll only think of that case.
So far I have:
datalad install git@github.com:LAAC-LSCP/solomon-data.git
cd solomon-data
source ~/ChildProjectVenv/bin/activate
datalad run-procedure setup
But that last step fails:
I think you may have accidentally run datalad run-procedure setup foberon (probably a copy-paste from the doc) right before datalad run-procedure setup (which worked, as you can see, hence why the following steps are working).
Just in case, I pushed along and the next step worked:
datalad get recordings/converted
I'm uncertain as to the following step. I think I should select segments somehow -- but in the docs the next step is the chunkification of the segments.
It is true that you need to provide the segments beforehand, which the doc currently does not clearly state. If you do not provide these segments, you won't be able to go any further.
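In case it helps, here is a minimal sketch of what such a segments file could look like, built with pandas (the column names are the ones the sampling scripts below use; the recording filename and times here are made up):

import pandas as pd

# hypothetical example: two segments from one recording, onsets/offsets in milliseconds
segments = pd.DataFrame([
    {'recording_filename': 'some_recording.WAV', 'segment_onset': 1000, 'segment_offset': 1500},
    {'recording_filename': 'some_recording.WAV', 'segment_onset': 4200, 'segment_offset': 4700},
])
segments.to_csv('segments.csv', index = False)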
About chunkification, I currently have this draft of the command:
cd ..  # to find myself at the same level as solomon-data, since just now I was inside solomon-data
child-project zooniverse extract-chunks solomon-data --keyword talkerNtype --chunks-length 500 --segments segments.csv --destination solomon-data/annotations/zooniverse/raw --batch-size 1000
Should destination be inside solomon-data or somewhere else? What happens if I leave it unspecified? Same for chunks-length & batch-size. Could/should we have a default behavior that means that the chunks will be created inside solomon-data in some place such that the actual mp3/wavs don't get included in the data but the metadata etc does?
You need to define a destination (otherwise, an error will be thrown and the script will stop).
It is up to the user to decide where to store the output. It does not have to be within solomon-data; e.g., if you are developing a separate analysis, you may have imported solomon-data as a subdataset, and the chunks will preferably lie somewhere in your analysis folder. We do not expect every user to push their own chunks to the original dataset in the general case. (Also, honestly, the audio chunks do not need to be kept once they have been uploaded.)
However, in case you want to push the chunks to the dataset, I am not sure annotations is the best-fitting place for that. I think a better design would be to create a 'samples' subfolder, as in here: https://github.com/LAAC-LSCP/ChildProject/issues/148#issue-816256844. You could then set the destination to solomon-data/samples/high-volubility/chunks, for instance.
What is batch-size, actually? Why is it declared in the chunkification stage in addition to the upload stage? I just saw in the upload stage this is optional -- shouldn't it be mandatory (or have a default of 1000) in the upload stage?
Batch size defines how many of the chunks will be grouped and uploaded together. This reproduces the behavior of Chiara's script, and is apparently needed because of Zooniverse upload rate quotas. --batch-size defines how many chunks each batch should contain; at the upload step, you then define how many of these batches will be uploaded. This way, you can upload n batches the first day, then n more batches the second day, etc. Maybe we could avoid that and have only one option specifying how many chunks should be uploaded during the upload - in this case, we could drop the batch system. Let me rethink this!
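In the meantime, to make the current grouping concrete, here is a sketch of the idea (an illustration, not the actual implementation):

# with --batch-size 1000, chunk i simply belongs to batch i // 1000
batch_size = 1000
n_chunks = 2500
batches = [i // batch_size for i in range(n_chunks)]
# chunks 0-999 go to batch 0, 1000-1999 to batch 1, 2000-2499 to batch 2;
# at upload time, you then choose how many of these batches to push per session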
The step after that is chunk upload. Here is my command draft:
child-project zooniverse upload-chunks solomon-data --chunks solomon-data/annotations/zooniverse/raw/chunks.csv --project-id 14957 --set-prefix ac_20210408
Have we decided on a naming convention for the prefix?
We have not. Should we?
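One possibility would be the uploader's initials plus the date, which is what the examples in this thread already do (e.g. ac_20210408). A sketch, assuming that is indeed what the prefix stands for:

from datetime import date

initials = 'ac'  # hypothetical: the uploader's initials
set_prefix = '{}_{}'.format(initials, date.today().strftime('%Y%m%d'))
print(set_prefix)  # e.g. ac_20210408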
The next step is to create a record that I did this, by updating the data. Here is my command draft for that:
cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks"
This is the equivalent of git add & git commit; you will also need to push the data at some point (datalad push).
Eventually, I'll get classifications:
child-project zooniverse retrieve-classifications solomon-data --project-id 14957
And repeat the data update.
cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks - annotated"
You will also have to set the destination for child-project zooniverse retrieve-classifications, e.g.:
child-project zooniverse retrieve-classifications solomon-data --destination solomon-data/samples/high-volubility/classifications_2021-04-10.csv --project-id XXX
PS: you don't need to cd out of solomon-data; you could just do child-project validate . for instance.
PPS: if you write a cheatsheet for Zooniverse, can you please share it? Then I can adapt it into a tutorial for the docs.
I have found a workaround to avoid the batch system, which I implemented in #182.
You can try it by installing the package from:
pip install git+https://github.com/LAAC-LSCP/ChildProject.git@zooniverse/improvements --upgrade
Below you can find the upgraded documentation:
$ child-project zooniverse extract-chunks --help
usage: child-project zooniverse extract-chunks [-h] --keyword KEYWORD
[--chunks-length CHUNKS_LENGTH]
[--chunks-min-amount CHUNKS_MIN_AMOUNT]
--segments SEGMENTS
--destination DESTINATION
[--exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]]
[--threads THREADS]
path
positional arguments:
path path to the dataset
optional arguments:
-h, --help show this help message and exit
--keyword KEYWORD export keyword
--chunks-length CHUNKS_LENGTH
chunk length (in milliseconds). if <= 0, the segments
will not be split into chunks
--chunks-min-amount CHUNKS_MIN_AMOUNT
minimum amount of chunks to extract from a segment
--segments SEGMENTS path to the input segments dataframe
--destination DESTINATION
destination
--exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]
segments to exclude before sampling
--threads THREADS how many threads to run on
$ child-project zooniverse upload-chunks --help
usage: child-project zooniverse upload-chunks [-h] --chunks CHUNKS
--project-id PROJECT_ID
--set-name SET_NAME
[--amount AMOUNT]
[--zooniverse-login ZOONIVERSE_LOGIN]
[--zooniverse-pwd ZOONIVERSE_PWD]
optional arguments:
-h, --help show this help message and exit
--chunks CHUNKS path to the chunk CSV dataframe
--project-id PROJECT_ID
zooniverse project id
--set-name SET_NAME subject set display name
--amount AMOUNT amount of chunks to upload
--zooniverse-login ZOONIVERSE_LOGIN
zooniverse login. If not specified, the program
attempts to get it from the environment variable
ZOONIVERSE_LOGIN instead
--zooniverse-pwd ZOONIVERSE_PWD
zooniverse password. If not specified, the program
attempts to get it from the environment variable
ZOONIVERSE_PWD instead
I have just realised I had forgotten to answer about chunkification. If you do not specify a value for --chunks-length, input segments will currently not be split (because the default value is zero), but we could change the default to a non-zero value (e.g. 500).
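For the record, here is my reading of what the splitting does, as a sketch (the actual code may differ in details, e.g. in how it honors --chunks-min-amount):

def split_segment(onset, offset, chunks_length):
    """Cut [onset, offset] (milliseconds) into consecutive windows of chunks_length ms."""
    if chunks_length <= 0:
        return [(onset, offset)]  # current default: keep the segment whole
    chunks = []
    t = onset
    while t < offset:
        chunks.append((t, min(t + chunks_length, offset)))
        t += chunks_length
    return chunks

print(split_segment(0, 1200, 500))  # [(0, 500), (500, 1000), (1000, 1200)]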
THIS IS THE MOST UP-TO-DATE VERSION OF THE CHEATSHEET -- I have not tested the whole thing & the chunkify section needs a second check. Consider also replacing the scripts with commands in other sections.
This is a cheatsheet for extracting & pushing clips onto zooniverse. It works on oberon; it does not work on my home computer (git-annex cannot be downloaded with my OS; not enough space for the audios).
I've adapted the zoo example python script and the zoo-phon-data script. I created two separate scripts: one for sampling, one for uploading.
I start by installing the dataset.
datalad install git@github.com:LAAC-LSCP/solomon-data.git
cd solomon-data
source ~/ChildProjectVenv/bin/activate
datalad run-procedure setup
Then I get the recordings & the VTC annotations, and validate.
datalad get recordings/converted
datalad get annotations/vtc/converted
child-project validate .
Both of those steps can be skipped if I already have the data.
I'm about to extract many files that can be re-generated if need be, and that take up space + slow down indexing, so even before I generate them, I want to tell DataLad not to pay attention to them. This way, they won't get tracked or pushed. (For more information on avoiding DataLad tracking, look here.) For our purposes, all we need to do is the following:
echo "samples/CHI_FEM/*" >> .gitignore # add the folder that we will create in the next step to the list of folders to ignore
datalad save -m "ignore extracts folder" .gitignore
Then I sample segments, chunkify, and upload.
For sampling, I'll do 250 random CHI vocs + 250 random FEM vocs. I decided to store the sound files in a folder called samples/CHI_FEM/, which I'll push. My adapted script, therefore, looks like this:
#!/usr/bin/env python3
from ChildProject.projects import ChildProject
from ChildProject.pipelines.samplers import RandomVocalizationSampler
import os
import pandas as pd

# load the dataset (the script is run from the root of solomon-data)
project = ChildProject('.')
project.read()

# sample 250 random CHI vocalizations from the VTC annotations
random_sampler = RandomVocalizationSampler(
    project,
    annotation_set = 'vtc',
    target_speaker_type = ['CHI'],
    sample_size = 250
)
random_sampler.sample()
os.makedirs('samples/CHI_FEM/random', exist_ok = True)
random_sampler.segments[['recording_filename', 'segment_onset', 'segment_offset']].to_csv('samples/CHI_FEM/random/samples.csv', index = False)

# sample 250 random FEM vocalizations
random_sampler = RandomVocalizationSampler(
    project,
    annotation_set = 'vtc',
    target_speaker_type = ['FEM'],
    sample_size = 250
)
random_sampler.sample()
random_sampler.segments[['recording_filename', 'segment_onset', 'segment_offset']].to_csv('samples/CHI_FEM/random/samples2.csv', index = False)

# merge both samples into a single samples.csv
a = pd.read_csv('samples/CHI_FEM/random/samples.csv')
b = pd.read_csv('samples/CHI_FEM/random/samples2.csv')
c = pd.concat([a, b], join='outer')
c.to_csv("samples/CHI_FEM/random/samples.csv", index = False)
And I call it like this because all the paths are defined inside the code:
python scripts/sample_segments.py
For chunkification, I'll do 500 ms length and only 2 threads, as I'm on a smaller computer than the cluster. My script looks like this:
#!/usr/bin/env python3
from ChildProject.projects import ChildProject
from ChildProject.pipelines.zooniverse import ZooniversePipeline

# load the dataset (the script is run from the root of solomon-data)
project = ChildProject('.')
project.read()

# cut the sampled segments into 500 ms chunks,
# reading the audio from the 'standard' converted profile
zooniverse = ZooniversePipeline()
chunks_path = zooniverse.extract_chunks(
    path = project.path,
    destination = 'samples/CHI_FEM/random/',
    keyword = 'ac_20210421a',
    segments = 'samples/CHI_FEM/random/samples.csv',
    chunks_length = 500,
    chunks_min_amount = 2,
    threads = 2,
    profile = 'standard'
)
This step takes a while, so to be on the safe side, I first do a screen, activate the environment, and call the script (like this because all the paths are defined inside the code):
screen
source ~/ChildProjectVenv/bin/activate
python scripts/chunkify_segments.py
NOTE! one problem with doing the above is that I didn't overtly define a name for the chunks.csv file to be generated. So alternatively, next time, I could do this instead:
screen
source ~/ChildProjectVenv/bin/activate
child-project zooniverse extract-chunks . --segments 'samples/CHI_FEM/random/samples.csv' --chunks-length 500 --chunks-min-amount 2 --threads 2 --profile 'standard' --keyword 'ac_20210421a' --destination 'samples/CHI_FEM/random/'
For upload, I target our new project and don't batch them as it's no longer needed. I directly call the function:
child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD
The next step is to create a record that I did this, by updating the data. Here is my command draft for that:
datalad save -m "adding record of zoo chunks"
datalad push
Eventually, I'll get classifications:
child-project zooniverse retrieve-classifications solomon-data --destination solomon-data/samples/CHI_FEM/random/classifications_2021-04-10.csv --project-id 14957
And repeat the data update.
datalad save -m "adding record of coded zoo chunks"
datalad push
That seems good (a few details: sample_size should be 500 instead of 250 according to your description, and the destination of zooniverse classifications should be something like samples/random instead of samples/high-volubility for consistency, but these are all details/probably typos).
thanks for the proofing! However, there are a few issues:
I see in the sampler docs that I can specify multiple talkers. If I changed my code to:
random_sampler = RandomVocalizationSampler(
project,
annotation_set = 'vtc',
target_speaker_type = ['CHI','FEM'],
sample_size = 500
)
will I get 250 of each, or no assurance on this?
Nope, it will sample uniformly among the union of CHI and FEM segments. So you need to sample them separately if you want the same amount of each. You can then concat the dataframes and save them into one dataframe if that is more convenient for you.
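For instance, a sketch of that sample-separately-then-concat approach, reusing the exact calls from your script above:

import os
import pandas as pd
from ChildProject.projects import ChildProject
from ChildProject.pipelines.samplers import RandomVocalizationSampler

project = ChildProject('.')
project.read()

columns = ['recording_filename', 'segment_onset', 'segment_offset']
samples = []
for speaker in ['CHI', 'FEM']:
    # 250 random vocalizations per speaker type, sampled separately
    sampler = RandomVocalizationSampler(
        project,
        annotation_set = 'vtc',
        target_speaker_type = [speaker],
        sample_size = 250
    )
    sampler.sample()
    samples.append(sampler.segments[columns])

os.makedirs('samples/CHI_FEM/random', exist_ok = True)
pd.concat(samples).to_csv('samples/CHI_FEM/random/samples.csv', index = False)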
roger! I fixed a couple of typos and I'm close, but:
$ python scripts/sample_segments.py
/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.sample_size)
AttributeError: 'NoneType' object has no attribute 'groupby'
Your script is working for me - at least the sampling part, I have not tested the zooniverse part.
A few suggestions: check that the VTC annotations are actually there (e.g. more annotations/vtc/converted/*).

I tried from oberon, where the error does NOT replicate - but I get a new error. On oberon, I upgraded the package, checked the VTC annotations (they are there, e.g. annotations/vtc/converted/01_CW01_CH01_FB03_FB11_190622_0_0.csv), and tried again, and still get the same oberon-error (not the same error I got on my home pc):
$ python scripts/sample_segments.py
/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.sample_size)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in sample
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in <listcomp>
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/generic.py", line 4993, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 954, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
My naïve reading of the error is that there are fewer vocalizations than the ones I asked for, correct?
You are right. However, this should not happen with the latest version of the package (I can see from the error that the code is outdated). Can you try upgrading again?
pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
$ pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
Collecting git+https://github.com/LAAC-LSCP/ChildProject.git
  Cloning https://github.com/LAAC-LSCP/ChildProject.git to /tmp/pip-req-build-3ogs07x4
  Running command git clone -q https://github.com/LAAC-LSCP/ChildProject.git /tmp/pip-req-build-3ogs07x4
Requirement already satisfied: pandas in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.1.4)
[... "Requirement already satisfied" lines for all remaining dependencies ...]
Collecting importlib-resources
  Downloading importlib_resources-5.1.2-py3-none-any.whl (25 kB)
Installing collected packages: importlib-resources
Successfully installed importlib-resources-5.1.2
(ChildProjectVenv) [acristia@oberon solomon-data]$ python scripts/sample_segments.py
/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.sample_size)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in sample
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in <listcomp>
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/generic.py", line 4993, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 954, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
neither of the following worked, even in a virtual environment:
pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
pip install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
however, uninstalling and reinstalling got rid of the error:
pip uninstall ChildProject
pip install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
Then the script runs.
In the zooniverse section, I got:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 25, in <module>
    threads = 2
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 176, in extract_chunks
    self.chunks = pool.map(self.split_recording, segments)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
Output from ffmpeg/avlib:
ffmpeg version 2.8.15 Copyright (c) 2000-2018 the FFmpeg developers
[... build configuration and library version listing ...]
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from './recordings/raw/01_CW02_CH02_LM03_LM40_190619.WAV':
  Metadata:
    encoder         : Lavf56.40.101
  Duration: 16:42:35.34, bitrate: 128 kb/s
    Stream #0:0: Audio: adpcm_ima_wav ([17][0][0][0] / 0x0011), 16000 Hz, 2 channels, s16p, 128 kb/s
Unknown encoder 'pcm_s4le'
This was because I was using the raw recordings, rather than the converted recordings.
I'm very close, but not quite done! I'm at the step where I extract & upload segments to zooniverse, the upload phase in my cheatsheet.
In oberon, I'm doing:
source ~/ChildProjectVenv/bin/activate
nohup python scripts/zoo_segments.py &
And getting:
Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 3, in <module>
    from ChildProject.projects import ChildProject
ImportError: No module named ChildProject.projects
exported chunks metadata to samples/CHI_FEM/random/chunks_20210424_213335.csv
exported extract-chunks parameters to samples/CHI_FEM/random/parameters_20210424_213335.yml
Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 32, in <module>
    set_prefix = 'ac_20210421'
TypeError: upload_chunks() missing 1 required positional argument: 'set_name'
extracting chunks from ./recordings/converted/standard/01_CW02_CH02_LM03_LM40_190619.WAV...
samples/CHI_FEM/random/chunks/01_CW02_CH02_LM03_LM40_190619_30942616_30943116.wav already exists, exportation skipped.
Note that I added a set_name to my script (although the sample script didn't have this).
Also, datalad save -m "adding record of upload script" is very slow -- probably because I didn't make the right decision regarding where to save the extracts.
Are you sure nohup is preserving the environment?
I would suggest you run the script in a screen instead. You can start a screen by doing screen, then do source ~/ChildProjectVenv/bin/activate and run the script.
You can detach from the screen by doing Ctrl+a d.
You can also do screen -ls to list all running screens, and screen -r [screen] to reattach one of them.
Also, datalad save -m "adding record of upload script" is very slow -- probably because I didn't make the right decision regarding where to save the extracts.

Yes, I think they should not be saved. That's like 200,000 files in your case! Remember you can speed up most datalad operations by using the -J switch, which specifies the number of threads to run (e.g. datalad get -J 4 recordings/converted).
I'm sorry, I'm not sure I understand how to fix the situation and/or how to do this better next time. Let me lay out some possible lessons: one would be that I shouldn't keep samples/ within the folder at all -- but I don't think you're saying this, right? So if I had done things properly, I should have done this before actually creating the samples:
echo "samples/CHI_FEM/*" >> .gitignore
datalad save -m "ignore extracts folder" .gitignore
Sadly, that's not what I did, so now even doing datalad status is super slow because of the zillion files.
I can keep reading the manual, but if you already know a way in which I can fix my previous error, that would be really helpful!
I think the best way is the one you described: you can leave your samples into the dataset, but make sure you add a .gitignore file beforehand.
Now, in order to recover a clean dataset, assuming the chunks were added in the last commit, you can do:
git reset HEAD~1
echo "samples/CHI_FEM/chunks/*" >> .gitignore
datalad save -m "ignore extracts folder" .gitignore
datalad save "samples/CHI_FEM/" -m "adding samples"
(Something like this should work)
For further clean-up, you should remove the dangling chunks from the annex as well (see https://git-annex.branchable.com/walkthrough/unused_data/ -- in short, git annex unused followed by git annex dropunused).
great, and to check whether that's the case, I can do git log -n 1 and look at the message of my last commit
it's the last mile! Last error is:
child-project zooniverse upload-chunks 'samples/CHI_FEM/random/' --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/samples.csv' --project-id 14957
yields:
usage: child-project [-h] {validate,import-annotations,merge-annotations,remove-annotations,rename-annotations,import-data,overview,compute-durations,convert,sampler,zooniverse,eaf-builder,anonymize} ...
child-project: error: unrecognized arguments: samples/CHI_FEM/random/
https://childproject.readthedocs.io/en/latest/zooniverse.html#chunk-upload shows:
child-project zooniverse upload-chunks /path/to/dataset --help
usage: child-project zooniverse upload-chunks [-h] --chunks CHUNKS
                                              --project-id PROJECT_ID
                                              --set-name SET_NAME
I don't see my error, do you?
The error was that the command should have been:
child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD
That did create the subject set in Zooniverse; however, it didn't push any clips. Here is the output:
uploading chunk 1_CW5_CH5_AJ09_AJ10_190710.WAV (23668064,23668564)
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/bin/child-project", line 11, in <module>
    load_entry_point('ChildProject==0.0.1', 'console_scripts', 'child-project')()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 311, in main
    args.func(args)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 31, in <lambda>
    _parser.set_defaults(func = lambda args: cls().run(vars(args)))
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 371, in run
    return self.upload_chunks(**kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 291, in upload_chunks
    subject.save()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/subject.py", line 144, in save
    log_args=False,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/redo/__init__.py", line 170, in retry
    return action(*args, **kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 815, in save
    etag=self.etag
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 404, in post
    retry=retry,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 281, in json_request
    json_response['errors']
panoptes_client.panoptes.PanoptesAPIException: User has uploaded 12778 subjects of 10000 maximum
And a snapshot of the Zooniverse subject section:
It looks like the error is about exceeding the 10k quota.
So I tried again, this time specifying an amount:
child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD --amount 9999
Unfortunately, I get the same error:
uploading chunk 1_CW5_CH5_AJ09_AJ10_190710.WAV (23668064,23668564)
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/bin/child-project", line 11, in <module>
    load_entry_point('ChildProject==0.0.1', 'console_scripts', 'child-project')()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 311, in main
    args.func(args)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 31, in <lambda>
    _parser.set_defaults(func = lambda args: cls().run(vars(args)))
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 371, in run
    return self.upload_chunks(**kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 291, in upload_chunks
    subject.save()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/subject.py", line 144, in save
    log_args=False,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/redo/__init__.py", line 170, in retry
    return action(*args, **kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 815, in save
    etag=self.etag
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 404, in post
    retry=retry,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 281, in json_request
    json_response['errors']
panoptes_client.panoptes.PanoptesAPIException: User has uploaded 12778 subjects of 10000 maximum
My understanding is that the 10,000 quota is a limit for the whole project, and that you have to ask the administrators to have it increased. EDIT: you can add me as an administrator of the project so that I can ask Zooniverse's staff to increase quotas on your behalf.
certainly, we'll ask - in fact, we also need to ask if we can bypass the beta phase (given that we already did it with our other project). But before we do that, I'd like to try out the interface with some sample data.
Is there a way in which I can push up just a few clips? I thought the "amount" flag did that, as in:
child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD --amount 9999
The --amount flag does exactly that - at least it should. But you've reached your project quota already, so even one clip (--amount 1) will be too much
but there are no subjects -- so how can it think that we've gone over our quota?
Also, notice that in my screenshot, it says "The project has 0 uploaded subjects. You have uploaded 0 subjects from an allowance of 10000. Your uploaded subject count is the tally of all subjects (including those deleted) that your account has uploaded through the project builder or Zooniverse API. Please contact us to request changes to your allowance."
Weird! It could be that too many subjects were uploaded the first time and the upload did not complete (because of the exception thrown by the API). So there might be a bunch of dangling subjects uploaded with no subject set. (I don't know, I am really taking wild guesses.)
I'll try to see if there's a way to find invisible subjects like this. In any case, maybe try to have your quota increased and ask Zooniverse about this at the same time. IMO this should be considered a bug.
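For what it's worth, here is an untested sketch of how one could at least count the subjects attached to the project with panoptes-client (I am not sure dangling subjects would show up this way):

from panoptes_client import Panoptes, Subject

# connect with your Zooniverse credentials
Panoptes.connect(username='acristia', password='MYPASSWORD')

# iterate over the subjects attached to the project, whatever their subject set
count = sum(1 for _ in Subject.where(project_id=14957))
print('{} subjects found for project 14957'.format(count))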
I realised I have access to your project, so I can take care of it. How urgent is this?
not urgent, but if we could get a couple of subjects in there, so I can test the project's interface, that would unblock me to ask them for permission etc.
Well, I just managed to get chunks through, on the same project and subject set. Can you give it another try? Try low values for --amount (e.g. 1 to begin with).
note that this affects my account specifically (not the project): https://www.zooniverse.org/talk/18/2002495?comment=3266579&page=1
This cheatsheet is outdated! Look at https://gin.g-node.org/LAAC-LSCP/zoo-campaign#comparing-zooniverse-annotations-with-other-annotations instead.