LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

creating a cheatsheet for extracting & pushing clips onto zooniverse #181

Closed: alecristia closed this issue 3 years ago

alecristia commented 3 years ago

Hi, I'm trying to create a cheatsheet for myself for extracting & pushing clips onto Zooniverse. I'll always do this on oberon, so I'll only consider that case.

So far I have:

datalad install git@github.com:LAAC-LSCP/solomon-data.git
cd solomon-data
source ~/ChildProjectVenv/bin/activate
datalad run-procedure setup

But that last step fails:

[INFO ] Running procedure setup
[INFO ] == Command start (output follows) =====
[INFO ] Could not enable annex remote cluster. This is expected if cluster is a pure Git remote, or happens if it is not accessible.
Traceback (most recent call last):
  File "/scratch1/home/acristia/solomon-data/.datalad/procedures/setup.py", line 25, in <module>
    url = url
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/interface/utils.py", line 482, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/interface/utils.py", line 470, in return_func
    results = list(results)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/interface/utils.py", line 401, in generator_func
    allkwargs):
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/interface/utils.py", line 557, in _process_results
    for res in results:
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/distribution/siblings.py", line 265, in __call__
    res_kwargs):
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/distribution/siblings.py", line 588, in _configure_remote
    ds.repo.set_preferred_content(prop, var, '.' if name == 'here' else name)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/support/annexrepo.py", line 2570, in set_preferred_content
    return self.call_annex_oneline([property, remote or '.', expr])
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/support/annexrepo.py", line 1296, in call_annex_oneline
    l for l in self.call_annex_items_(args, files=files)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/support/annexrepo.py", line 1296, in <listcomp>
    l for l in self.call_annex_items_(args, files=files)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/support/annexrepo.py", line 1258, in call_annex_items_
    protocol=StdOutErrCapture)['stdout']
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/support/annexrepo.py", line 987, in _call_annex
    **kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/datalad/cmd.py", line 412, in run
    results,
datalad.support.exceptions.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none annex wanted cluster 'include=' -c annex.dotfiles=true -c 'remote.origin.annex-ssh-options=-o ControlMaster=auto -S /scratch1/home/acristia/.cache/datalad/sockets/bd68b2c3' -c annex.retry=3 -c 'remote.cluster.annex-ssh-options=-o ControlMaster=auto -S /scratch1/home/acristia/.cache/datalad/sockets/0199f269'' failed with exitcode 1 under /scratch1/home/acristia/solomon-data [err: 'Unable to parse git config from cluster
ssh: Could not resolve hostname foberon: Name or service not known
ConnectionOpenFailedError: 'ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=/scratch1/home/acristia/.cache/datalad/sockets/0199f269 foberon' failed with exitcode 255 [Failed to open SSH connection (could not start ControlMaster process)]
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
git-annex: cannot determine uuid for cluster (perhaps you need to run "git annex sync"?)']
[INFO ] == Command exit (modification check follows) =====
CommandError: '/scratch1/home/acristia/solomon-data/.datalad/procedures/setup.py /scratch1/home/acristia/solomon-data foberon' failed with exitcode 1 under /scratch1/home/acristia/solomon-data

(ChildProjectVenv) [acristia@oberon solomon-data]$ datalad run-procedure setup
[INFO ] Running procedure setup
[INFO ] == Command start (output follows) =====
.: cluster(+) [/scratch1/data/laac_data/solomon-data (git)]
[INFO ] Configure additional publication dependency on "cluster"
.: origin(-) [git@github.com:LAAC-LSCP/solomon-data.git (git)]
[INFO ] == Command exit (modification check follows) =====

Just in case, I pushed along and the next step worked:

datalad get recordings/converted

I'm uncertain as to the following step. I think I should select segments somehow -- but in the docs the next step is the chunkification of the segments.

About chunkification, I currently have this draft of the command:

cd .. #to find myself at the same level as solomon-data, since just now I was inside solomon-data
child-project zooniverse extract-chunks solomon-data --keyword talkerNtype --chunks-length 500 --segments segments.csv --destination solomon-data/annotations/zooniverse/raw --batch-size 1000

Should destination be inside solomon-data or somewhere else? What happens if I leave it unspecified? Same for chunks-length & batch-size. Could/should we have a default behavior that means that the chunks will be created inside solomon-data in some place such that the actual mp3/wavs don't get included in the data but the metadata etc does?

What is batch-size, actually? Why is it declared in the chunkification stage in addition to the upload stage? I just saw in the upload stage this is optional -- shouldn't it be mandatory (or have a default of 1000) in the upload stage?

The step after that is chunk upload. Here is my command draft:

child-project zooniverse upload-chunks solomon-data --chunks solomon-data/annotations/zooniverse/raw/chunks.csv
                                              --project-id 14957
                                              --set-prefix ac_20210408

Have we decided on a naming convention for the prefix?

The next step is to create a record that I did this, but updating the data. Here is my command draft for that:

cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks"

Eventually, I'll get classifications:

child-project zooniverse retrieve-classifications solomon-data --project-id 14957

And repeat the data update.

cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks - annotated"
lucasgautheron commented 3 years ago

Hi, I'm trying to create a cheatsheet for myself for extracting & pushing clips onto Zooniverse. I'll always do this on oberon, so I'll only consider that case.

So far I have:

datalad install git@github.com:LAAC-LSCP/solomon-data.git
cd solomon-data
source ~/ChildProjectVenv/bin/activate
datalad run-procedure setup

But that last step fails:

I think you may have accidentally run datalad run-procedure setup foberon (probably a copy-paste from the doc) right before datalad run-procedure setup, which worked, as you can see; hence the following steps are working.

Just in case, I pushed along and the next step worked:

datalad get recordings/converted

I'm uncertain as to the following step. I think I should select segments somehow -- but in the docs the next step is the chunkification of the segments.

It is true that you need to provide segments beforehand, which the doc currently does not clearly state. But if you do not provide these segments, you won't be able to go any further.
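For context (this is not spelled out above, so treat it as an assumption to check against the docs): the segments file is just a CSV with one row per segment, using the same columns as the sampler output shown later in this thread. A minimal sketch of building one by hand:

```python
import pandas as pd

# Hypothetical hand-picked segments; onsets/offsets in milliseconds.
# Column names follow the sampler output used later in this thread.
segments = pd.DataFrame(
    [
        {"recording_filename": "rec01.wav", "segment_onset": 12000, "segment_offset": 12500},
        {"recording_filename": "rec02.wav", "segment_onset": 45000, "segment_offset": 45800},
    ]
)
segments.to_csv("segments.csv", index=False)
```

The resulting segments.csv is what would be passed to --segments.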

About chunkification, I currently have this draft of the command:

cd .. #to find myself at the same level as solomon-data, since just now I was inside solomon-data
child-project zooniverse extract-chunks solomon-data --keyword talkerNtype --chunks-length 500 --segments segments.csv --destination solomon-data/annotations/zooniverse/raw --batch-size 1000

Should destination be inside solomon-data or somewhere else? What happens if I leave it unspecified? Same for chunks-length & batch-size. Could/should we have a default behavior that means that the chunks will be created inside solomon-data in some place such that the actual mp3/wavs don't get included in the data but the metadata etc does?

You need to define a destination (otherwise, an error will be thrown, and the script will stop)

It is up to the user to decide where to store the output. It might not be within solomon-data; e.g., if you are developing a separate analysis, you may have imported solomon-data as a subdataset, and the chunks will preferably live somewhere in your analysis folder. We do not expect every user to push their own chunks to the original dataset in the general case. (Also, honestly, the audio chunks do not need to be kept once they have been uploaded.)

However, in case you want to push the chunks to the dataset, I am not sure annotations is the best fitting place for that. I think a better design would be to create a 'samples' subfolder, as in here: https://github.com/LAAC-LSCP/ChildProject/issues/148#issue-816256844

You could then set the destination to solomon-data/samples/high-volubility/chunks for instance

What is batch-size, actually? Why is it declared in the chunkification stage in addition to the upload stage? I just saw in the upload stage this is optional -- shouldn't it be mandatory (or have a default of 1000) in the upload stage?

Batch size defines how many of the chunks will be grouped and uploaded together. This reproduces the behavior of Chiara's script, which is apparently needed because of Zooniverse upload rate quotas. --batch-size defines how many chunks each batch should contain. At the upload step, you then decide how many of these batches will be uploaded. This way, you can upload n batches the first day, then n more batches the second day, etc. Maybe we could avoid that and have only one option to specify how many chunks should be uploaded during the upload; in that case, we could drop the batch system. Let me rethink this!
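To make the arithmetic concrete, here is a toy sketch (not the package's code) of how the batch size at extraction time interacts with the number of batches chosen at upload time:

```python
# Illustrative only: split extracted chunks into fixed-size batches
# (what --batch-size controls), then "upload" a chosen number of
# batches per session (what the upload step controls).
def make_batches(chunks, batch_size):
    return [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]

chunks = [f"chunk_{i}.mp3" for i in range(2500)]  # pretend 2500 chunks were extracted
batches = make_batches(chunks, 1000)              # --batch-size 1000 -> 3 batches
first_session = batches[:2]                       # upload 2 batches today, 1 left for later
```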

The step after that is chunk upload. Here is my command draft:

child-project zooniverse upload-chunks solomon-data --chunks solomon-data/annotations/zooniverse/raw/chunks.csv
                                              --project-id 14957
                                              --set-prefix ac_20210408

Have we decided on a naming convention for the prefix?

We have not. Should we?

The next step is to create a record that I did this, but updating the data. Here is my command draft for that:

cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks"

This is the equivalent of git add & git commit; you will also need to push the data at some point (datalad push).

Eventually, I'll get classifications:

child-project zooniverse retrieve-classifications solomon-data --project-id 14957

And repeat the data update.

cd solomon-data
datalad save annotations/zooniverse/raw -m "adding record of zoo chunks - annotated"

You will also have to set the destination for child-project zooniverse retrieve-classifications, e.g.:

child-project zooniverse retrieve-classifications solomon-data --destination solomon-data/samples/high-volubility/classifications_2021-04-10.csv --project-id XXX

PS: you don't need to cd out of solomon-data; you could just do child-project validate . for instance.

PPS: if you write a cheatsheet for Zooniverse, can you please share it? Then I can adapt it into a tutorial for the docs.

lucasgautheron commented 3 years ago

I have found a workaround to avoid the batch system, which I implemented in #182 .

You can try it by installing the package from:

pip install git+https://github.com/LAAC-LSCP/ChildProject.git@zooniverse/improvements --upgrade

Below you can find the upgraded documentation:

$ child-project zooniverse extract-chunks --help
usage: child-project zooniverse extract-chunks [-h] --keyword KEYWORD
                                               [--chunks-length CHUNKS_LENGTH]
                                               [--chunks-min-amount CHUNKS_MIN_AMOUNT]
                                               --segments SEGMENTS
                                               --destination DESTINATION
                                               [--exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]]
                                               [--threads THREADS]
                                               path

positional arguments:
  path                  path to the dataset

optional arguments:
  -h, --help            show this help message and exit
  --keyword KEYWORD     export keyword
  --chunks-length CHUNKS_LENGTH
                        chunk length (in milliseconds). if <= 0, the segments
                        will not be split into chunks
  --chunks-min-amount CHUNKS_MIN_AMOUNT
                        minimum amount of chunks to extract from a segment
  --segments SEGMENTS   path to the input segments dataframe
  --destination DESTINATION
                        destination
  --exclude-segments EXCLUDE_SEGMENTS [EXCLUDE_SEGMENTS ...]
                        segments to exclude before sampling
  --threads THREADS     how many threads to run on
$ child-project zooniverse upload-chunks --help
usage: child-project zooniverse upload-chunks [-h] --chunks CHUNKS
                                              --project-id PROJECT_ID
                                              --set-name SET_NAME
                                              [--amount AMOUNT]
                                              [--zooniverse-login ZOONIVERSE_LOGIN]
                                              [--zooniverse-pwd ZOONIVERSE_PWD]

optional arguments:
  -h, --help            show this help message and exit
  --chunks CHUNKS       path to the chunk CSV dataframe
  --project-id PROJECT_ID
                        zooniverse project id
  --set-name SET_NAME   subject set display name
  --amount AMOUNT       amount of chunks to upload
  --zooniverse-login ZOONIVERSE_LOGIN
                        zooniverse login. If not specified, the program
                        attempts to get it from the environment variable
                        ZOONIVERSE_LOGIN instead
  --zooniverse-pwd ZOONIVERSE_PWD
                        zooniverse password. If not specified, the program
                        attempts to get it from the environment variable
                        ZOONIVERSE_PWD instead
lucasgautheron commented 3 years ago

I have just realised I had forgotten to answer about chunkification. If you do not specify a value for --chunks-length, currently, input segments will not be split (because the default value is zero). But we could change the default to a non-zero value (e.g. 500).
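For illustration, the splitting behavior described above can be sketched like this (a toy version, not the package's implementation; times in milliseconds):

```python
def split_into_chunks(onset, offset, chunks_length):
    """Split a [onset, offset] segment into consecutive chunks of
    chunks_length ms; with chunks_length <= 0 (the current default),
    the segment is kept whole. The last chunk may be shorter."""
    if chunks_length <= 0:
        return [(onset, offset)]
    return [(start, min(start + chunks_length, offset))
            for start in range(onset, offset, chunks_length)]
```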

alecristia commented 3 years ago

THIS IS THE MOST UP TO DATE VERSION OF THE CHEAT SHEET -- NOT TESTED THE WHOLE THING & chunkify section needs a second check. Consider also replacing the scripts with commands in other sections

cheatsheet for zooniverse clip pushing

This is a cheatsheet for extracting & pushing clips onto zooniverse. It works on oberon; it does not work on my home computer (git-annex cannot be downloaded with my OS; not enough space for the audios).

I've adapted the zoo example python script and the zoo-phon-data script. I created two separate scripts: one for sampling, one for uploading.

preparation

I start by installing the dataset.

datalad install git@github.com:LAAC-LSCP/solomon-data.git
cd solomon-data
source ~/ChildProjectVenv/bin/activate
datalad run-procedure setup

Then I get the recordings & the VTC annotations, and validate.

datalad get recordings/converted
datalad get annotations/vtc/converted
child-project validate .

Both of those steps can be skipped if I already have the data.

Preparing the folder

I'm about to extract many files that can be re-generated if need be, and that take up space and slow down indexing, so even before I generate them, I want to tell DataLad not to pay attention to them. This way, they won't get tracked or pushed. For more information on avoiding DataLad tracking, look here. For our purposes, all we need to do is the following:

echo "samples/CHI_FEM/*" >> .gitignore  # add the folder that we will create in the next step to the list of folders to ignore
datalad save -m "ignore extracts folder" .gitignore 

sampling

Then I sample segments, chunkify, and upload.

For sampling, I'll do 250 random CHI vocs + 250 random FEM vocs. I decided to store the sound files in a folder called samples/CHI_FEM/, which I'll push. My adapted script, therefore, looks like this:

#!/usr/bin/env python3
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
from ChildProject.pipelines.zooniverse import ZooniversePipeline
from ChildProject.pipelines.samplers import RandomVocalizationSampler

import argparse
import os
import pandas as pd

project = ChildProject('.')
project.read()

# Sample 250 random CHI vocalizations from the VTC annotations
random_sampler = RandomVocalizationSampler(
    project,
    annotation_set = 'vtc',
    target_speaker_type = ['CHI'],
    sample_size = 250
)
random_sampler.sample()
os.makedirs('samples/CHI_FEM/random', exist_ok = True)
random_sampler.segments[['recording_filename', 'segment_onset', 'segment_offset']].to_csv('samples/CHI_FEM/random/samples.csv', index = False)

# Sample 250 random FEM vocalizations the same way
random_sampler = RandomVocalizationSampler(
    project,
    annotation_set = 'vtc',
    target_speaker_type = ['FEM'],
    sample_size = 250
)
random_sampler.sample()
random_sampler.segments[['recording_filename', 'segment_onset', 'segment_offset']].to_csv('samples/CHI_FEM/random/samples2.csv', index = False)

# Merge the CHI and FEM samples into a single file
# (overwrites samples.csv with the combined set)
a = pd.read_csv('samples/CHI_FEM/random/samples.csv')
b = pd.read_csv('samples/CHI_FEM/random/samples2.csv')
c = pd.concat([a, b], join='outer')
c.to_csv("samples/CHI_FEM/random/samples.csv", index = False)

And I call it like this because all the paths are defined inside the code:

python scripts/sample_segments.py 

chunkify (not tested)

For chunkification, I'll do 500 ms length and only 2 threads as I'm in a smaller computer than the cluster. My script looks like this:

#!/usr/bin/env python3
from ChildProject.projects import ChildProject
from ChildProject.annotations import AnnotationManager
from ChildProject.pipelines.zooniverse import ZooniversePipeline

import argparse
import os
import pandas as pd

project = ChildProject('.')
project.read()

zooniverse = ZooniversePipeline()

chunks_path = zooniverse.extract_chunks(
    path = project.path,
    destination = 'samples/CHI_FEM/random/',
    keyword = 'ac_20210421a',
    segments = 'samples/CHI_FEM/random/samples.csv',
    chunks_length = 500,
    chunks_min_amount = 2,
    threads = 2,
    profile = 'standard'
)

This step takes a while, so to be on the safe side, I first do a screen, activate the environment, and call the script (like this because all the paths are defined inside the code):

screen
source ~/ChildProjectVenv/bin/activate
python scripts/chunkify_segments.py 

NOTE! One problem with doing the above is that I didn't explicitly define a name for the chunks.csv file to be generated. So alternatively, next time, I could do this instead:

screen
source ~/ChildProjectVenv/bin/activate
child-project zooniverse extract-chunks . --segments 'samples/CHI_FEM/random/samples.csv' --chunks-length 500 --chunks-min-amount 2 --threads 2 --profile 'standard' --keyword 'ac_20210421a' --destination  'samples/CHI_FEM/random/'

upload

For upload, I target our new project and don't batch them as it's no longer needed. I directly call the function:

child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD

record actions

The next step is to create a record that I did this, but updating the data. Here is my command draft for that:

datalad save -m "adding record of zoo chunks"
datalad push

get classifications

Eventually, I'll get classifications:

child-project zooniverse retrieve-classifications solomon-data --destination solomon-data/samples/CHI_FEM/random/classifications_2021-04-10.csv --project-id 14957

And repeat the data update.

datalad save -m "adding record of coded zoo chunks"
datalad push
lucasgautheron commented 3 years ago

That seems good (a few details: sample_size should be 500 instead of 250 according to your description, and the destination of zooniverse classifications should be something like samples/random instead of samples/high-volubility for consistency, but these are all details/probably typos).

However, there are a few issues:

alecristia commented 3 years ago

thanks for the proofing!

I see in the sampler docs that I can specify multiple talkers. If I changed my code to:

random_sampler = RandomVocalizationSampler(
    project,
    annotation_set = 'vtc',
    target_speaker_type = ['CHI','FEM'],
    sample_size = 500
)

will I get 250 of each, or no assurance on this?

lucasgautheron commented 3 years ago

Nope, it will sample uniformly among the union of CHI and FEM segments.

So you need to sample them separately if you want the same amount of each.

You can then concat the dataframes and save them as one dataframe, if that is more convenient for you.
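In plain pandas terms, the "sample separately, then concat" suggestion looks like this sketch (toy data; the speaker_type column name is an assumption based on VTC-style annotations):

```python
import pandas as pd

# Toy annotation table standing in for the VTC segments.
segments = pd.DataFrame({
    "speaker_type": ["CHI"] * 10 + ["FEM"] * 10,
    "segment_onset": range(20),
})

# Sample the same number of segments per speaker type, then merge,
# guaranteeing a balanced sample (unlike one pooled .sample() call).
per_type = 3
balanced = pd.concat(
    [segments[segments["speaker_type"] == t].sample(per_type, random_state=0)
     for t in ["CHI", "FEM"]],
    ignore_index=True,
)
```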

alecristia commented 3 years ago

roger! I fixed a couple of typos and I'm close, but:

$ python scripts/sample_segments.py

/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/Users/acristia/ChildProjectVenv/lib/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.sample_size)
AttributeError: 'NoneType' object has no attribute 'groupby'

lucasgautheron commented 3 years ago

Your script is working for me - at least the sampling part, I have not tested the zooniverse part.

A few suggestions:

alecristia commented 3 years ago

I tried from oberon, where the error does NOT replicate -- but I get a new error. On oberon, I upgraded the package, checked the VTC annotations (they are there, e.g. annotations/vtc/converted/01_CW01_CH01_FB03_FB11_190622_0_0.csv), and tried again, and I still get the same oberon error (not the error I got on my home PC):

$ python scripts/sample_segments.py

/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.sample_size)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in sample
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in <genexpr>
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/generic.py", line 4993, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 954, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

My naïve reading of the error is that there are fewer vocalizations than the ones I asked for, correct?

lucasgautheron commented 3 years ago

You are right. However, this should not happen with the latest version of the package (I can see from the error that the code is outdated).

Can you try upgrading again?

pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
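For reference, the ValueError above comes from pandas refusing to sample more rows than a group contains without replacement; a defensive sketch (illustrative only, not necessarily the package's actual fix) simply caps the request per recording:

```python
import pandas as pd

def safe_sample(group, n, random_state=None):
    # Never request more rows than the group actually has,
    # which avoids "Cannot take a larger sample than population".
    return group.sample(min(n, len(group)), random_state=random_state)

# Toy data: recording "a" has only 2 segments, "b" has 5.
df = pd.DataFrame({
    "recording_filename": ["a"] * 2 + ["b"] * 5,
    "segment_onset": range(7),
})
sampled = (df.groupby("recording_filename", group_keys=False)
             .apply(lambda g: safe_sample(g, 3, random_state=0)))
```

With a request of 3 per recording, "a" contributes all 2 of its rows and "b" contributes 3, instead of the whole call failing.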
alecristia commented 3 years ago

$ pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade

Collecting git+https://github.com/LAAC-LSCP/ChildProject.git
  Cloning https://github.com/LAAC-LSCP/ChildProject.git to /tmp/pip-req-build-3ogs07x4
  Running command git clone -q https://github.com/LAAC-LSCP/ChildProject.git /tmp/pip-req-build-3ogs07x4
Requirement already satisfied: pandas in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.1.4)
Requirement already satisfied: xlrd in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.2.0)
Requirement already satisfied: jinja2 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (2.11.2)
Requirement already satisfied: numpy>=1.16.5 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.19.4)
Requirement already satisfied: pympi-ling in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.69)
Requirement already satisfied: lxml in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (4.6.3)
Requirement already satisfied: sox in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.4.1)
Requirement already satisfied: datalad in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (0.14.1)
Requirement already satisfied: requests<2.25.0 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (2.24.0)
Requirement already satisfied: PyYAML in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (5.4.1)
Requirement already satisfied: panoptes-client in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (1.3.0)
Requirement already satisfied: pydub in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from ChildProject==0.0.1) (0.25.1)
Collecting importlib-resources
  Downloading importlib_resources-5.1.2-py3-none-any.whl (25 kB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from requests<2.25.0->ChildProject==0.0.1) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from requests<2.25.0->ChildProject==0.0.1) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from requests<2.25.0->ChildProject==0.0.1) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from requests<2.25.0->ChildProject==0.0.1) (2020.12.5)
Requirement already satisfied: boto in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (2.49.0)
Requirement already satisfied: iso8601 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (0.1.14)
Requirement already satisfied: PyGithub in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.54.1)
Requirement already satisfied: appdirs in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.4.4)
Requirement already satisfied: whoosh in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (2.7.4)
Requirement already satisfied: patool>=1.7 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.12)
Requirement already satisfied: humanize in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (3.3.0)
Requirement already satisfied: annexremote in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.5.0)
Requirement already satisfied: fasteners>=0.14 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (0.16)
Requirement already satisfied: keyring>=8.0 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (23.0.1)
Requirement already satisfied: keyrings.alt in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (4.0.2)
Requirement already satisfied: msgpack in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.0.2)
Requirement already satisfied: tqdm in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (4.59.0)
Requirement already satisfied: jsmin in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (2.2.2)
Requirement already satisfied: simplejson in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (3.17.2)
Requirement already satisfied: wrapt in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from datalad->ChildProject==0.0.1) (1.12.1)
Requirement already satisfied: six in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from fasteners>=0.14->datalad->ChildProject==0.0.1) (1.15.0)
Requirement already satisfied: jeepney>=0.4.2 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from keyring>=8.0->datalad->ChildProject==0.0.1) (0.6.0)
Requirement already satisfied: SecretStorage>=3.2 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from keyring>=8.0->datalad->ChildProject==0.0.1) (3.3.1)
Requirement already satisfied: importlib-metadata>=3.6 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from keyring>=8.0->datalad->ChildProject==0.0.1) (3.10.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from importlib-metadata>=3.6->keyring>=8.0->datalad->ChildProject==0.0.1) (3.7.4.3)
Requirement already satisfied: zipp>=0.5 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from importlib-metadata>=3.6->keyring>=8.0->datalad->ChildProject==0.0.1) (3.4.1)
Requirement already satisfied: cryptography>=2.0 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from SecretStorage>=3.2->keyring>=8.0->datalad->ChildProject==0.0.1) (3.4.7)
Requirement already satisfied: cffi>=1.12 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from cryptography>=2.0->SecretStorage>=3.2->keyring>=8.0->datalad->ChildProject==0.0.1) (1.14.5)
Requirement already satisfied: pycparser in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from cffi>=1.12->cryptography>=2.0->SecretStorage>=3.2->keyring>=8.0->datalad->ChildProject==0.0.1) (2.20)
Requirement already satisfied: future in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from annexremote->datalad->ChildProject==0.0.1) (0.18.2)
Requirement already satisfied: setuptools in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from humanize->datalad->ChildProject==0.0.1) (40.6.2)
Requirement already satisfied: MarkupSafe>=0.23 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from jinja2->ChildProject==0.0.1) (1.1.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from pandas->ChildProject==0.0.1) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from pandas->ChildProject==0.0.1) (2020.4)
Requirement already satisfied: python-magic<0.5,>=0.4 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from panoptes-client->ChildProject==0.0.1) (0.4.22)
Requirement already satisfied: redo>=1.7 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from panoptes-client->ChildProject==0.0.1) (2.0.4)
Requirement already satisfied: pyjwt<2.0 in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from PyGithub->datalad->ChildProject==0.0.1) (1.7.1)
Requirement already satisfied: deprecated in /scratch1/home/acristia/ChildProjectVenv/lib/python3.6/site-packages (from PyGithub->datalad->ChildProject==0.0.1) (1.2.12)
Installing collected packages: importlib-resources
Successfully installed importlib-resources-5.1.2

(ChildProjectVenv) [acristia@oberon solomon-data]$ python scripts/sample_segments.py

/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/frame.py:4174: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
Traceback (most recent call last):
  File "scripts/sample_segments.py", line 20, in <module>
    random_sampler.sample()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/samplers.py", line 180, in sample
    self.segments = self.segments.groupby('recording_filename').sample(self.samplesize)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in sample
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2865, in <listcomp>
    for (_, obj), w in zip(self, ws)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/pandas/core/generic.py", line 4993, in sample
    locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 954, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'
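For reference, the ValueError at the end of that trace is pandas refusing to draw more rows than a group contains when replace=False (the default), whatever triggered it upstream in samplers.py. A minimal sketch with made-up data (filenames and the per-group cap are illustrative only):

```python
import pandas as pd

# Hypothetical segments table: "a.wav" has 2 segments, "b.wav" has only 1.
segments = pd.DataFrame({
    "recording_filename": ["a.wav", "a.wav", "b.wav"],
    "segment_onset": [0, 10, 0],
})

# Asking for 2 samples per recording fails, because group "b.wav" has 1 row.
try:
    segments.groupby("recording_filename").sample(2)
except ValueError as err:
    print("sampling failed:", err)

# One possible workaround: cap the sample size at each group's size.
capped = segments.groupby("recording_filename", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 2))
)
print(len(capped))  # 3
```

So the same exception surfaces whenever the pool of candidate segments for some recording is smaller than the requested sample size.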

alecristia commented 3 years ago

neither of the following worked, even in a virtual environment:

pip3 install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade
pip install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade

However, uninstalling and reinstalling got rid of the error:

pip uninstall ChildProject
pip install git+https://github.com/LAAC-LSCP/ChildProject.git --upgrade

Then the script runs.
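To confirm which copy of the package a script will actually import after a reinstall like this, generic checks such as the following can help (these are standard pip/python one-liners, not ChildProject-specific tooling):

```shell
source ~/ChildProjectVenv/bin/activate
pip show ChildProject                                           # installed version and location
python -c "import ChildProject; print(ChildProject.__file__)"   # which copy gets imported
```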

alecristia commented 3 years ago

in the zooniverse section, I got

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 25, in <module>
    threads = 2
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 176, in extract_chunks
    self.chunks = pool.map(self.split_recording, segments)
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pydub.exceptions.CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1

Output from ffmpeg/avlib:

ffmpeg version 2.8.15 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 4.8.5 (GCC) 20150623 (Red Hat 4.8.5-28)
  configuration: --prefix=/usr --bindir=/usr/bin --datadir=/usr/share/ffmpeg --incdir=/usr/include/ffmpeg --libdir=/usr/lib64 --mandir=/usr/share/man --arch=x86_64 --optflags='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic' --extra-ldflags='-Wl,-z,relro ' --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libvo-amrwbenc --enable-version3 --enable-bzlib --disable-crystalhd --enable-gnutls --enable-ladspa --enable-libass --enable-libcdio --enable-libdc1394 --disable-indev=jack --enable-libfreetype --enable-libgsm --enable-libmp3lame --enable-openal --enable-libopenjpeg --enable-libopus --enable-libpulse --enable-libschroedinger --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libvorbis --enable-libv4l2 --enable-libx264 --enable-libx265 --enable-libxvid --enable-x11grab --enable-avfilter --enable-avresample --enable-postproc --enable-pthreads --disable-static --enable-shared --enable-gpl --disable-debug --disable-stripping --shlibdir=/usr/lib64 --enable-runtime-cpudetect
  libavutil      54. 31.100 / 54. 31.100
  libavcodec     56. 60.100 / 56. 60.100
  libavformat    56. 40.101 / 56. 40.101
  libavdevice    56.  4.100 / 56.  4.100
  libavfilter     5. 40.101 /  5. 40.101
  libavresample   2.  1.  0 /  2.  1.  0
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  2.101 /  1.  2.101
  libpostproc    53.  3.100 / 53.  3.100
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from './recordings/raw/01_CW02_CH02_LM03_LM40_190619.WAV':
  Metadata:
    encoder : Lavf56.40.101
  Duration: 16:42:35.34, bitrate: 128 kb/s
    Stream #0:0: Audio: adpcm_ima_wav ([17][0][0][0] / 0x0011), 16000 Hz, 2 channels, s16p, 128 kb/s
Unknown encoder 'pcm_s4le'

This was because I was using the raw recordings, rather than the converted recordings.
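One way to spot this before extracting is to inspect the recording's codec with ffprobe (the path below is the one from the trace; adjust to your layout):

```shell
# The raw recording here is ADPCM (adpcm_ima_wav), which is what made
# pydub/ffmpeg fail; the files under recordings/converted/standard/
# should be plain PCM and decode fine.
ffprobe -v error -select_streams a:0 \
    -show_entries stream=codec_name,sample_rate,channels \
    -of default=noprint_wrappers=1 \
    './recordings/raw/01_CW02_CH02_LM03_LM40_190619.WAV'
```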

alecristia commented 3 years ago

I'm very close, but not quite done! I'm at the step where I extract & upload segments to Zooniverse, the upload phase in my cheatsheet.

On oberon, I'm doing:

source ~/ChildProjectVenv/bin/activate
nohup python scripts/zoo_segments.py &

And getting:

Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 3, in <module>
    from ChildProject.projects import ChildProject
ImportError: No module named ChildProject.projects
exported chunks metadata to samples/CHI_FEM/random/chunks_20210424_213335.csv
exported extract-chunks parameters to samples/CHI_FEM/random/parameters_20210424_213335.yml
Traceback (most recent call last):
  File "scripts/zoo_segments.py", line 32, in <module>
    set_prefix = 'ac_20210421'
TypeError: upload_chunks() missing 1 required positional argument: 'set_name'
extracting chunks from ./recordings/converted/standard/01_CW02_CH02_LM03_LM40_190619.WAV...
samples/CHI_FEM/random/chunks/01_CW02_CH02_LM03_LM40_190619_30942616_30943116.wav already exists, exportation skipped.

Note that I added a set_name to my script (although the sample script didn't have this).

alecristia commented 3 years ago

Also, datalad save -m "adding record of upload script" is very slow -- probably because I didn't make the right decision regarding where to save the extracts.

lucasgautheron commented 3 years ago

Are you sure nohup is preserving the environment?

I would suggest running the script in a screen instead. You can start a screen by doing screen, then do source ~/ChildProjectVenv/bin/activate and run the script.

You can detach from the screen by pressing Ctrl+a, then d.

You can also do screen -ls to list all running screens, and screen -r [screen] to reattach to one of them.
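Put together, the suggested workflow looks like this (the session name is just an example):

```shell
screen -S zooniverse                     # start a named screen session
source ~/ChildProjectVenv/bin/activate   # activate the venv inside it
python scripts/zoo_segments.py           # run the script
# Detach with Ctrl+a then d; the script keeps running in the background.
screen -ls                               # list running sessions
screen -r zooniverse                     # reattach later
```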

lucasgautheron commented 3 years ago

Also, datalad save -m "adding record of upload script" is very slow -- probably because I didn't make the right decision regarding where to save the extracts.

Yes, I think they should not be saved. That's like 200,000 files in your case! Remember you can speed up most datalad operations with the -J switch, which sets the number of parallel jobs.

alecristia commented 3 years ago

Also, datalad save -m "adding record of upload script" is very slow -- probably because I didn't make the right decision regarding where to save the extracts.

Yes, I think they should not be saved. That's like 200,000 files in your case! Remember you can speed up most datalad operations with the -J switch, which sets the number of parallel jobs.

I'm sorry, I'm not sure I understand how to fix the situation and/or how to do this better next time. Let me lay out some possible lessons:

So if I had done things properly, I should have done this before actually creating the samples:

echo "samples/CHI_FEM/*" >> .gitignore
datalad save -m "ignore extracts folder" .gitignore

Sadly, that's not what I did, so now even doing datalad status is super slow because of the zillion files.

I can keep reading the manual, but if you already know a way in which I can fix my previous error, that would be really helpful!

lucasgautheron commented 3 years ago

I think the best way is the one you described: you can leave your samples in the dataset, but make sure you add a .gitignore file beforehand.

Now, in order to recover a clean dataset, assuming the chunks were added in the last commit, you can do:

git reset HEAD~1
echo "samples/CHI_FEM/chunks/*" >> .gitignore
datalad save -m "ignore extracts folder" .gitignore
datalad save "samples/CHI_FEM/" -m "adding samples"

(Something like this should work)

For further clean up, you should remove the dangling chunks from the annex as well (see https://git-annex.branchable.com/walkthrough/unused_data/)
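Following the walkthrough linked above, the cleanup would look something like this (run from the dataset root, and review the list before dropping anything):

```shell
git annex unused           # list annexed content no longer referenced by any branch
git annex dropunused 1     # drop item 1 from that list, or:
git annex dropunused all   # drop everything listed
```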

alecristia commented 3 years ago

Great, and to check whether that's the case, I can run git log -n 1 and look at the message of my last commit.

alecristia commented 3 years ago

It's the last mile! The last error is:

child-project zooniverse upload-chunks 'samples/CHI_FEM/random/' --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/samples.csv' --project-id 14957

yields:

usage: child-project [-h] {validate,import-annotations,merge-annotations,remove-annotations,rename-annotations,import-data,overview,compute-durations,convert,sampler,zooniverse,eaf-builder,anonymize} ...
child-project: error: unrecognized arguments: samples/CHI_FEM/random/

https://childproject.readthedocs.io/en/latest/zooniverse.html#chunk-upload

shows:

child-project zooniverse upload-chunks /path/to/dataset --help
usage: child-project zooniverse upload-chunks [-h] --chunks CHUNKS --project-id PROJECT_ID --set-name SET_NAME

I don't see my error, do you?

alecristia commented 3 years ago

The error was that the command should have been:

child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD

That did set the subject up in Zooniverse, but it didn't push any clips. Here is the output:

uploading chunk 1_CW5_CH5_AJ09_AJ10_190710.WAV (23668064,23668564)
Traceback (most recent call last):
  File "/scratch1/home/acristia/ChildProjectVenv/bin/child-project", line 11, in <module>
    load_entry_point('ChildProject==0.0.1', 'console_scripts', 'child-project')()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 311, in main
    args.func(args)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/cmdline.py", line 31, in <lambda>
    _parser.set_defaults(func = lambda args: cls().run(vars(args)))
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 371, in run
    return self.upload_chunks(**kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/ChildProject/pipelines/zooniverse.py", line 291, in upload_chunks
    subject.save()
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/subject.py", line 144, in save
    log_args=False,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/redo/__init__.py", line 170, in retry
    return action(*args, **kwargs)
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 815, in save
    etag=self.etag
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 404, in post
    retry=retry,
  File "/scratch1/home/acristia/ChildProjectVenv/lib64/python3.6/site-packages/panoptes_client/panoptes.py", line 281, in json_request
    json_response['errors']
panoptes_client.panoptes.PanoptesAPIException: User has uploaded 12778 subjects of 10000 maximum

And a snapshot of the Zooniverse subject section:

[screenshot of the Zooniverse subject section]

It looks like the error is that we exceeded the 10k subject quota.

So I tried again, this time specifying an amount:

child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD --amount 9999

Unfortunately, I get the same error:

uploading chunk 1_CW5_CH5_AJ09_AJ10_190710.WAV (23668064,23668564)
Traceback (most recent call last):
  [identical traceback to the previous comment]
panoptes_client.panoptes.PanoptesAPIException: User has uploaded 12778 subjects of 10000 maximum

lucasgautheron commented 3 years ago

My understanding is that the 10,000 quota is a limit for the whole project, and that you have to ask the administrators to have it increased. EDIT: you can add me as an administrator of the project so that I can ask Zooniverse's staff to increase the quota on your behalf.

alecristia commented 3 years ago

Certainly, we'll ask; in fact, we also need to ask whether we can bypass the beta phase (given that we already did it with our other project). But before we do that, I'd like to try out the interface with some sample data.

Is there a way in which I can push up just a few clips? I thought the --amount flag did that, in:

child-project zooniverse upload-chunks --set-name ac_20210430 --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' --project-id 14957 --zooniverse-login acristia --zooniverse-pwd MYPASSWORD --amount 9999

lucasgautheron commented 3 years ago

The --amount flag does exactly that; at least it should. But you've reached your project quota already, so even one clip (--amount 1) will be too much.
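Once the quota is sorted out, a minimal test upload would be the same command from earlier in the thread with --amount 1 (values reused from above; password elided):

```shell
child-project zooniverse upload-chunks \
    --set-name ac_20210430 \
    --chunks 'samples/CHI_FEM/random/chunks_20210430_112933.csv' \
    --project-id 14957 \
    --zooniverse-login acristia \
    --zooniverse-pwd MYPASSWORD \
    --amount 1
```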

alecristia commented 3 years ago

but there are no subjects -- so how can it think that we've gone over our quota?

Also, notice that in my screenshot, it says "The project has 0 uploaded subjects. You have uploaded 0 subjects from an allowance of 10000. Your uploaded subject count is the tally of all subjects (including those deleted) that your account has uploaded through the project builder or Zooniverse API. Please contact us to request changes to your allowance."

lucasgautheron commented 3 years ago

Weird! Could be because too many subjects were uploaded the first time and upload did not complete (because of the exception thrown by the API). So, there might be a bunch of dangling subjects uploaded with no subject set. (I don't know, I am really taking wild guesses.)

I'll try to see if there's a way to find invisible subjects like this. In any case, maybe try to have your quota increased and ask Zooniverse about this at the same time; IMO this should be considered a bug.

I realised I have access to your project, so I can take care of it. How urgent is this?

alecristia commented 3 years ago

not urgent, but if we could get a couple of subjects in there, so I can test the project's interface, that would unblock me to ask them for permission etc.

lucasgautheron commented 3 years ago

Well, I just managed to get chunks through, on the same project and subject set. Can you give it another try? Try low values for --amount (e.g. 1 to begin with).

alecristia commented 3 years ago

Note that this affects my account specifically (not the project): https://www.zooniverse.org/talk/18/2002495?comment=3266579&page=1

alecristia commented 3 years ago

This cheatsheet is outdated! Look at https://gin.g-node.org/LAAC-LSCP/zoo-campaign#comparing-zooniverse-annotations-with-other-annotations instead.