HarikalarKutusu / cv-tbox-split-maker

Checks diversity in Mozilla Common Voice default or alternative splits for multiple versions and languages
Mozilla Public License 2.0

client_id key error #6

Open neouyghur opened 1 week ago

neouyghur commented 1 week ago

Hello,

I got a client_id key error while running algorithm_s5.py after creating the virtual environment as you suggested. However, I didn't get this error with my old environment, so I guess it might be related to the pandas version.

venv/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'client_id'

HarikalarKutusu commented 1 week ago

Hello Osman, I didn't know anybody else was using this repo, thanks for your interest :)

I try to keep my code on the latest possible requirements and cannot support older ones. Currently I use Python 3.12.4 and pandas 2.2.2, for example. It is my mistake not to enforce these in requirements.txt, of course :(

Please try to recreate the env with the latest versions using Python 3.12.x as base and see if it works. If it fails, please post a full error/stacktrace here. The one above does not show any clue about where in my code the problem arises.
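
To compare the old and new environments, it may help to print the interpreter and pandas versions from inside each venv. A minimal sketch using only the standard library; the function name `report_versions` is made up for illustration and is not part of this repo:

```python
import sys
from importlib import metadata


def report_versions(packages=("pandas",)):
    """Return a dict with the Python version and the installed version of each package."""
    info = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in each environment and posting both outputs would show whether the pandas versions actually differ.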

_PS: I'm currently working on another branch where I move the "experiments" directory out of the repo directory, as it became pretty huge now that I'm working with all languages, and add a delta_upgrade.py script to upgrade v18.0 to v19.0 with the v19.0 delta files, for example. I'm also moving my whole voice-AI related stack into a combined mono-repo, with many additions and improvements._

neouyghur commented 1 week ago

This is a very nice repo. It is beneficial for low-resource languages. I was surprised that only a few users gave a star to this repo.

I think the issue is related to deleted users. Please check the following output:

Re-splitting for 1 out of 1 corpora in 128 processes. Skipping 0 as they already exist.
0%| | 0/1 [00:01<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'client_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pkg/suse12/software/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 96, in corpora_creator_original
    df_corpus = remove_deleted_users(df_corpus)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/hobby/cv-tbox-split-maker/lib.py", line 98, in remove_deleted_users
    return df_val[ ~df_val["client_id"].isin(df_del["client_id"]) ]
  File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'client_id'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 213, in <module>
    main()
  File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 201, in main
    for result in pool.imap_unordered(
  File "/pkg/suse12/software/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 873, in next
    raise value
KeyError: 'client_id'

HarikalarKutusu commented 1 week ago

  1. Can you provide info about which dataset (language / version) you are dealing with, so I can run it on my side?
  2. Also, can you please check that the dataset is fully expanded and intact in your expanded dataset directory? Please check the "validated.tsv" file in particular. It seems like there is no "client_id" column in it?
  3. Did you pull latest changes from the repo recently? (I don't have versioning/tagging in that repo as it is always WIP)
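
Point 2 can be checked quickly without loading the whole file. A minimal sketch that reads only the header row of a TSV and reports which expected columns are missing; the function name, the example path and the tuple of expected columns are assumptions for illustration, not taken from this repo:

```python
import csv
from pathlib import Path


def missing_columns(tsv_path, expected=("client_id",)):
    """Return the expected column names that are absent from the TSV header row."""
    # "utf-8-sig" reads plain UTF-8 and also strips a byte-order mark if one is present.
    with Path(tsv_path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])  # an empty file yields an empty header
    return [name for name in expected if name not in header]


if __name__ == "__main__":
    # Hypothetical location; adjust to where the dataset was expanded.
    path = Path("validated.tsv")
    if path.exists():
        print(missing_columns(path) or "all expected columns present")
```

An empty result means the header contains every expected column, which would point away from "validated.tsv" as the cause.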

neouyghur commented 1 week ago

Hi,

Q1 Dataset version: 19, language: Uyghur.

Q2 I don't think it is related to validated.tsv. I can run your code if I comment out this line:

File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 96, in corpora_creator_original
    df_corpus = remove_deleted_users(df_corpus)

Q3 Yes, I did pull the latest version.

HarikalarKutusu commented 6 days ago

I can run your code if I comment out this line

OK then, you can live with it :) It is of little consequence. That step is meant for earlier datasets, where people withdrew their recordings in later releases. In v19, they do not exist anyway.
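
Since that step only matters when there are deleted users to filter out, one option is to make it a no-op otherwise. A hedged sketch, not the repo's implementation: the filtering expression is the one shown in the traceback above, while the function name, the two-argument signature, and the guess that the deleted-users table may be empty or lack a `client_id` column are assumptions:

```python
import pandas as pd


def remove_deleted_users_safe(df_val: pd.DataFrame, df_del: pd.DataFrame) -> pd.DataFrame:
    """Drop rows of df_val whose client_id appears in df_del, when that is possible."""
    if "client_id" not in df_val.columns:
        # A corpus without client_id suggests a damaged or misread TSV; fail loudly.
        raise KeyError("client_id column missing from the corpus dataframe")
    if df_del.empty or "client_id" not in df_del.columns:
        # Assumed case: no deleted users are recorded, so there is nothing to remove.
        return df_val
    return df_val[~df_val["client_id"].isin(df_del["client_id"])]
```

With guards like these, a missing column in the corpus still raises an error, while a missing or empty deleted-users table simply leaves the corpus unchanged.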

I'll test it with your dataset tomorrow.