Open neouyghur opened 1 week ago
Hello Osman, I didn't know anybody else was using this repo, thanks for your interest :)
I try to keep my code on the latest possible requirements and cannot support older ones. Currently I use Python 3.12.4 and pandas 2.2.2, for example. It is my mistake not to enforce these in requirements.txt, of course :(
Please try to recreate the env with the latest versions using Python 3.12.x as base and see if it works. If it fails, please post a full error/stacktrace here. The one above does not show any clue about where in my code the problem arises.
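A minimal sketch of how a script could warn about an outdated environment at startup, using only the standard library. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical and not part of this repo, and treating the versions mentioned above as hard minimums is an assumption.

```python
import sys
from importlib import metadata

# Versions named in this thread; treating them as minimums is an assumption.
MIN_PYTHON = (3, 12)
MIN_PANDAS = (2, 2, 2)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.2.2' into (2, 2, 2), ignoring non-numeric suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple[int, ...]) -> bool:
    """Compare an installed version string against a minimum version tuple."""
    return parse_version(installed) >= minimum


def check_environment() -> list[str]:
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append(f"Python {sys.version.split()[0]} is older than 3.12")
    try:
        pandas_version = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        problems.append("pandas is not installed")
    else:
        if not meets_minimum(pandas_version, MIN_PANDAS):
            problems.append(f"pandas {pandas_version} is older than 2.2.2")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Pasting the output of such a check into a bug report, together with the full stack trace, makes it easier to tell whether a failure is tied to the environment.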
_PS: I'm currently working on another branch where I move the "experiments" directory out of the repo directory, as it became pretty huge now that I'm working with all languages, and add a delta_upgrade.py script to upgrade, for example, v18.0 to v19.0 with the v19.0 delta files. I'll also be moving my whole voice-AI related stack into a combined mono-repo, with many additions and improvements._
This is a very nice repo. It is beneficial for low-resource languages. I was surprised that only a few users gave a star to this repo.
I think the issue is related to deleted users. Please check the following output:
Re-splitting for 1 out of 1 corpora in 128 processes. Skipping 0 as they already exist.
0%| | 0/1 [00:01<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
return self._engine.get_loc(casted_key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'client_id'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/pkg/suse12/software/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 96, in corpora_creator_original
df_corpus = remove_deleted_users(df_corpus)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/hobby/cv-tbox-split-maker/lib.py", line 98, in remove_deleted_users
return df_val[ ~df_val["client_id"].isin(df_del["client_id"]) ]
File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/frame.py", line 4102, in __getitem__
indexer = self.columns.get_loc(key)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/hobby/cv-tbox-split-maker/venv/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
raise KeyError(key) from err
KeyError: 'client_id'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 213, in <module>
main()
File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 201, in main
for result in pool.imap_unordered(
File "/pkg/suse12/software/Python/3.11.5-GCCcore-13.2.0/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
KeyError: 'client_id'
Hi,
Q1: The dataset version is 19 and the language is Uyghur.
Q2: I don't think it is related to validated.tsv. I can run your code if I comment out this line:
File "/workspace/hobby/cv-tbox-split-maker/algorithm_s5.py", line 96, in corpora_creator_original
df_corpus = remove_deleted_users(df_corpus)
Q3: Yes, I did pull the latest version.
On Mon, Sep 23, 2024 at 7:19 PM bozden wrote:
- Can you provide info about what dataset (language / version) you are dealing with - for me to run it on my side?
- Also, can you please check if that dataset is fully expanded and is intact in your expanded dataset directory? Especially check the "validated.tsv" file please. It seems like there is no "client_id" in it ???
- Did you pull latest changes from the repo recently? (I don't have versioning/tagging in that repo as it is always WIP)
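The check on "validated.tsv" asked for above can be sketched with the standard library alone. This is only an illustration: `missing_columns` is a hypothetical helper, not part of the repo, and the in-memory samples stand in for real TSV files.

```python
import csv
import io


def missing_columns(tsv_file, required=("client_id",)):
    """Return the required column names that are absent from the TSV header row."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required if name not in header]


# Tiny in-memory stand-ins for real files.
good = io.StringIO("client_id\tpath\tsentence\nabc\tclip1.mp3\thello\n")
bad = io.StringIO("path\tsentence\nclip1.mp3\thello\n")
empty = io.StringIO("")

good_result = missing_columns(good)    # nothing missing
bad_result = missing_columns(bad)      # header lacks client_id
empty_result = missing_columns(empty)  # no header at all

print(good_result, bad_result, empty_result)
```

For a file on disk, the same function can be passed a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the TSV to inspect.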
I can run your code if I comment out this line
OK then, live with it :) It is of little consequence. That step is meant for earlier datasets, where people had their recordings pulled from later dataset versions. In v19, those recordings do not exist anyway.
I'll test it with your dataset tomorrow.
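As an alternative to commenting out the call, the filter could be guarded so that it becomes a no-op when the column is absent. This is a sketch under assumptions, not the repo's actual code: `remove_deleted_users_safe` is a hypothetical name, the two-argument signature is invented for illustration, and the thread does not establish which of the two frames lacked `client_id`.

```python
import pandas as pd


def remove_deleted_users_safe(df_val: pd.DataFrame, df_del: pd.DataFrame) -> pd.DataFrame:
    """Drop rows of df_val whose client_id appears in df_del, when that is possible."""
    key = "client_id"
    # If either frame lacks the column there is nothing to filter on,
    # so return the input unchanged instead of raising KeyError.
    if key not in df_val.columns or key not in df_del.columns:
        return df_val
    return df_val[~df_val[key].isin(df_del[key])]


# Small made-up frames for illustration only.
validated = pd.DataFrame(
    {"client_id": ["a", "b", "c"], "path": ["1.mp3", "2.mp3", "3.mp3"]}
)
deleted = pd.DataFrame({"client_id": ["b"]})
no_column = pd.DataFrame()  # stand-in for a table without a client_id column

filtered = remove_deleted_users_safe(validated, deleted)
unchanged = remove_deleted_users_safe(validated, no_column)
```

Silently skipping the filter hides a malformed input, so logging a warning in the early-return branch may be preferable in practice.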
Hello,
I got a client_id KeyError while running algorithm_s5.py after I created the virtual environment as you suggested. However, I didn't get this error with my old environment, so I guess it might be related to the pandas version.