The generic file-level tokenizer (tokenizers/file-level) has problems with deep hierarchies of project folders and their subfolders.
Let's say I have an input dataset of files for tokenization in "project-folder" (PATH_proj_paths=project-folder), with some of the files located in subfolders.
When I run python tokenizer.py folder, it does find all the files in subfolders; however, it then tries to tokenize the found filenames as if they were all in the root directory, so files that live in subfolders are reported as not found:
[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>
I am submitting a PR with a fix. (cc @pedromartins4)
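For illustration, here is a minimal sketch of how this kind of symptom can arise when walking a directory tree, and how keeping the directory part of each path avoids it. This is not the tokenizer's actual code: the function names and the subfolder layout (src/ and src/lib/) are made up for the example, and I am only assuming the traversal behaves like os.walk.

```python
import os
import tempfile


def list_files_buggy(root):
    """Join every discovered filename with the root only (hypothetical bug)."""
    found = []
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # The subfolder part of the path is dropped here.
            found.append(os.path.join(root, name))
    return found


def list_files_fixed(root):
    """Join every discovered filename with the directory it was found in."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found


def build_demo_tree(root):
    """Create a small nested layout; the subfolder names are hypothetical."""
    os.makedirs(os.path.join(root, "src", "lib"))
    for parts in (("test2.js",), ("src", "util.js"), ("src", "lib", "index.js")):
        with open(os.path.join(root, *parts), "w") as handle:
            handle.write("// demo file\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = os.path.join(tmp, "project-folder")
        build_demo_tree(project)
        for label, paths in (("buggy", list_files_buggy(project)),
                             ("fixed", list_files_fixed(project))):
            missing = [p for p in paths if not os.path.isfile(p)]
            print(label, "missing files:", len(missing))
```

In the buggy variant only the file that sits directly in the root resolves to a real path, which matches the pattern in the log above where test2.js starts normally but util.js and index.js are not found.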