Mondego / SourcererCC

Sourcerer's Code Clone project
GNU General Public License v3.0

Bug with traversing subfolders in file-level tokenizer #8

Closed: jakubzitny closed this issue 5 years ago

jakubzitny commented 8 years ago

The generic file-level tokenizer (tokenizers/file-level) fails on projects with nested subfolders.

Let's say I have an input dataset of files for tokenization in "project-folder" (PATH_proj_paths=project-folder) and it looks like this:

$ tree project-folder
project-folder
|-- sub
|   |-- subsub
|   |   `-- index.js
|   `-- util.js
`-- test2.js

2 directories, 3 files

When I run python tokenizer.py folder, it does find all the files in the subfolders. However, it then tries to open each found file name as if it were located directly in the root directory:

[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>
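The log is consistent with file names from subfolders being joined to the root folder instead of to the folder they were found in. Below is a minimal sketch of that failure mode and one way to avoid it, assuming the traversal is done with os.walk. The tokenizer's actual code is not shown in this issue, so the function names here (list_files_buggy, list_files_fixed, make_example_tree) are hypothetical and only illustrate the idea.

```python
import os
import tempfile


def list_files_buggy(root):
    """Hypothetical reproduction: joins every file name to the root folder."""
    paths = []
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # The subfolder part (_dirpath) is dropped here.
            paths.append(os.path.join(root, name))
    return sorted(paths)


def list_files_fixed(root):
    """Joins every file name to the directory os.walk found it in."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def make_example_tree(base):
    """Recreates the project-folder layout from this issue under base."""
    root = os.path.join(base, "project-folder")
    os.makedirs(os.path.join(root, "sub", "subsub"))
    relative_files = (
        "test2.js",
        os.path.join("sub", "util.js"),
        os.path.join("sub", "subsub", "index.js"),
    )
    for rel in relative_files:
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("// placeholder\n")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        root = make_example_tree(base)
        for label, lister in (("buggy", list_files_buggy),
                              ("fixed", list_files_fixed)):
            for path in lister(root):
                # Print each candidate path and whether it really exists.
                print(label, os.path.relpath(path, base), os.path.exists(path))
```

In the buggy variant only test2.js resolves to an existing file, which matches the two "File not found" errors above. The fixed variant keeps the directory reported by os.walk, so all three paths exist.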

I am submitting a PR with a fix. (cc @pedromartins4)

jakubzitny commented 8 years ago

By the way, src/tokenizer-directory.py contains CR line endings, which produce noisy git diffs and other problems.
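One way to confirm which line endings a file contains is to count them in the raw bytes. This is a small sketch, not part of the repository; count_line_endings is a hypothetical helper name.

```python
def count_line_endings(data: bytes) -> dict:
    """Counts CRLF, bare CR, and bare LF line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair contains one CR and one LF, so subtract the pairs.
    bare_cr = data.count(b"\r") - crlf
    bare_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": bare_cr, "LF": bare_lf}


if __name__ == "__main__":
    import sys

    # Usage: python count_endings.py <path-to-file>
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as handle:
            print(count_line_endings(handle.read()))
```

Reading the file in binary mode matters here, because text mode in Python translates line endings by default and would hide the difference.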