apicrafter / metacrafter

Metadata and data identification tool and Python library. Identifies PII, common identifiers, language specific identifiers. Fully customizable and flexible rules
Apache License 2.0

Empty Results #28

Open superctj opened 4 months ago

superctj commented 4 months ago

Thank you for open-sourcing this package! I was wondering if the following behavior is expected when running metacrafter scan-file --format short world+City.csv:

Processing file /data/bird_sql/train_csv/world+City.csv

2024-07-03 02:21:56,613 - root - DEBUG - Start processing None

2024-07-03 02:21:56,632 - root - DEBUG - Processing 1000 records of None

2024-07-03 02:21:56,651 - root - DEBUG - Processing 2000 records of None

2024-07-03 02:21:56,670 - root - DEBUG - Processing 3000 records of None

2024-07-03 02:21:56,689 - root - DEBUG - Processing 4000 records of None

No results

The top-5 rows of the csv file are:

ID,Name,CountryCode,District,Population

1,Kabul,AFG,Kabol,1780000

2,Qandahar,AFG,Qandahar,237500

3,Herat,AFG,Herat,186800

4,Mazar-e-Sharif,AFG,Balkh,127800

I expected the CountryCode column to be recognized by metacrafter. Is there anything I am missing or doing wrong?

By the way, I found the message "Start processing None" confusing; it comes from this line, which sets fromfile to None. These debug messages could probably be made more informative.

ivbeg commented 4 months ago

@superctj This happened because, for some reason, the identification rules were not installed with the package. Rules are YAML files that are loaded when the tool starts. If the rules are not in the package directory, metacrafter uses a .metacrafter file to find them. You can point it to the rules path in the repository https://github.com/apicrafter/metacrafter

For example, my .metacrafter file looks like this:

rulepath:
  - /home/ibegtin/reps/metacrafter/rules
  - /home/ibegtin/reps/metacrafter-rules/rules

and it's located in the home dir.
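
A quick way to check whether such a file points at real directories is sketched below. This is only an illustration, not part of metacrafter: the helper names read_rule_paths and missing_rule_dirs are hypothetical, and the parser assumes the simple block-list layout shown above (a rulepath: key followed by "- path" items) rather than full YAML.

```python
from pathlib import Path


def read_rule_paths(config_text):
    """Collect the list items that follow a 'rulepath:' key.

    Hypothetical helper: handles only the block-list layout from the
    example above, not arbitrary YAML.
    """
    paths = []
    in_rulepath = False
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("rulepath:"):
            in_rulepath = True
        elif in_rulepath and line.startswith("-"):
            paths.append(line[1:].strip())
        else:
            in_rulepath = False  # another key ends the list
    return paths


def missing_rule_dirs(paths):
    """Return the configured paths that are not existing directories."""
    return [p for p in paths if not Path(p).expanduser().is_dir()]


if __name__ == "__main__":
    # Assumption from the thread: the file lives in the home directory.
    config_file = Path.home() / ".metacrafter"
    if not config_file.is_file():
        print(f"{config_file} not found")
    else:
        rule_paths = read_rule_paths(config_file.read_text(encoding="utf-8"))
        for path in missing_rule_dirs(rule_paths):
            print(f"rule directory does not exist: {path}")
```

If the script reports a missing directory, or the file itself is absent, that would be consistent with the "No results" output described above.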

The second rule path points to the metacrafter-rules repository https://github.com/apicrafter/metacrafter-rules
It is not yet a Python package, so you need to install it separately with the python setup.py install command, since some rules use additional Python code.

The final result for a CSV file with these top-5 rows should look like this: [image]

I will take a deeper look at why the rules were not installed, and will probably switch to updating the rules from the repository automatically on first launch.

As for the debug messages, you are right, they should be more polished. I will take a look at that too.

superctj commented 4 months ago

Thank you @ivbeg for the quick response! Looking forward to the new release of the package