DavidNemeskey / cc_corpus

Tools for compiling corpora from Common Crawl
GNU Lesser General Public License v3.0

*.hu unrecognized #43

Closed (bbrancar closed this 1 year ago)

bbrancar commented 1 year ago

Hi, I'm working through the example data pull and ran into an issue right away. When I run get_indexfiles.py -q *.hu -o 2019/cc_index -l 2019_01.log -m 5 -c CC-MAIN-2019-04, I get the error "zsh: no matches found: *.hu". Any clarification would be appreciated. Thank you.

DavidNemeskey commented 1 year ago

* is a wildcard, and the shell expands it before the script ever sees it, the same way rm * expands * to the contents of the current directory (and subsequently deletes them). You should quote the argument to avoid that:

get_indexfiles.py -q "*.hu" -o 2019/cc_index -l 2019_01.log -m 5 -c CC-MAIN-2019-04

If that doesn't work, try single quotes ('*.hu'). Interestingly, I have never run into this problem with bash.
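
The difference between the shells is probably down to how they handle a pattern that matches nothing: by default, bash passes an unmatched pattern through unchanged, while zsh (with its default nomatch option) stops with "no matches found". The sketch below is only an illustration of quoted versus unquoted arguments, not part of cc_corpus; the scratch directory and the file names a.hu and b.hu are made up for the demonstration.

```shell
# Work in a throwaway directory so the demonstration touches nothing else.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Hypothetical files that happen to match the pattern.
touch a.hu b.hu

# Quoted: the program receives the literal string *.hu as one argument.
quoted=$(printf '%s ' "*.hu")

# Unquoted: the shell replaces the pattern with the matching file names first.
unquoted=$(printf '%s ' *.hu)

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

With matching files present, both bash and zsh expand the unquoted pattern, so the script would receive file names instead of the pattern. Quoting avoids both that and the zsh error.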

I will update the example in the README file.

DavidNemeskey commented 1 year ago

@bbrancar We have updated that script. At the moment it doesn't recognize * patterns; you can specify a TLD instead (I will update the documentation, but for now, run the script with -h to see the options). We plan to reintroduce certain patterns, so be sure to quote them when we do.