bitextor / bicleaner

Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
GNU General Public License v3.0

Allow bicleaner to read files from stdin (or gzip/xz compressed files) #10

Closed: lpla closed this issue 5 years ago

lpla commented 5 years ago

Hi. Is it possible to change the way argparse handles the input argument so that it also accepts stdin? (Or compressed files such as gzip or xz, passed through the existing path argument.)

mbanon commented 5 years ago

Can I have an example, please?

You can currently use compressed files with something like:

zcat corpus.gz | bicleaner_classify - output training.yaml

lpla commented 5 years ago

Are you sure? The code doesn't seem to do that, since the input is handled this way:

parser.add_argument('input', type=argparse.FileType('rt'), default=None, help="Tab-separated files to be classified")

I don't know whether this handles stdin when given default=None. Now that I am looking at the input-handling code, maybe just changing that to default=sys.stdin would work (I can't test it now, sorry; maybe tomorrow or Monday).
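A minimal sketch of the default= question, using only the standard library. The parser name demo_classify and the helpers build_parser and parses_without_input are hypothetical and not part of Bicleaner. As far as argparse's documented behavior goes, a positional argument without nargs is required, so its default is never consulted; the default only takes effect once the positional is made optional with nargs='?'.

```python
import argparse
import sys


def build_parser(optional_input: bool) -> argparse.ArgumentParser:
    """Build a toy parser (hypothetical, not Bicleaner's real one)."""
    parser = argparse.ArgumentParser(prog="demo_classify")
    if optional_input:
        # nargs='?' makes the positional optional, so default=sys.stdin
        # is used when no input argument is given.
        parser.add_argument("input", nargs="?",
                            type=argparse.FileType("rt"), default=sys.stdin)
    else:
        # Same shape as the line quoted above: the positional is required,
        # so default=None is never used.
        parser.add_argument("input",
                            type=argparse.FileType("rt"), default=None)
    return parser


def parses_without_input(parser: argparse.ArgumentParser) -> bool:
    """Return True if the parser accepts an empty command line."""
    try:
        parser.parse_args([])
        return True
    except SystemExit:
        # argparse reports the missing argument and exits with status 2.
        return False
```

Under these assumptions, changing only default= on the quoted line would not make the argument optional; adding nargs='?' alongside it would.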

mbanon commented 5 years ago

I've always been running Bicleaner as in my previous comment, so... 100% sure it works

mbanon commented 5 years ago

Hey @lpla, did you try it? Can we close this issue?

lpla commented 5 years ago

OK, yes, it works: if you pass '-' as the input argument, it reads text from the pipe. Thanks!
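For reference, a small sketch of why '-' works, assuming the input argument is declared with argparse.FileType('rt') as in the line quoted earlier. FileType maps the special string '-' to sys.stdin for read modes, and argparse treats a lone '-' as a positional value rather than an option. The function read_rows and the parser name demo_classify are hypothetical and only illustrate the mechanism; they are not Bicleaner code.

```python
import argparse


def read_rows(argv):
    """Parse argv and return the tab-separated fields of each input line."""
    parser = argparse.ArgumentParser(prog="demo_classify")
    # FileType('rt') opens a path for reading, or returns sys.stdin
    # when the argument is the single character '-'.
    parser.add_argument("input", type=argparse.FileType("rt"),
                        help="Tab-separated file to read, or '-' for stdin")
    args = parser.parse_args(argv)
    return [line.rstrip("\n").split("\t") for line in args.input]
```

With a parser like this, a decompressor such as zcat or xzcat can be piped in and '-' given as the input argument, which matches the command shown earlier in the thread.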