attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0

Cannot turn off --html-safe command line option (true by default) #299

Open adno opened 1 year ago

adno commented 1 year ago

Due to a bug, the only way to turn off the --html-safe command line option is to pass an empty argument (which evaluates as false in Python), like this:

wikiextractor --html-safe ""

Neither of the following works:

wikiextractor --no-html-safe
wikiextractor --html-safe false

The argument is currently defined like this:

https://github.com/attardi/wikiextractor/blob/f0ca16c3e92983b9094b6f32526992fc3a678f8f/wikiextractor/WikiExtractor.py#L560-L561

This means that any value passed to the option is stored as a string, which evaluates as true unless it is empty. One simple way to correctly define a boolean argument that defaults to true would be:

parser.add_argument("--html-safe", default=True, action=argparse.BooleanOptionalAction,
                        help="use to produce HTML safe output within <doc>...</doc>")

This way the parser would accept both --html-safe and --no-html-safe and also generate appropriate help.
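To make the difference concrete, here is a minimal standalone sketch comparing the two definitions. The "buggy" parser is an assumption about the shape of the linked lines (an option with default=True and no type or action), and the helper names build_buggy_parser and build_fixed_parser are hypothetical, not part of wikiextractor. Note that argparse.BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse

HELP = "use to produce HTML safe output within <doc>...</doc>"


def build_buggy_parser():
    # Assumed shape of the current definition: no type or action,
    # so whatever follows --html-safe is stored as a plain string.
    parser = argparse.ArgumentParser(prog="wikiextractor")
    parser.add_argument("--html-safe", default=True, help=HELP)
    return parser


def build_fixed_parser():
    # Proposed definition: registers both --html-safe and --no-html-safe.
    parser = argparse.ArgumentParser(prog="wikiextractor")
    parser.add_argument("--html-safe", default=True,
                        action=argparse.BooleanOptionalAction, help=HELP)
    return parser


buggy = build_buggy_parser()
# The string "false" is non-empty, hence truthy.
print(bool(buggy.parse_args(["--html-safe", "false"]).html_safe))  # True
# Only the empty string is falsy.
print(bool(buggy.parse_args(["--html-safe", ""]).html_safe))       # False

fixed = build_fixed_parser()
print(fixed.parse_args([]).html_safe)                  # True (default)
print(fixed.parse_args(["--html-safe"]).html_safe)     # True
print(fixed.parse_args(["--no-html-safe"]).html_safe)  # False
```

With the second definition, `--html-safe false` would no longer be accepted, since the option takes no value; users would switch to `--no-html-safe` instead.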