Voting classifier1.0 - Githubissues

greatdansby commented 7 years ago

Cleaned up version, at least closer to following the style guide

rspeer commented 7 years ago

Back to looking at this. I think it's closer to being ready to merge. I still have two complaints, one of which seriously affects users of the script, the other of which is mostly stylistic.

The subset_field option is crucial to being able to use this correctly, but it's finicky and poorly explained. The option should explain that it's a prefix on subset names that will be used as labels for classification. It should also be made more robust with how it matches the subset names, so it can match subsets even if there are spaces, colons, or differences in capitalization.

The stylistic thing is: short options, introduced with a single hyphen, are usually a single character. I didn't even realize argparse would allow you to add multi-character short options. I assume that this breaks the functionality where you can stack options together (like the command ls -l -t can be shortened to ls -lt).

The training data and testing data aren't optional anyway. I think they should just be positional arguments.

rspeer commented 7 years ago

This is still going to raise ValueError("Subset prefix must contain colon between prefix and value.") unnecessarily, in the case where there are different subset prefixes and one is a prefix of the other.

Instead of raising a ValueError, not finding the colon where you expect it should just not be a match.

For example, if you are matching a prefix of 'Label:' and a prefix of 'Label2:' also exists, this will unnecessarily raise a ValueError.

greatdansby commented 7 years ago

I don't think that is true. Line 89 inserts the colon in the case where it is missing, and before the subset_field is used to match against existing subsets. So if "label" was specified, only "Label:" and not "Label2:" would be captured.

rspeer commented 7 years ago

Ah, I see. So that ValueError is there in case of a logic error in the code, not a problem with the data.

rspeer commented 7 years ago

Looking pretty good then. The one other thing I notice is that the binary_rating_labeler is still in there, and appears to be kind of left over from my code; I don't think there's a way to use it with this code, so it should just be removed.

greatdansby commented 7 years ago

get_test_docs_from_file allows for a custom function (e.g. binary_rating_labeler), so if you imported the voting_classifier you could still use it, or any other custom labeling function. If you don't think it's worth keeping, I'll pull it.

rspeer commented 7 years ago

I think at this point the function would just be specific to the data -- this version of the script doesn't come with sample data that uses binary_rating_labeler, so let's not include it.

LuminosoInsight / sales-engineering-code

Voting classifier1.0 #29