Name-based type inference?

Zac-HD commented 1 year ago

👋 I just found this project, and --annotate-named-param reminds me of a weekend research project I did for the Hypothesis ghostwriter, analyzing a few hundred million arguments in a corpus of Python code: https://github.com/HypothesisWorks/hypothesis/issues/3311. The context is different enough between our projects that you might want to make your own decisions based on the table in that issue rather than just lifting code out of my PR (hereby offered under MIT).

Clauses like if name.startswith("is_") or name in BOOL_NAMES: aren't fully expressible with --annotate-named-param, so this might be worth a new --guess-common-names argument as well as a table of suggestions, e.g. "pattern is usually either str or re.Pattern[str], or sometimes bytes".

JelleZijlstra commented 1 year ago

Thanks for bringing this up! The data you gathered is definitely useful for a project like autotyping.

Zac-HD commented 1 year ago

Explicitly checking, would you accept a PR that adds a new --guess-common-names argument to autotyping for this?

JelleZijlstra commented 1 year ago

Yes

jakkdl commented 1 year ago

Should I only take the ones with a checkmark in the should guess? column - or have a separate "extended" flag that also includes the common names that "shouldn't" be guessed? (Or force the user to add those themselves with --annotate-optional and --annotate-named-param)

Should --guess-common-names also turn on guessing when the default is None, i.e. from

def foo(bar=None):
  ...

infer

def foo(bar: Optional[str] = None):
 ...

and/or should that be a separate flag, something like --guess-common-names-optionals

For non-bool defaults I don't see any reason to change anything since that's already inferred.

JelleZijlstra commented 1 year ago

Thanks for your work! I'd be curious to hear @Zac-HD's opinions too, but here are my reactions:

I'd rather not add a separate "extended" flag. Instead, maybe the flag could take an argument somehow: --guess-common-names is equivalent to --guess-common-names=1, and --guess-common-names=2 (or --guess-common-names --guess-common-names) adds more speculative types.
Arguments with None defaults should automatically get the Optional version of the type.

Zac-HD commented 1 year ago

I would not guess more speculative types; I think they're wrong too often to be useful. And instead of looking at the table, just grab the code that I wrote:https://github.com/HypothesisWorks/hypothesis/pull/3313/files#diff-9cc5e151a0e8c7f741d59c6406f5c0ac31f6284b29072f1dc264255a2dab5d91 (noting that it distinguishes some strategies that are the same type).

jakkdl commented 1 year ago

Cool - implemented it and pushed to #48, though leaving it as draft as there's a couple ones I'm unsure about:

Should generics be typed at all? They're almost always gonna need manual intervention anyway, so I'm unsure about the value of adding List as a type. List[Any] is ofc also an option which one could later weed out with e.g. mypy's --disallow-any-generics.
name in ("pred", "predicate")
- typing.Callable doesn't accept just setting the return value, so [if we're annotating generics] I'm just annotating this as a generic Callable.
name in ("amount", "threshold", "number", "num")
- GhostWrites types these as int|float, guessing I should just leave them alone?
"uuid" in name
- GhostWriter returns st.uuids().map(str) - is that just a str?

JelleZijlstra commented 1 year ago

typing.Callable doesn't accept just setting the return value, so [if we're annotating generics] I'm just annotating this as a generic Callable.

It does, Callable[..., ReturnType]. But this has the same problems as with using List.

Zac-HD commented 1 year ago

Should generics be typed at all? They're almost always gonna need manual intervention anyway, so I'm unsure about the value of adding List as a type.

Let's just skip them then, the basics are remarkably helpful already and probably have a lower error rate.

jakkdl commented 1 year ago

well... I got feeling and wrote inferences for handling stuff like int_list and list_of_widths. But everything that can't be fully inferred is skipped. And more complicated generics like dict or callable are skipped.

Skipping uuid too - it seems to mostly be typed as str - but uuid.UUID or a hexadecimal int feels common enough that it should be skipped.

JelleZijlstra / autotyping

Name-based type inference? #46