I'd recommend that you remove the last three lines of the current file and replace "file_text" below with a string that exercises the major cases of dictionary extraction:
- Match at the beginning of the string, at the end of the string, or in the middle of the string
- One-token match and multi-token match
- Non-match that shares the first token (and only the first token) with a two-token dictionary entry
- Two overlapping matches
You'll also want to exercise case-insensitivity of the dictionary matching.
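As a rough illustration of the cases listed above, here is a minimal sketch of a candidate target string together with a naive reference matcher that can be used to work out the expected spans by hand. Everything in it is an assumption made for illustration: `FILE_TEXT`, `DICT_ENTRIES`, and `naive_dict_matches` are hypothetical names, tokens are approximated with `\w+`, and the matcher is not the library's dictionary extraction code.

```python
import re

# Hypothetical target string and dictionary entries (illustration only).
FILE_TEXT = "Red wine from New York City pairs with new potatoes under a Blue Moon"
DICT_ENTRIES = ["red", "new york", "york city", "blue moon"]


def naive_dict_matches(text, entries):
    """Return sorted (begin, end, matched_text) tuples for every
    case-insensitive, token-aligned dictionary match in `text`."""
    # Approximate tokenization: each run of word characters is one token.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    lowered = [text[b:e].lower() for b, e in tokens]
    results = []
    for entry in entries:
        words = entry.lower().split()
        n = len(words)
        # Slide a window of n tokens over the text and compare to the entry.
        for i in range(len(tokens) - n + 1):
            if lowered[i:i + n] == words:
                begin = tokens[i][0]
                end = tokens[i + n - 1][1]
                results.append((begin, end, text[begin:end]))
    return sorted(results)


matches = naive_dict_matches(FILE_TEXT, DICT_ENTRIES)
# Matched texts, in order: "Red", "New York", "York City", "Blue Moon"
print([m[2] for m in matches])
```

In this string, "Red" covers the match at the beginning, the one-token match, and case-insensitivity; "New York" and "York City" are overlapping multi-token matches in the middle; "new potatoes" shares only its first token with the entry "new york" and must not match; "Blue Moon" is the match at the end.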
I think the location of this file is an anachronism. Would you mind moving it to test_data/spanner?
Comments on test_extract_regex_tok:
As with the dictionary test, it would be useful to have a target string that contains the main types of regex match -- matches at the beginning, middle, or end of the string; partial matches; substrings that would be matches except they don't start or end on a token boundary.
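The regex cases described above could be covered by a target string along these lines. This is only a sketch under stated assumptions: `TARGET_TEXT`, `PATTERN`, and `naive_regex_tok_matches` are hypothetical names, tokens are approximated with `\w+`, and the helper simply keeps regex matches whose begin and end offsets coincide with token boundaries. It is meant for deriving expected results by hand, not as a description of how the function under test is implemented.

```python
import re

# Hypothetical target string and pattern (illustration only).
TARGET_TEXT = "test1 passed, retest2 failed, test3x skipped, test5 ran, test alone, test4"
PATTERN = r"test\d"


def naive_regex_tok_matches(text, pattern):
    """Return (begin, end, matched_text) for regex matches that start
    and end exactly on token boundaries."""
    # Approximate tokenization: each run of word characters is one token.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    token_begins = {b for b, _ in tokens}
    token_ends = {e for _, e in tokens}
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(pattern, text)
        if m.start() in token_begins and m.end() in token_ends
    ]


hits = naive_regex_tok_matches(TARGET_TEXT, PATTERN)
# Matched texts, in order: "test1", "test5", "test4"
print([h[2] for h in hits])
```

Here "test1", "test5", and "test4" are the matches at the beginning, middle, and end of the string. The bare "test" is a partial match that the pattern rejects, "retest2" contains a match that does not start on a token boundary, and "test3x" contains one that does not end on a token boundary.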
Currently only a simple test case exists. As per comments at https://github.com/CODAIT/text-extensions-for-pandas/pull/83#discussion_r474330942, more tests need to be added to exercise the function completely.
The comments quoted in full above are Fred's comments on text_extract_dict and test_extract_regex_tok.