Regex - Githubissues

datacamp / course-resources-ml-with-experts-budgets

Further student resources for DrivenData's 'Machine Learning with the Experts: School Budgets' DataCamp course.

MIT License


Open stedes opened 6 years ago

stedes commented 6 years ago

I think the regular expression is wrong.

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\s+)'

Doesn't this mean that a token is only matched if it consists solely of alphanumeric characters and is immediately followed by whitespace?

Example: WORD1,WORD2, WORD3, WORD4 Word5

In the sentence above, only WORD4 would be matched as a token (Word5 would match too if it were followed by trailing whitespace). The other words are immediately followed by a comma, so they are not valid tokens.
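The behaviour described above can be checked with Python's standard `re` module. This is only a minimal sketch: it assumes the pattern is applied with `re.findall` to the raw example string, and the trailing-space variant is a hypothetical addition, not something from the course material.

```python
import re

# Pattern quoted in this issue: a run of alphanumeric characters that
# must be immediately followed by at least one whitespace character.
TOKENS_ALPHANUMERIC = r'[A-Za-z0-9]+(?=\s+)'

text = "WORD1,WORD2, WORD3, WORD4 Word5"

# WORD1..WORD3 are each followed by a comma, and Word5 by the end of the
# string, so the lookahead fails for all of them.
tokens = re.findall(TOKENS_ALPHANUMERIC, text)
print(tokens)  # ['WORD4']

# Hypothetical variant: the same text with one trailing space appended,
# which gives the lookahead something to match after Word5.
tokens_trailing = re.findall(TOKENS_ALPHANUMERIC, text + " ")
print(tokens_trailing)  # ['WORD4', 'Word5']
```

Whether the last word is picked up therefore depends on whether the input string ends in whitespace.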

tab1tha commented 4 years ago

I think all the WORDS will be considered as tokens. The first four will be split on the comma, and the fifth will be split on the whitespace.
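For comparison, here is a sketch of a pattern that does treat every word as a token, as this comment describes. It drops the lookahead entirely, so commas and spaces both act as separators. The name `TOKENS_NO_LOOKAHEAD` is hypothetical and does not appear in the course code.

```python
import re

text = "WORD1,WORD2, WORD3, WORD4 Word5"

# Hypothetical alternative: match any alphanumeric run, with no
# requirement on the character that follows it.
TOKENS_NO_LOOKAHEAD = r'[A-Za-z0-9]+'

all_words = re.findall(TOKENS_NO_LOOKAHEAD, text)
print(all_words)  # ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'Word5']
```

The original pattern requires whitespace after each token and never splits on commas, so it returns fewer tokens on the same input.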