allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.72k stars 229 forks source link

Modify short form filter function #452

Closed andyjessen closed 2 years ago

andyjessen commented 2 years ago

Short form filter omits words with length of 10. Consider changing to be inclusive of 10.

dakinggg commented 2 years ago

Thanks for the contribution! Is this motivated by some real cases you've run into? I don't quite remember whether we had any particular reason for the current length constraint, or it was more of an arbitrary choice. I'm always a bit hesitant to make subtle changes to default behavior.

andyjessen commented 2 years ago

No. I don't have a real case. I noticed that the code doesn't match the comment above it. It also doesn't match the algorithm from the 2003 paper.

Short forms are considered valid candidates only if they consist of at most two words, their length is between two to ten characters, at least one of these characters is a letter, and the first character is alphanumeric.

dakinggg commented 2 years ago

I believe it actually still doesn't match the 2003 paper (which is fine). The paper says Short forms are considered valid candidates only if they consist of at most two words, their length is between 2 to 10 characters, at least one of these characters is a letter, and the first character is alphanumeric. We allow each individual word to be between length 2 and 10, vs the original that requires the whole sequence to be between length 2 and 10. In the absence of a compelling reason to change the default behavior, I'm inclined to just update the comment to reflect what the code does. Does that seem reasonable?

andyjessen commented 2 years ago

Yes. That sounds good.