allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0
1.68k stars 225 forks source link

Modify short form filter function #452

Closed andyjessen closed 1 year ago

andyjessen commented 1 year ago

Short form filter omits words with length of 10. Consider changing to be inclusive of 10.

dakinggg commented 1 year ago

Thanks for the contribution! Is this motivated by some real cases you've run into? I don't quite remember whether we had any particular reason for the current length constraint, or it was more of an arbitrary choice. I'm always a bit hesitant to make subtle changes to default behavior.

andyjessen commented 1 year ago

No. I don't have a real case. I noticed that the code doesn't match the comment above it. It also doesn't match the algorithm from the 2003 paper.

Short forms are considered valid candidates only if they consist of at most two words, their length is between two to ten characters, at least one of these characters is a letter, and the first character is alphanumeric.

dakinggg commented 1 year ago

I believe it actually still doesn't match the 2003 paper (which is fine). The paper says Short forms are considered valid candidates only if they consist of at most two words, their length is between 2 to 10 characters, at least one of these characters is a letter, and the first character is alphanumeric. We allow each individual word to be between length 2 and 10, vs the original that requires the whole sequence to be between length 2 and 10. In the absence of a compelling reason to change the default behavior, I'm inclined to just update the comment to reflect what the code does. Does that seem reasonable?

andyjessen commented 1 year ago

Yes. That sounds good.