allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Keyword extraction #30

Closed by emanuelevivoli 3 years ago

emanuelevivoli commented 3 years ago

Hello, and thanks for the awesome work you did! Scientific papers usually have a section listing the paper's keyphrases or keywords. I didn't see any section or property named KeyPhrases, KeyWords, or anything similar, so my question (assuming some of that information remains in the body_text property) is:

Do you have any method (or any advice) for extracting this data from the body_text?

I'd like to build a "Keyphrase dataset" from the S2ORC dataset.

Thanks for your help, Emanuele.

lucylw commented 3 years ago

Emanuele, we don't extract keywords or keyphrases explicitly. Not all papers have them, and when they are present, their format varies. In S2ORC, they sometimes wind up in the abstract or the first paragraphs of the body text. You can try using regexes to pull them out. That's my best recommendation for now. Good luck!
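
For anyone trying the regex route, here is a minimal sketch of what that could look like in Python. It is not part of S2ORC. The function names `find_keywords` and `keywords_from_paper` are hypothetical, the set of labels ("Keywords", "Key words", "Keyphrases", "Index Terms") is a guess at common formats, and the code assumes each paper is a dict whose `abstract` and `body_text` fields are lists of paragraph dicts with a `text` key. Adjust it to whatever your copy of the data actually looks like.

```python
import re

# Hypothetical pattern: a keyword-style label followed by ":" or a spaced
# " - ", then the rest of that line. The label list is an assumption.
KEYWORD_LINE = re.compile(
    r"\b(?:key\s*-?\s*words?|key\s*-?\s*phrases?|index\s+terms)"
    r"(?:\s*:\s*|\s+-\s+)"
    r"(?P<body>[^\n]+)",
    re.IGNORECASE,
)


def find_keywords(text):
    """Return the keywords listed after a keyword label in text, or []."""
    match = KEYWORD_LINE.search(text)
    if match is None:
        return []
    body = match.group("body")
    # Prefer ";" as the delimiter when present, since individual
    # keyphrases may themselves contain commas.
    delimiter = ";" if ";" in body else ","
    parts = (part.strip(" .") for part in body.split(delimiter))
    return [part for part in parts if part]


def keywords_from_paper(paper, max_body_paragraphs=3):
    """Scan the abstract and the first few body paragraphs of one paper.

    Assumes paper["abstract"] and paper["body_text"] are lists of dicts
    with a "text" key; this layout is an assumption of the sketch.
    """
    paragraphs = list(paper.get("abstract") or [])
    paragraphs += list(paper.get("body_text") or [])[:max_body_paragraphs]
    for paragraph in paragraphs:
        keywords = find_keywords(paragraph.get("text", ""))
        if keywords:
            return keywords
    return []
```

The pattern captures everything up to the end of the line, so when the keyword list is followed by more text in the same paragraph, the last item will include that trailing text. Expect to filter the results by hand or with extra heuristics before treating them as a dataset.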