allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/
Apache License 2.0

There is something wrong with Sentence segmentation #327

Closed Hui-zju closed 3 years ago

Hui-zju commented 3 years ago

import spacy

nlp = spacy.load('en_core_sci_sm')

print('the first sentence segmentation:')
doc1 = nlp('Positive for translocation or inversion events involving the ROS1 gene')
for i, sent in enumerate(doc1.sents):
    print(i)
    print(sent)

print('the second sentence segmentation:')
doc2 = nlp('Negative for translocation or inversion events involving the ALK gene')
for i, sent in enumerate(doc2.sents):
    print(i)
    print(sent)

The output is:

the first sentence segmentation:
0
Positive for translocation or inversion events
1
involving the ROS1 gene
the second sentence segmentation:
0
Negative for translocation or inversion events involving the ALK gene

The first result is obviously wrong: a single sentence has been split in two!

dakinggg commented 3 years ago

You have a couple of options that I know of to get different sentence segmentation. In general, nothing is going to be perfect. In particular, the default spacy sentence segmentation is based on the dependency parse and can certainly produce errors like the one you observed. FWIW, if you add a . at the end of the first example, it gets it right.

Options:

1) Check out the pysbd-based sentence segmentation pipe here: https://github.com/allenai/scispacy/blob/5df54e468c649e465b98ff6d924fa910eb3cb50c/scispacy/custom_sentence_segmenter.py#L12. You can add it with from scispacy.custom_sentence_segmenter import pysbd_sentencizer; nlp.add_pipe('pysbd_sentencizer', first=True)
2) You can use spacy's default rule-based sentencizer by nlp.add_pipe('sentencizer', first=True)
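For example, here is a minimal sketch of both options on the sentence from above (assuming en_core_sci_sm is installed; exact results may vary by model version):

import spacy
from scispacy.custom_sentence_segmenter import pysbd_sentencizer  # importing registers the 'pysbd_sentencizer' factory

text = 'Positive for translocation or inversion events involving the ROS1 gene'

# Option 1: pysbd-based segmentation, added before the parser
nlp_pysbd = spacy.load('en_core_sci_sm')
nlp_pysbd.add_pipe('pysbd_sentencizer', first=True)
print([sent.text for sent in nlp_pysbd(text).sents])

# Option 2: spacy's rule-based sentencizer, added before the parser
nlp_rule = spacy.load('en_core_sci_sm')
nlp_rule.add_pipe('sentencizer', first=True)
print([sent.text for sent in nlp_rule(text).sents])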

Hui-zju commented 3 years ago

Thanks for your reply! I will try these options.

SantoshGuptaML commented 3 years ago

With

nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe('pysbd_sentencizer', first=True)

Is the scispacy model being used at all, or is it just pysbd being used?

dakinggg commented 3 years ago

You can view the whole pipeline via nlp.pipeline. It is just adding the pysbd pipe for sentence segmentation; the rest of the scispacy model is still used.
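For example (a sketch; the exact component list depends on the model version):

import spacy
from scispacy.custom_sentence_segmenter import pysbd_sentencizer

nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe('pysbd_sentencizer', first=True)

# The scispacy components are all still present; pysbd only sets sentence boundaries.
print(nlp.pipe_names)
# something like: ['pysbd_sentencizer', 'tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner']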

SantoshGuptaML commented 3 years ago

For scispacy, nlp.pipeline gives

[('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f1a5969e3c0>),
 ('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x7f1a59754640>)]

Whereas regular spacy gives

[('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x7f821ef95e50>)]

So it looks like scispacy adds a custom attribute_ruler, but both scispacy and spacy use the same sentencizer? Does that sound right?

scispacy gives much better results than spacy for abstracts. Here's an example.
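The outputs below came from code along these lines (a sketch; the actual code is in the notebooks linked at the end):

import spacy

abstract = '...'  # the SPECTER abstract text shown below

for model_name in ('en_core_sci_md', 'en_core_web_sm'):
    nlp = spacy.load(model_name)
    print(model_name)
    for sent in nlp(abstract).sents:
        print(sent.text)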

en_core_sci_md:

Abstract  Our goal is to learn task-independent representations of academic papers.
Inspired by the recent success of pretrained Transformer language models across various NLP tasks, we use the Transformer model architecture as basis of encoding the input paper.
Existing LMs such as BERT, however, are primarily based on masked language modeling objective, only considering intra-document context and do not use any inter-document information.
This limits their ability to learn optimal document representations.
To learn high-quality documentlevel representations we propose using citations as an inter-document relatedness signal and formulate it as a triplet loss learning objective.
We then pretrain the model on a large corpus of citations using this objective, encouraging it to output representations that are more similar for papers that share a citation link than for those that do not.
Representation learning is a critical ingredient for natural language processing systems.
Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power.
For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks.
We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph.
Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning.
Additionally, to encourage further research on document-level models, we introduce SCIDOCS, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation.
We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
As the pace of scientific publication continues to increase, Natural Language Processing (NLP) tools that help users to search, discover and understand the scientific literature have become critical.
In recent years, substantial improvements in NLP tools have been brought about by pretrained neural language models (LMs) (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019).
While such models are widely used for representing individual words ∗ Equal contribution 1 https://github.com/allenai/specter or sentences, extensions to whole-document embeddings are relatively underexplored.
Likewise, methods that do use inter-document signals to produce whole-document embeddings (Tu et al., 2017; Chen et al., 2019) have yet to incorporate stateof-the-art pretrained LMs.
Here, we study how to leverage the power of pretrained language models to learn embeddings for scientific documents.
A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al., 2019)—does not result in accurate paper representations.
The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.
In this paper, we introduce a new method for learning general-purpose vector representations of scientific documents.
Our system, SPECTER, 2 incorporates inter-document context into the Transformer (Vaswani et al., 2017) language models (e.g., SciBERT (Beltagy et al., 2019)) to learn document representations that are effective across a wide-variety of downstream tasks, without the need for any task-specific fine-tuning of the pretrained language model.
We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective.
Unlike many prior works, at inference time, our model does not require any citation information.
This is critical for embedding new papers that have not yet been cited.
In experiments, we show that SPECTER’s representations substantially outperform the state

en_core_web_sm

Abstract  Our goal is to learn task-independent representations of academic papers.
Inspired by the recent success of pretrained Transformer language models across various NLP tasks, we use the Transformer model architecture as basis of encoding the input paper.
Existing LMs such as BERT, however, are primarily based on masked language modeling objective, only considering intra-document context and do not use any inter-document information.
This limits their ability to learn optimal document representations.
To learn high-quality documentlevel representations we propose using citations as an inter-document relatedness signal and formulate it as a triplet loss learning objective.
We then pretrain the model on a large corpus of citations using this objective, encouraging it to output representations that are more similar for papers that share a citation link than for those that do not.
Representation learning is a critical ingredient for natural language processing systems.
Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power.
For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks.
We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph.
Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning.
Additionally, to encourage further research on document-level models, we introduce SCIDOCS, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation.
We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
As the pace of scientific publication continues to increase, Natural Language Processing (NLP) tools that help users to search, discover and understand the scientific literature have become critical.
In recent years, substantial improvements in NLP tools have been brought about by pretrained neural language models (LMs) (Radford et al.,
2018; Devlin et al.,
2019; Yang et al.,
2019).
While such models are widely used for representing individual words ∗ Equal contribution 1 https://github.com/allenai/specter or sentences, extensions to whole-document embeddings are relatively underexplored.
Likewise, methods that do use inter-document signals to produce whole-document embeddings (Tu et al.,
2017; Chen et al.,
2019) have yet to incorporate stateof-the-art pretrained LMs.
Here, we study how to leverage the power of pretrained language models to learn embeddings for scientific documents.
A paper’s title and abstract provide rich semantic content about the paper, but, as we show in this work, simply passing these textual fields to an “off-the-shelf” pretrained language model—even a state-of-the-art model tailored to scientific text like the recent SciBERT (Beltagy et al.,
2019)—does not result in accurate paper representations.
The language modeling objectives used to pretrain the model do not lead it to output representations that are helpful for document-level tasks such as topic classification or recommendation.
In this paper, we introduce a new method for learning general-purpose vector representations of scientific documents.
Our system, SPECTER, 2 incorporates inter-document context into the Transformer (Vaswani et al.,
2017) language models (e.g., SciBERT (Beltagy et al.,
2019)) to learn document representations that are effective across a wide-variety of downstream tasks, without the need for any task-specific fine-tuning of the pretrained language model.
We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective.
Unlike many prior works, at inference time, our model does not require any citation information.
This is critical for embedding new papers that have not yet been cited.
In experiments, we show that SPECTER’s representations substantially outperform the state

I also tried the pysbd_sentencizer, but ran into an error getting it to work.

import spacy
import scispacy
from scispacy.custom_sentence_segmentater import pysbd_sentencizer
nlpSciMd = spacy.load("en_core_sci_md", disable = ['ner', 'parser', 'tagger', 'lemmatizer', 'attributeruler', 'tok2vec'])
nlpSciSm = spacy.load("en_core_sci_sm", disable = ['ner', 'parser', 'tagger', 'lemmatizer', 'attributeruler', 'tok2vec'])
# nlpSciLg = spacy.load("en_core_sci_lg", disable = ['ner', 'parser', 'tagger', 'lemmatizer'])
nlpSciMd.add_pipe('pysbd_sentencizer')
nlpSciSm.add_pipe('pysbd_sentencizer')

error

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-45556ac5415d> in <module>()
      1 import spacy
      2 import scispacy
----> 3 from scispacy.custom_sentence_segmentater import pysbd_sentencizer
      4 nlpSciMd = spacy.load("en_core_sci_md", disable = ['ner', 'parser', 'tagger', 'lemmatizer', 'attributeruler', 'tok2vec'])
      5 nlpSciSm = spacy.load("en_core_sci_sm", disable = ['ner', 'parser', 'tagger', 'lemmatizer', 'attributeruler', 'tok2vec'])

ModuleNotFoundError: No module named 'scispacy.custom_sentence_segmentater'

For convenience, here are the Colab notebooks where I tried the code:

scispacy

https://colab.research.google.com/drive/1EleinjhYDaqU3OYb4u1odSItEY7-KP4U?usp=sharing

spacy

https://colab.research.google.com/drive/1UCh65W-yEYZzOhWDrqL_ACKSbjxWXbGI?usp=sharing

pysbd_sentencizer

https://colab.research.google.com/drive/1jYetA7G4RdRHDGmXxl3ToSBBpzw6BE36?usp=sharing

Side note: in the first notebook you can see there's an error getting the small model to work.

dakinggg commented 3 years ago

If you disable everything and just add the sentencizer, you should end up with just the sentencizer, whether it is spacy or scispacy:

In [8]: nlp = spacy.load('en_core_sci_sm', disable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner'])

In [9]: nlp.pipeline
Out[9]: []

In [10]: nlp.add_pipe('sentencizer')
Out[10]: <spacy.pipeline.sentencizer.Sentencizer at 0x7faf4642b640>

In [11]: nlp.pipeline
Out[11]: [('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x7faf4642b640>)]

As for your error, it is just a typo: you need from scispacy.custom_sentence_segmenter import pysbd_sentencizer. Sorry about that; I've fixed the typo in my comment above.
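For completeness, a corrected version of your snippet (a sketch, using the same component names as in my example above):

import spacy
from scispacy.custom_sentence_segmenter import pysbd_sentencizer  # note: 'segmenter', not 'segmentater'

nlp_sci_md = spacy.load('en_core_sci_md',
                        disable=['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'parser', 'ner'])
nlp_sci_md.add_pipe('pysbd_sentencizer')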

dakinggg commented 3 years ago

Closing due to inactivity. Please reopen if you are still having issues.