allenai / scispacy

A full spaCy pipeline and models for scientific/biomedical documents.
https://allenai.github.io/scispacy/

`Found array with 0 sample(s)` #144

Closed · ibeltagy closed this issue 5 years ago

ibeltagy commented 5 years ago

Lucy's team ran into this bug during the hackathon:

>>> nlp("hydroxytryptophan")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-231-7e41c1b0131c> in <module>
----> 1 nlp("hydroxytryptophan")

//anaconda/envs/scispacy/lib/python3.6/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    393             if not hasattr(proc, "__call__"):
    394                 raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 395             doc = proc(doc, **component_cfg.get(name, {}))
    396             if doc is None:
    397                 raise ValueError(Errors.E005.format(name=name))

//anaconda/envs/scispacy/lib/python3.6/site-packages/scispacy/umls_linking.py in __call__(self, doc)
     85 
     86         mention_strings = [x.text for x in mentions]
---> 87         batch_candidates = self.candidate_generator(mention_strings, self.k)
     88 
     89         for mention, candidates in zip(doc.ents, batch_candidates):

//anaconda/envs/scispacy/lib/python3.6/site-packages/scispacy/candidate_generation.py in __call__(self, mention_texts, k)
    201         if self.verbose:
    202             print(f'Generating candidates for {len(mention_texts)} mentions')
--> 203         tfidfs = self.vectorizer.transform(mention_texts)
    204         start_time = datetime.datetime.now()
    205 

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, raw_documents, copy)
   1679 
   1680         X = super().transform(raw_documents)
-> 1681         return self._tfidf.transform(X, copy=False)
   1682 
   1683     def _more_tags(self):

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in transform(self, X, copy)
   1300         vectors : sparse matrix, [n_samples, n_features]
   1301         """
-> 1302         X = check_array(X, accept_sparse='csr', dtype=FLOAT_DTYPES, copy=copy)
   1303         if not sp.issparse(X):
   1304             X = sp.csr_matrix(X, dtype=np.float64)

//anaconda/envs/scispacy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    548                              " minimum of %d is required%s."
    549                              % (n_samples, array.shape, ensure_min_samples,
--> 550                                 context))
    551 
    552     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 53479)) while a minimum of 1 is required.
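For context, this "Found array with 0 sample(s)" error is what sklearn raises when `TfidfVectorizer.transform` is handed an empty list, which is what happens here: no entities are detected for "hydroxytryptophan", so `mention_strings` ends up empty. A minimal standalone sketch of that behaviour (the `char_wb` 3-gram settings below are only illustrative, not necessarily scispacy's real configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a small character n-gram vectorizer (settings chosen for illustration only).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
vectorizer.fit(["hydroxytryptophan", "serotonin"])

# One or more mention strings transform fine.
print(vectorizer.transform(["serotonin"]).shape)

# An empty list of mentions triggers the same ValueError seen in the traceback
# above (at least with the sklearn version shown there).
try:
    vectorizer.transform([])
except ValueError as err:
    print(err)  # Found array with 0 sample(s) ... while a minimum of 1 is required.
```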
dakinggg commented 5 years ago

Looks like the linking pipe crashes if no entities are found in the doc, which is pretty rare for the base detectors trained on MedMentions. I'll push a quick fix.
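A minimal sketch of the kind of guard being described, reusing the names visible in the traceback above (the actual fix that landed in scispacy may differ):

```python
def link_entities(doc, candidate_generator, k):
    """Hypothetical linking step with an early exit for docs without entities."""
    mentions = doc.ents
    if not mentions:
        # Nothing to link: skip candidate generation entirely, so the TF-IDF
        # vectorizer is never handed an empty list of mention strings.
        return doc
    mention_strings = [x.text for x in mentions]
    batch_candidates = candidate_generator(mention_strings, k)
    for mention, candidates in zip(mentions, batch_candidates):
        # ... attach the candidates to the mention, as the linker normally does
        pass
    return doc
```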