blueprints-for-text-analytics-python / blueprints-text

Jupyter notebooks for our O'Reilly book "Blueprints for Text Analysis Using Python"
Apache License 2.0

ch08 - Summarizing Text Using Machine Learning #29

Open amscosta opened 8 months ago

amscosta commented 8 months ago

Hello. I am running the Jupyter notebook locally for the section "Summarizing Text Using Machine Learning". The code stops with an error when I try to apply the topN function:

     topN = lambda x: x <= np.ceil(compression_factor * x.max())
     train_df['summaryPost'] = train_df.groupby('ThreadID')['rank'].apply(topN)

(The code from sections 1.2 and 1.3 ran successfully, including !python -m spacy download en_core_web_sm and !pip install textdistance.) The failing line produces the following long error:

     ValueError                                Traceback (most recent call last)
     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\frame.py:11610, in _reindex_for_setitem(value, index)
       11609 try:
     > 11610     reindexed_value = value.reindex(index)._values
       11611 except ValueError as err:
       11612     # raised in MultiIndex.from_tuples, see test_insert_error_msmgs

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\series.py:4918, in Series.reindex(self, index, axis, method, copy, level, fill_value, limit, tolerance)
        4901 @doc(
        4902     NDFrame.reindex,  # type: ignore[has-type]
        4903     klass=_shared_doc_kwargs["klass"],
        (...)
        4916     tolerance=None,
        4917 ) -> Series:
     -> 4918     return super().reindex(
        4919         index=index,
        4920         method=method,
        4921         copy=copy,
        4922         level=level,
        4923         fill_value=fill_value,
        4924         limit=limit,
        4925         tolerance=tolerance,
        4926     )

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\generic.py:5360, in NDFrame.reindex(self, labels, index, columns, axis, method, copy, level, fill_value, limit, tolerance)
        5359 # perform the reindex on the axes
     -> 5360 return self._reindex_axes(
        5361     axes, level, limit, tolerance, method, fill_value, copy
        5362 ).__finalize__(self, method="reindex")

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\generic.py:5375, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
        5374 ax = self._get_axis(a)
     -> 5375 new_index, indexer = ax.reindex(
        5376     labels, level=level, limit=limit, tolerance=tolerance, method=method
        5377 )
        5379 axis = self._get_axis_number(a)

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\indexes\base.py:4279, in Index.reindex(self, target, method, level, limit, tolerance)
        4277     indexer, _ = self.get_indexer_non_unique(target)
     -> 4279 target = self._wrap_reindex_result(target, indexer, preserve_names)
        4280 return target, indexer

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\indexes\multi.py:2490, in MultiIndex._wrap_reindex_result(self, target, indexer, preserve_names)
        2489 try:
     -> 2490     target = MultiIndex.from_tuples(target)
        2491 except TypeError:
        2492     # not all tuples, see test_constructor_dict_multiindex_reindex_flat

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\indexes\multi.py:211, in names_compat.<locals>.new_meth(self_or_cls, *args, **kwargs)
         209     kwargs["names"] = kwargs.pop("name")
     --> 211 return meth(self_or_cls, *args, **kwargs)

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\indexes\multi.py:590, in MultiIndex.from_tuples(cls, tuples, sortorder, names)
         588     tuples = np.asarray(tuples._values)
     --> 590     arrays = list(lib.tuples_to_object_array(tuples).T)
         591 elif isinstance(tuples, list):

     File D:\blueprints-text\ch09ev\lib\site-packages\pandas\_libs\lib.pyx:2894, in pandas._libs.lib.tuples_to_object_array()

     ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'

The above exception was the direct cause of the following exception:

     TypeError                                 Traceback (most recent call last)
     ~\AppData\Local\Temp\ipykernel_9532\1469566075.py in ?()
     ----> 1 train_df['summaryPost'] = train_df.groupby('ThreadID')['rank'].apply(topN)

     D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\frame.py in ?(self, key, value)
        3946             # Column to set is duplicated
        3947             self._setitem_array([key], value)
        3948         else:
        3949             # set column
     -> 3950             self._set_item(key, value)

     D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\frame.py in ?(self, key, value)
        4139
        4140         Series/TimeSeries will be conformed to the DataFrames index to
        4141         ensure homogeneity.
        4142         """
     -> 4143         value = self._sanitize_column(value)
        4144
        4145         if (
        4146             key in self.columns

     D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\frame.py in ?(self, value)
        4863         # or through loc single_block_path
        4864         if isinstance(value, DataFrame):
        4865             return _reindex_for_setitem(value, self.index)
        4866         elif is_dict_like(value):
     -> 4867             return _reindex_for_setitem(Series(value), self.index)
        4868
        4869         if is_list_like(value):
        4870             com.require_length_match(value, self.index)

     D:\blueprints-text\ch09ev\lib\site-packages\pandas\core\frame.py in ?(value, index)
       11613         if not value.index.is_unique:
       11614             # duplicate axis
       11615             raise err
       11616
     > 11617         raise TypeError(
       11618             "incompatible index of inserted column with frame index"
       11619         ) from err
       11620     return reindexed_value

     TypeError: incompatible index of inserted column with frame index

amscosta commented 7 months ago

I made a typo in the title: the blueprint is from ch09, not ch08.
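
A possible explanation and workaround, offered as a sketch rather than a confirmed fix from the book's authors. The MultiIndex frames in the traceback suggest that groupby(...).apply(topN) returned a Series indexed by (ThreadID, original row label) instead of by the original row label alone. Recent pandas releases (2.0 and later, as far as I can tell) prepend the group key to the result index of apply by default, so the assignment back into train_df can no longer align. Two options that keep the original index are transform, or apply on a groupby created with group_keys=False. The toy train_df and the compression_factor value below are made up for illustration and are not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's train_df: two threads,
# each post carrying a rank within its thread.
train_df = pd.DataFrame({
    "ThreadID": [1, 1, 1, 2, 2],
    "rank": [1, 2, 3, 1, 2],
})
compression_factor = 0.3  # assumed value, for illustration only

# Same selection rule as in the question: keep the top-ranked posts.
topN = lambda x: x <= np.ceil(compression_factor * x.max())

# Option 1: transform returns a result aligned with the original index,
# so it can be assigned straight back as a new column.
train_df["summaryPost"] = train_df.groupby("ThreadID")["rank"].transform(topN)

# Option 2: keep apply, but ask pandas not to add the group key
# to the index of the result.
alt = train_df.groupby("ThreadID", group_keys=False)["rank"].apply(topN)

print(train_df)
```

Option 1 is probably the smaller change to the notebook, since only the method name differs from the original line. If the notebook was written against an older pandas version, pinning that version in the environment would be another route, though which version the book targets is not stated in this thread.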