Zeta-and-Company / pydistinto

pydistinto - a Python implementation of different measures of distinctiveness for contrastive text analysis

'scaling_results' function in calculate_simple.py breaks on Series with inf values #8

Open DanilSko opened 10 months ago

DanilSko commented 10 months ago

The scaling_results function (https://github.com/Zeta-and-Company/pydistinto/blob/970fd4cdb262b6bb8a1c27bd3643f2500de220e0/scripts/pipeline/calculate_simple.py#L101) passes the incoming pandas Series to scikit-learn's .fit_transform, which raises an error if the Series contains inf values. I don't know why these inf values occur in the first place, but I have at least one corpus on which the error reproduces (see the attached demo.zip, metadata CSV included; set contrast=generated in parameters.txt). Oddly, this corpus is a subset of a larger corpus on which the error does not reproduce; it only appears on the smaller subset. As a temporary workaround for my own purposes, I added Series = Series[~np.isinf(Series)] to the function, but that is only a stopgap.
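For anyone who wants to see the failure mode without the attached corpus, here is a minimal sketch of the error and of the stopgap wrapped in a function. It is not the code from calculate_simple.py: MinMaxScaler with its default range is an assumption standing in for whatever scaler scaling_results uses, scale_finite is a hypothetical helper name, and the scores are made up with an inf inserted by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler  # assumed scaler, see note above


def scale_finite(values: pd.Series) -> pd.Series:
    """Min-max scale the entries of `values` after dropping inf and -inf."""
    # Same filter as the stopgap described above.
    finite = values[~np.isinf(values)]
    scaler = MinMaxScaler()
    # scikit-learn expects a 2-D array: reshape to one column, then flatten.
    scaled = scaler.fit_transform(finite.to_numpy().reshape(-1, 1)).ravel()
    # Keep the original index so dropped entries can be identified later.
    return pd.Series(scaled, index=finite.index, name=values.name)


# Hypothetical scores; the inf is inserted by hand for illustration.
scores = pd.Series([0.5, 2.0, np.inf, -1.0], index=["a", "b", "c", "d"])

# Unfiltered input: scikit-learn rejects the inf with a ValueError.
try:
    MinMaxScaler().fit_transform(scores.to_numpy().reshape(-1, 1))
    raised = False
except ValueError:
    raised = True

# Filtered input: the entry "c" is dropped and the rest is scaled.
result = scale_finite(scores)
print(raised)
print(result)
```

Note that filtering shortens the Series, so any code that expects one score per feature would need to reindex or otherwise handle the dropped entries. It also does not explain where the inf values come from, which is still the open question here.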