bmabey / pyLDAvis

Python library for interactive topic model visualization. Port of the R LDAvis package.
BSD 3-Clause "New" or "Revised" License

TypeError: (-0.0025023526479494543+0j) is not JSON serializable , with sklearn & tfidf dtm #69

Open subhrm opened 8 years ago

subhrm commented 8 years ago

First of all thanks to the creator and all the contributors of this amazing module.

Today I ran into this issue. I was following the example sklearn notebook and successfully got the visualization for an LDA model with a tf (CountVectorizer) dtm.

But when I tried to use TfidfVectorizer, I got the error below. My code snippet and the stack trace follow.

pyLDAvis.sklearn.prepare(lda_tfidf, tfidf, tfidf_vectorizer, R=10,sort_topics=False)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\IPython\core\formatters.py in __call__(self, obj)
    337                 pass
    338             else:
--> 339                 return printer(obj)
    340             # Finally look for special method names
    341             method = _safe_get_formatter_method(obj, self.print_method)

C:\Anaconda3\lib\site-packages\pyLDAvis\_display.py in <lambda>(data, kwds)
    311     formatter = ip.display_formatter.formatters['text/html']
    312     formatter.for_type(PreparedData,
--> 313                        lambda data, kwds=kwargs: prepared_data_to_html(data, **kwds))
    314 
    315 

C:\Anaconda3\lib\site-packages\pyLDAvis\_display.py in prepared_data_to_html(data, d3_url, ldavis_url, ldavis_css_url, template_type, visid, use_http)
    176                            d3_url=d3_url,
    177                            ldavis_url=ldavis_url,
--> 178                            vis_json=data.to_json(),
    179                            ldavis_css_url=ldavis_css_url)
    180 

C:\Anaconda3\lib\site-packages\pyLDAvis\_prepare.py in to_json(self)
    414 
    415     def to_json(self):
--> 416        return json.dumps(self.to_dict(), cls=NumPyEncoder)

C:\Anaconda3\lib\json\__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    235         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    236         separators=separators, default=default, sort_keys=sort_keys,
--> 237         **kw).encode(obj)
    238 
    239 

C:\Anaconda3\lib\json\encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

C:\Anaconda3\lib\json\encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

C:\Anaconda3\lib\site-packages\pyLDAvis\utils.py in default(self, obj)
    144         if isinstance(obj, np.float64) or isinstance(obj, np.float32):
    145             return float(obj)
--> 146         return json.JSONEncoder.default(self, obj)

C:\Anaconda3\lib\json\encoder.py in default(self, o)
    178 
    179         """
--> 180         raise TypeError(repr(o) + " is not JSON serializable")
    181 
    182     def encode(self, o):

TypeError: (-0.0025023526479494543+0j) is not JSON serializable

Any help to resolve this would be much appreciated.

I am also trying to find a resolution for this issue, and if I manage to resolve it on my own, I will let you know.

bmabey commented 8 years ago

For some reason there is a complex number in your matrix. To fix the issue the NumPyEncoder would need to be extended to handle complex numbers: https://github.com/bmabey/pyLDAvis/blob/master/pyLDAvis/utils.py#L140-L146

I'm not sure what the best way to handle them would be though. My first thought was to only take the real part. So, we could either do that or you could do the same before sending it in to pyLDAvis.

bmabey commented 8 years ago

@subhrm did you ever resolve your problem? If we think complex numbers are going to be a common issue I would merge in a PR that extends the encoder as mentioned above.

subhrm commented 8 years ago

@bmabey No, I have not been able to resolve it. A couple of my colleagues are also getting the same error message with different corpora.

I am now trying to figure out a way to convert those complex numbers to real ones, either by dropping the imaginary part or by calculating and keeping their magnitude.

But if you or any other contributor could make a change in the pyLDAvis code base that works around this issue, it would be great!

Thanks, Subhendu

DontUseThisCodeInProduction commented 7 years ago

I ran into the JSON serializable problem when calling pyLDAvis.show() - same issue with a failure in NumPyEncoder.

I was able to control the problem based on how many topics (num_topics) I used when creating the LDA model - gensim.models.ldamodel.LdaModel. If I set the number of topics to 10 or more the problem occurred; 9 or fewer and it did not. Maybe this is based on the corpus I used.

I ended up modifying NumPyEncoder to return abs() when it encountered a complex number. I'm not an expert on these codebases so I don't know what the side effects of this are, but the visualization was able to run after I did this.

And finally, pyLDAvis is a sweet, sweet module. Very useful.

krageon commented 7 years ago

I ran into the same issue. Editing pyLDAvis/utils.py and adding

        if np.iscomplexobj(obj):
            return abs(obj)

to the ifs in NumPyEncoder, making it

class NumPyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64) or isinstance(obj, np.int32):
            return int(obj)
        if isinstance(obj, np.float64) or isinstance(obj, np.float32):
            return float(obj)
        if np.iscomplexobj(obj):
            return abs(obj)
        return json.JSONEncoder.default(self, obj)

solved the issue for me (or at least it will actually display something now).

sohomghosh commented 7 years ago

Thanks @krageon and @bmabey . Editing utils.py in the way you mentioned works!

ghost commented 7 years ago

I changed the pyLDAvis/utils.py and included

class NumPyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64) or isinstance(obj, np.int32):
            return int(obj)
        if isinstance(obj, np.float64) or isinstance(obj, np.float32):
            return float(obj)
        if np.iscomplexobj(obj):
            return abs(obj)
        return json.JSONEncoder.default(self, obj)

I still get an error when I run the code on ipython notebook - TypeError: 0j is not JSON serializable

krageon commented 7 years ago

I think I originally divined what was going on by using a Python debugger (https://docs.python.org/3/library/pdb.html) and breaking in this function (on return json.JSONEncoder.default(self, obj)) - then you can perform some tests on the obj in question to see what it is that is in here - that should provide some insight in how to fix it.
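A lighter-weight alternative to stepping through with pdb might be a small encoder subclass that reports what it could not serialize before failing as usual. The sketch below uses only the standard library. `DiagnosticEncoder` is a hypothetical name, not part of pyLDAvis, and the sample payload is only a stand-in for the prepared data.

```python
import json


class DiagnosticEncoder(json.JSONEncoder):
    """Hypothetical helper: report what the encoder could not serialize."""

    def default(self, obj):
        # default() is only called for objects json cannot encode on its own.
        cls = type(obj)
        print("unserializable:", cls.__module__, cls.__name__, repr(obj))
        # Defer to the base class, which raises TypeError exactly as before.
        return super().default(obj)


# Stand-in payload containing a plain Python complex, like the `0j` in the error.
payload = {"x": [0.1, 0j]}

try:
    json.dumps(payload, cls=DiagnosticEncoder)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The printed module and class name show whether the offending value is a built-in `complex` or a NumPy scalar, which is the information needed to decide which `isinstance` branch the encoder is missing.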

ghost commented 7 years ago

Can you help me out? I am still struggling with this error and am not able to get usable visualization results for my LDA analysis.

krageon commented 7 years ago

Did you follow the steps I outlined in my last post? I might be able to tell you something about what's going on with that information.

bmabey commented 7 years ago

It appears that enough people are running into this so we should merge a fix into the library. Will someone send me a PR with a fix that worked for them?

krageon commented 7 years ago

If the proposed change (taking the absolute value whenever a complex number reaches that function) doesn't misrepresent the data horrifically, I think that can be arranged.

bmabey commented 7 years ago

I can't really answer that question since I've never run into a case where this was required. Do you have any idea why complex numbers are showing up in the first place?

krageon commented 7 years ago

I think there was an sqrt somewhere, and the number is negative. Why all that is happening isn't something I'm comfortable answering - it's been a very long time since math class. I seem to recall this problem occurred when I had my topics set to a high number (say, 150-ish) and tried to visualise.

seanlane commented 7 years ago

Ran into this issue while debugging with a colleague, but you may want to double check your data if you're using the built-in method abs to correct the issue, like @krageon described. That will return the magnitude of the number, which will always be positive.

In our case, the data we had included negative values, so we used the following code instead to return the real part, as opposed to the magnitude:

class NumPyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64) or isinstance(obj, np.int32):
            return int(obj)
        if isinstance(obj, np.float64) or isinstance(obj, np.float32):
            return float(obj)
        if np.iscomplexobj(obj):
            return np.real(obj)
        return json.JSONEncoder.default(self, obj)

Also in our data, the imaginary part of each number appeared to be zero, so I believe this was the correct action for us. That said, if your data contains a nonzero imaginary part, then taking the magnitude might be better. I'm not familiar enough with the library to understand what is occurring at that moment, but just a heads up for anyone else coming across this.
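To make the difference concrete, here is a minimal sketch using the value from the original traceback. It relies only on built-in Python behaviour, and the variable names are illustrative.

```python
# The value from the original traceback: negative real part, zero imaginary part.
z = complex(-0.0025023526479494543, 0.0)

magnitude = abs(z)   # magnitude is never negative, so the sign is lost
real_part = z.real   # keeps the sign, discards the imaginary part

print(magnitude)
print(real_part)
```

With a zero imaginary part the two results differ only in sign, which is why `abs` can silently flip a coordinate while taking the real part does not.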

krageon commented 7 years ago

The points you make are good - this is exactly what I meant when I said "If the proposed change doesn't misrepresent the data horrifically". There will be some cases where it does, because data is being lost.

Whether or not that is the right or the wrong data to lose is not a call I can make for the general case. This is why I have not made a PR, and why I'm not comfortable making one until I either have time to brush up on the source material or someone with a strong theoretical grounding presents a good argument either way.
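One way to make that loss explicit, rather than deciding it for the general case, is to drop the imaginary part only when it is negligible and to fail loudly otherwise. The sketch below is an assumption about how that could look. `drop_negligible_imag` and the default tolerance are hypothetical and not part of pyLDAvis.

```python
def drop_negligible_imag(values, tol=1e-9):
    """Hypothetical helper: return the real parts of `values`.

    Raises ValueError if any imaginary part is larger than `tol`, so that
    meaningful data is never discarded silently.
    """
    worst = max((abs(complex(v).imag) for v in values), default=0.0)
    if worst > tol:
        raise ValueError(f"imaginary part {worst!r} exceeds tolerance {tol!r}")
    return [complex(v).real for v in values]


# Illustrative coordinates: the imaginary parts are zero or floating-point noise.
coords = [complex(-0.0025, 0.0), complex(0.013, 1e-17)]
cleaned = drop_negligible_imag(coords)
print(cleaned)
```

For NumPy arrays, `np.real_if_close` performs a similar check, returning the real part only when every imaginary part is close to zero and leaving the array complex otherwise.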

zdenekhynek commented 7 years ago

I had a similar problem and tracked it down: the complex numbers were coming from the topic coordinates calculation.

What worked for me was not to rely on the default js_PCoA mds function and use mmds instead.

pyLDAvis.gensim.prepare(lda_model, corpus, dictionary, mds='mmds')

Tbh, it is very very well possible that I'm doing something wrong and my 'solution' just masks the initial problem.

ahmetctoker commented 5 years ago

What helped me was to include sklearn. PC analysis is made with an alternative library in case sklearn is not included. Since the data which has imaginary parts is about the calculation of angles, I suspect there is some problem with calculation.

SamambaMan commented 5 years ago

I'm having the same problem! Has this issue been fixed?

[2019-08-12 19:39:01,124: ERROR/ForkPoolWorker-24] Task filtro.tasks.classificar_baixados[b958c2c4-93c9-49dd-88fc-8a43cef4dea2] raised unexpected: TypeError("Object of type 'complex' is not JSON serializable",)
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/app-root/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/app-root/src/filtro/tasks.py", line 150, in classificar_baixados
    aplicar_lda(m_filtro)
  File "/opt/app-root/src/filtro/tasks.py", line 160, in aplicar_lda
    dados = modelar_lda(conteudos)
  File "/opt/app-root/src/filtro/analysis.py", line 24, in modelar_lda
    saida = pyLDAvis.prepared_data_to_html(modelo)
  File "/opt/app-root/lib/python3.6/site-packages/pyLDAvis/_display.py", line 178, in prepared_data_to_html
    vis_json=data.to_json(),
  File "/opt/app-root/lib/python3.6/site-packages/pyLDAvis/_prepare.py", line 417, in to_json
    return json.dumps(self.to_dict(), cls=NumPyEncoder)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/app-root/lib/python3.6/site-packages/pyLDAvis/utils.py", line 146, in default
    return json.JSONEncoder.default(self, obj)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'complex' is not JSON serializable

philippemiron commented 2 years ago

I had the same issue... as a workaround:

vis.topic_coordinates['x'] = np.real(vis.topic_coordinates['x'])
vis.topic_coordinates['y'] = np.real(vis.topic_coordinates['y'])
vis

mbosten commented 5 months ago

For anyone still having trouble with this error, setting the normalization parameter to None instead of the 'l2' default worked for me. That is, vectorizer = TfidfVectorizer(min_df=2, norm=None)

I am not sure why this works as I am fairly unfamiliar with the mathematics behind the code, but it seems to produce results that are consistent with the underlying data in my case.