datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

ValueError in setting date range index in compute_activity (Example Notebook Corr Centrality Community) #246

Closed: nllz closed this issue 6 years ago

nllz commented 8 years ago

At the following step:

import math
from itertools import repeat

# archives, total_month, date_from, date_to, filter_by_date and Archive
# are defined in earlier cells of the notebook.
new_dict = [{} for i in repeat(None, total_month)]   # per-month log-scaled activity per sender
new_dict1 = [{} for i in repeat(None, total_month)]  # per-month raw message counts per sender
for t in range(total_month):
    filtered_activity = []
    for i in range(5):
        df = archives[i].data
        fdf = filter_by_date(df, date_from[t], date_to[t])
        filtered_activity.append(Archive(fdf).get_activity().sum())
    for k in range(len(filtered_activity)):
        for g in range(len(filtered_activity[k])):
            # .keys()[g] and .get_values() are old-pandas idioms; the notebook ran on Python 2.7
            original_key = filtered_activity[k].keys()[g]
            # keep only the substring between the parentheses of the 'From' key
            new_key = original_key[original_key.index("(") + 1:original_key.rindex(")")]
            if new_key not in new_dict[t]:
                new_dict[t][new_key] = 0
                new_dict1[t][new_key] = 0
            new_dict[t][new_key] += math.log(filtered_activity[k].get_values()[g] + 1)
            # can define community membership by changing the above line;
            # for example, use the direct sum of emails instead
            new_dict1[t][new_key] += filtered_activity[k].get_values()[g]

I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-e2560bdec7c0> in <module>()
      6         df = archives[i].data
      7         fdf = filter_by_date(df,date_from[t],date_to[t])
----> 8         filtered_activity.append(Archive(fdf).get_activity().sum())
      9     for k in range(len(filtered_activity)):
     10         for g in range(len(filtered_activity[k])):

/home/gogol/Data/bigbang/bigbang/archive.pyc in get_activity(self, resolved)
    111         """
    112         if self.activity is None:
--> 113             self.activity = self.compute_activity(self)
    114 
    115         if resolved:

/home/gogol/Data/bigbang/bigbang/archive.pyc in compute_activity(self, clean)
    137             ['From', 'Date']).size().unstack('From').fillna(0)
    138 
--> 139         new_date_range = np.arange(mdf2['Date'].min(), mdf2['Date'].max())
    140         # activity.set_index('Date')
    141 

ValueError: cannot convert float NaN to integer
sbenthall commented 8 years ago

Thanks for making this ticket!

I have been unable to reproduce this error running things locally.

We are talking about this notebook, correct?

https://github.com/sbenthall/bigbang/blob/master/examples/Corr%20between%20centrality%20and%20community%200.1.ipynb

Can you tell me which version of Python you are using?

Also, which mailing list data are you using for this?

A general problem with the mailing list analysis is that email is actually very messy data because its standard is loose and leaves a lot of room for interpretation by clients. So it's possible that the data you are using has values that break the current code.
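As a concrete illustration of the kind of thing that goes wrong (a minimal pandas sketch, not code from the notebook or from bigbang): a malformed Date header that some client produced parses to NaT, and that missing value can then surface in anything derived from the Date column.

import pandas as pd

# Hypothetical Date headers of the kind mail clients sometimes emit.
raw_dates = pd.Series([
    "03 May 2016 10:12:00",   # well-formed
    "not a real date",        # garbage from a broken client
    "",                       # missing entirely
])

parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)
# The malformed entries come back as NaT, so missing values can propagate
# into any column computed from the Date header.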

nllz commented 8 years ago

I am using Anaconda 2 with Python 2.7.11, and that is indeed the notebook. I altered it a bit to use it with lists.ncuc.org/cgi-bin/mailman/listinfo/ncuc-discuss and https://mm.icann.org/pipermail/cc-humanrights/.

Here is my notebook (added txt to be able to upload here): Corr between centrality and community 0.1 ICANN.ipynb.txt

Thanks!

sbenthall commented 8 years ago

OK, I've been able to replicate the problem locally. Not sure about the fix yet.

How high a priority is getting this notebook to work for you?

nllz commented 8 years ago

Single word trend and special word analysis have a higher priority for me.

npdoty commented 6 years ago

I'm running into this as well, with some of the malformed Dates that I'm getting while parsing IETF lists.

I'm confused by the error though. Both .min() and .max() are supposed to ignore NA values when calculating, so I'm not sure how one of them could end up returning NaN.
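Checking the pandas behavior in isolation (just a sketch, not bigbang code): skipna does skip NA values, but when every value is NA there is nothing left to reduce over and min()/max() themselves return NaN.

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.min(), s.max())            # 1.0 3.0 -- NaN values are skipped

all_na = pd.Series([np.nan, np.nan])
print(all_na.min(), all_na.max())  # nan nan -- nothing left to reduce over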

Also, honestly, I have no idea what the purpose of these lines is. Why do we need to reindex activity here? Why does it need to be reindexed with a calculated range, rather than just on that column?

Earlier in the method, there's a "# unnecessary?" comment on a line that is maybe supposed to drop NA values, although I'm not sure that it actually does.
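If the intent of the reindexing is to give the activity table one row per calendar day, so that days with no mail show up as zeros rather than being absent, it would be doing something like this minimal pandas sketch (a guess at the intent, not bigbang's actual code):

import pandas as pd

# Per-sender daily counts with a gap between May 1 and May 4.
activity = pd.DataFrame(
    {"alice@example.org": [2, 1]},
    index=pd.to_datetime(["2016-05-01", "2016-05-04"]),
)

# Reindex onto a continuous daily range; missing days are filled with 0.
full_range = pd.date_range(activity.index.min(), activity.index.max(), freq="D")
print(activity.reindex(full_range, fill_value=0))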

npdoty commented 6 years ago

Debugging note: I'm particularly seeing this issue when there was an error loading the archives and I actually have zero records in the dataframe. (Maybe min() and max() return NaN when there are no rows.)
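That would also explain the exact error text: with an empty frame the computed min and max are NaN, and np.arange then fails when it tries to turn a NaN-sized range into an integer length. A minimal sketch (not bigbang code; the empty frame stands in for an archive that failed to load):

import numpy as np
import pandas as pd

empty = pd.DataFrame({"Date": pd.Series([], dtype="float64")})

start, stop = empty["Date"].min(), empty["Date"].max()
print(start, stop)      # nan nan

np.arange(start, stop)
# ValueError: cannot convert float NaN to integer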

sbenthall commented 6 years ago

Trying to answer the questions here: https://github.com/datactive/bigbang/issues/246#issuecomment-364621033

sbenthall commented 6 years ago

I am unable to replicate this error.

sbenthall commented 6 years ago

Reassigning to @npdoty because I'm not sure how to proceed, and he's encountered the error.

npdoty commented 6 years ago

I believe it is because of empty data. That error is of course very confusing and doesn't hint that the problem is that the Archive has no data in it. I've opened a PR to change the behavior to throw an explicit MissingDataException, so that there is a clear explanation of the problem and the user can handle it as they wish.
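For reference, the shape of such a guard would be roughly the following (not the actual PR code; only the exception name comes from the comment above, everything else is illustrative):

class MissingDataException(Exception):
    """Raised when an archive has no messages to compute activity over."""

def compute_activity(df):
    # Fail early with a clear message instead of letting np.arange choke
    # later on a NaN date range derived from an empty frame.
    if df.empty:
        raise MissingDataException(
            "Archive contains no messages; cannot compute activity.")
    # ... rest of the activity computation ...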