datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

ValueError in setting date range index in compute_activity (Example Notebook Corr Centrality Community) #246

Closed: nllz closed this issue 6 years ago

nllz commented 8 years ago

At the following step:

import math
from itertools import repeat

# archives, total_month, date_from, date_to, filter_by_date and Archive
# are defined in earlier cells of the notebook.
new_dict = [{} for i in repeat(None, total_month)]   # per-month log-scaled activity per sender
new_dict1 = [{} for i in repeat(None, total_month)]  # per-month raw message counts per sender
for t in range(total_month):
    filtered_activity = []
    for i in range(5):
        df = archives[i].data
        fdf = filter_by_date(df, date_from[t], date_to[t])
        filtered_activity.append(Archive(fdf).get_activity().sum())
    for k in range(len(filtered_activity)):
        for g in range(len(filtered_activity[k])):
            # .keys()[g] and .get_values() are old-pandas idioms; the notebook ran on Python 2.7
            original_key = filtered_activity[k].keys()[g]
            # keep only the substring between the parentheses of the 'From' key
            new_key = original_key[original_key.index("(") + 1:original_key.rindex(")")]
            if new_key not in new_dict[t]:
                new_dict[t][new_key] = 0
                new_dict1[t][new_key] = 0
            new_dict[t][new_key] += math.log(filtered_activity[k].get_values()[g] + 1)
            # can define community membership by changing the above line;
            # for example, use the direct sum of emails instead
            new_dict1[t][new_key] += filtered_activity[k].get_values()[g]

I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-e2560bdec7c0> in <module>()
      6         df = archives[i].data
      7         fdf = filter_by_date(df,date_from[t],date_to[t])
----> 8         filtered_activity.append(Archive(fdf).get_activity().sum())
      9     for k in range(len(filtered_activity)):
     10         for g in range(len(filtered_activity[k])):

/home/gogol/Data/bigbang/bigbang/archive.pyc in get_activity(self, resolved)
    111         """
    112         if self.activity is None:
--> 113             self.activity = self.compute_activity(self)
    114 
    115         if resolved:

/home/gogol/Data/bigbang/bigbang/archive.pyc in compute_activity(self, clean)
    137             ['From', 'Date']).size().unstack('From').fillna(0)
    138 
--> 139         new_date_range = np.arange(mdf2['Date'].min(), mdf2['Date'].max())
    140         # activity.set_index('Date')
    141 

ValueError: cannot convert float NaN to integer
sbenthall commented 8 years ago

Thanks for making this ticket!

I have been unable to reproduce this error running things locally.

We are talking about this notebook, correct?

https://github.com/sbenthall/bigbang/blob/master/examples/Corr%20between%20centrality%20and%20community%200.1.ipynb

Can you tell me which version of Python you are using?

Also, which mailing list data are you using for this?

A general problem with the mailing list analysis is that email is actually very messy data because its standard is loose and leaves a lot of room for interpretation by clients. So it's possible that the data you are using has values that break the current code.
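As a concrete illustration of the kind of thing that goes wrong (a minimal pandas sketch, not code from the notebook or from bigbang): a malformed Date header that some client produced parses to NaT, and that missing value can then surface in anything derived from the Date column.

import pandas as pd

# Hypothetical Date headers of the kind mail clients sometimes emit.
raw_dates = pd.Series([
    "03 May 2016 10:12:00",   # well-formed
    "not a real date",        # garbage from a broken client
    "",                       # missing entirely
])

parsed = pd.to_datetime(raw_dates, errors="coerce")
print(parsed)
# The malformed entries come back as NaT, so missing values can propagate
# into any column computed from the Date header.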

nllz commented 8 years ago

I am using Anaconda 2 with Python 2.7.11, and that is indeed the notebook. I altered it a bit to use it with lists.ncuc.org/cgi-bin/mailman/listinfo/ncuc-discuss and https://mm.icann.org/pipermail/cc-humanrights/.

Here is my notebook (added txt to be able to upload here): Corr between centrality and community 0.1 ICANN.ipynb.txt

Thanks!

sbenthall commented 8 years ago

OK, I've been able to replicate the problem locally. Not sure about the fix yet.

How high a priority is getting this notebook to work for you?

nllz commented 8 years ago

Single word trend and special word analysis have a higher priority for me.

npdoty commented 6 years ago

I'm running into this as well, with some of the malformed Dates that I'm getting while parsing IETF lists.

I'm confused by the error though. Both .min() and .max() are supposed to ignore NA values when calculating, so I'm not sure how one of them could end up returning NaN.
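Checking the pandas behavior in isolation (just a sketch, not bigbang code): skipna does skip NA values, but when every value is NA there is nothing left to reduce over and min()/max() themselves return NaN.

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.min(), s.max())            # 1.0 3.0 -- NaN values are skipped

all_na = pd.Series([np.nan, np.nan])
print(all_na.min(), all_na.max())  # nan nan -- nothing left to reduce over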

Also, honestly, I have no idea what the purpose of these lines is. Why do we need to reindex activity here? Why does it need to be reindexed with a calculated range, rather than just on that column?

Earlier in the method, there's a "# unnecessary?" comment on a line that is maybe supposed to drop NA values, although I'm not sure that it actually does.
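If the intent of the reindexing is to give the activity table one row per calendar day, so that days with no mail show up as zeros rather than being absent, it would be doing something like this minimal pandas sketch (a guess at the intent, not bigbang's actual code):

import pandas as pd

# Per-sender daily counts with a gap between May 1 and May 4.
activity = pd.DataFrame(
    {"alice@example.org": [2, 1]},
    index=pd.to_datetime(["2016-05-01", "2016-05-04"]),
)

# Reindex onto a continuous daily range; missing days are filled with 0.
full_range = pd.date_range(activity.index.min(), activity.index.max(), freq="D")
print(activity.reindex(full_range, fill_value=0))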

npdoty commented 6 years ago

Debugging note: I'm particularly seeing this issue when there was an error loading the archives and I actually have zero records in the dataframe. (Maybe min() and max() return NaN when there are no rows.)
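That would also explain the exact error text: with an empty frame the computed min and max are NaN, and np.arange then fails when it tries to turn a NaN-sized range into an integer length. A minimal sketch (not bigbang code; the empty frame stands in for an archive that failed to load):

import numpy as np
import pandas as pd

empty = pd.DataFrame({"Date": pd.Series([], dtype="float64")})

start, stop = empty["Date"].min(), empty["Date"].max()
print(start, stop)      # nan nan

np.arange(start, stop)
# ValueError: cannot convert float NaN to integer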

sbenthall commented 6 years ago

Trying to answer the questions here: https://github.com/datactive/bigbang/issues/246#issuecomment-364621033

sbenthall commented 6 years ago

I am unable to replicate this error.

sbenthall commented 6 years ago

Reassigning to @npdoty because I'm not sure how to proceed, and he's encountered the error.

npdoty commented 6 years ago

I believe it is because of empty data. That error is of course very confusing and doesn't hint that the problem is that the Archive has no data in it. I've opened a PR to change the behavior to throw an explicit MissingDataException, so that there is a clear explanation of the problem and the user can handle it as they wish.
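For reference, the shape of such a guard would be roughly the following (not the actual PR code; only the exception name comes from the comment above, everything else is illustrative):

class MissingDataException(Exception):
    """Raised when an archive has no messages to compute activity over."""

def compute_activity(df):
    # Fail early with a clear message instead of letting np.arange choke
    # later on a NaN date range derived from an empty frame.
    if df.empty:
        raise MissingDataException(
            "Archive contains no messages; cannot compute activity.")
    # ... rest of the activity computation ...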