gedankenstuecke / twitter-analyser

export data from twitter archive and visualize it
http://twarxiv.org
MIT License

Write tests #3

Open gedankenstuecke opened 6 years ago

gedankenstuecke commented 6 years ago

The old problem :P

troublemagnet commented 6 years ago

@gedankenstuecke may I work on this?

gedankenstuecke commented 6 years ago

Absolutely! 👍

jobliz commented 6 years ago

@gedankenstuecke I was thinking about this issue and realized I'm unsure how to test the data loading and analysis process. The create_main_dataframe function needs a Twitter zip archive to be tested, and the functions in analyse_data.py all take the resulting dataframe as a parameter, so I think it comes down to where a Twitter archive file should be stored. I see two alternatives:

In a project I once did we placed the test data inside the code repository, but I'm not sure if that'd fit the idea behind twitter-analyser, so I wanted to know more before doing a pull request.

gedankenstuecke commented 6 years ago

@jobliz I think having an archive specific test would be okay with me.

To keep the file size (and test duration) manageable, one could just reduce the data in test_archive.zip to something like two months' worth of tweets (e.g. take December 2008 and January 2009, so that it also includes the year boundary).
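A rough sketch of how such a reduced archive could be produced with a script rather than by hand. It assumes the archive stores one tweet file per month under a tweets directory, named like 2008_12.js; that layout, the function name minimize_archive and the file names in the usage line are assumptions for illustration, not something taken from twitter-analyser itself.

```python
import re
import zipfile

# Assumed layout: one file per month, e.g. "data/js/tweets/2008_12.js".
MONTH_FILE = re.compile(r"(?:^|/)tweets/(\d{4}_\d{2})\.js$")


def minimize_archive(src_path, dst_path, keep_months):
    """Copy a zip archive, dropping monthly tweet files not in keep_months.

    keep_months is a set of "YYYY_MM" strings. Every entry that is not a
    monthly tweet file is copied unchanged. Returns the copied entry names.
    """
    kept = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            match = MONTH_FILE.search(info.filename)
            if match and match.group(1) not in keep_months:
                continue  # skip tweet files outside the chosen months
            dst.writestr(info, src.read(info.filename))
            kept.append(info.filename)
    return kept
```

Usage would look something like minimize_archive("full_archive.zip", "test_archive.zip", {"2008_12", "2009_01"}). The sketch only drops files; any tweet count or index stored elsewhere in the archive is copied as-is and would still need adjusting separately.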

Does that sound good?

jobliz commented 6 years ago

Sure! I'll create a minimized archive and do new tests with it.

gedankenstuecke commented 6 years ago

Awesome, thanks!

jobliz commented 6 years ago

I made a minified archive file by deleting JSON files in the /tweets directory and manually setting a new tweet total in payload_details.js, then ran the analysis functions on the resulting dataframe. Unexpectedly, two functions raised exceptions: test_create_timeline raises NotImplementedError: Not supported for type RangeIndex, and create_hourly_stats raises KeyError: 'Weekday'. This doesn't happen with the full archive (and yes, testing takes far too long to be practical with all tweets). I tried the same tests with a four-month archive that also includes November 2008 and February 2009, but they raised the same exceptions.

Given that I'm not entirely sure why this is happening in the code and the tests aren't really passing I'm not yet making a PR, but the code is in a branch in my fork.

Do you have any idea why a minimized archive might raise exceptions?

gedankenstuecke commented 6 years ago

I haven't found time to look closely into the data. But I suspect that both create_hourly_stats and create_timeline might crash due to a lack of GPS data in the 2008/2009 data, as they only work for tweets that have a latitude and longitude, which are needed to find out the time zone and to plot the tweet on the map later on.

Looking at my own data, it seems that geolocations on Twitter only became available in mid-2011. So I probably just made a bad recommendation on which data to pick for the test set.

This would also explain why it doesn't happen for the full data set, as we do have GPS coordinates in the full archive.
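If that suspicion holds, a candidate test fixture could be checked for geolocated tweets before the location-dependent analyses are run on it. The following is only a sketch: it works on plain tweet dicts rather than on twitter-analyser's dataframe, and the assumed record shape (an optional "geo" entry holding a "coordinates" pair) as well as both function names are hypothetical.

```python
def tweets_with_coordinates(tweets):
    """Return the tweets that carry a latitude/longitude pair.

    Assumes each tweet is a dict whose optional "geo" entry looks like
    {"coordinates": [lat, lon]}; this shape is an assumption.
    """
    located = []
    for tweet in tweets:
        # A missing or empty "geo" entry is treated as "no location".
        coords = (tweet.get("geo") or {}).get("coordinates")
        if coords and len(coords) == 2:
            located.append(tweet)
    return located


def can_run_geo_analysis(tweets):
    """True if at least one tweet has coordinates."""
    return bool(tweets_with_coordinates(tweets))
```

A test set for which can_run_geo_analysis returns False would be expected to hit the same failure mode as the 2008/2009 sample, assuming missing coordinates really are the cause.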

jobliz commented 6 years ago

That makes sense. I will make a minimized archive from months later than mid-2011 and see how it goes. If I'm getting it right, these crashes will also happen when loading full archives from people who haven't turned on location data. When I receive my archive from Twitter I will check whether it does, to be sure.

gedankenstuecke commented 6 years ago

Great, thanks so much!