gedankenstuecke opened this issue 6 years ago
@gedankenstuecke may I work on this?
Absolutely! 👍
@gedankenstuecke I was thinking about this issue and realized I have doubts about how to test the data loading and analysis process. The `create_main_dataframe` function needs a Twitter zip archive to be tested, and the functions in `analyse_data.py` all need the created dataframe as a parameter, so I think it comes down to where a Twitter archive file should be stored. I see two alternatives:
1. The tests would be very general and able to pass with any Twitter archive that the dev places in a specific directory, so the test data would be downloaded separately after cloning the repository. I'm not sure, though, how one could test specific analysis return values from potentially different archives.
2. The tests would be archive-specific (for example, depending on a `test_archive.zip`). In that case I see three ways to go: the documentation could instruct the user to download the file from a static URL, the file itself could be uploaded to the repository, or a new script could download the archive from a static URL once, before the tests are run.
In a project I once worked on we placed the test data inside the code repository, but I'm not sure if that would fit the idea behind twitter-analyser, so I wanted to know more before opening a pull request.
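For the archive-specific option, roughly what I have in mind is a shared pytest fixture along these lines (the import location, the fixture, and the placement of `test_archive.zip` next to the test module are just placeholders, not the project's actual layout):

```python
# Sketch only: assumes test_archive.zip sits next to the test module and
# that create_main_dataframe accepts a path to the zipped archive.
import os

import pytest

from analyse_data import create_main_dataframe  # assumed import location

TEST_ARCHIVE = os.path.join(os.path.dirname(__file__), "test_archive.zip")


@pytest.fixture(scope="module")
def dataframe():
    # Build the dataframe once and reuse it across all analysis tests.
    return create_main_dataframe(TEST_ARCHIVE)


def test_dataframe_is_not_empty(dataframe):
    assert len(dataframe) > 0
```

The analysis tests would then take the `dataframe` fixture as an argument and assert on values that are fixed for that specific archive.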
@jobliz I think having an archive-specific test would be okay with me.
To keep the file size (and test duration) manageable, one could minimize the data in the `test_archive.zip` to something like two months' worth of data (e.g. December 2008 and January 2009, so that it also covers the break between two years).
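Something along these lines might work for trimming a full export down; I'm guessing at the internal zip layout here, and the tweet total in `payload_details.js` would still need to be adjusted by hand:

```python
# Rough sketch for trimming a full archive; the paths inside the zip are an assumption.
import zipfile

KEEP_MONTHS = {"2008_12", "2009_01"}  # December 2008 and January 2009


def trim_archive(src_path, dst_path):
    with zipfile.ZipFile(src_path) as full, zipfile.ZipFile(dst_path, "w") as small:
        for member in full.namelist():
            # Skip monthly tweet files outside the window; copy everything
            # else (payload_details.js, assets, ...) unchanged.
            if "/tweets/" in member and not any(m in member for m in KEEP_MONTHS):
                continue
            small.writestr(member, full.read(member))
```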
Does that sound good?
Sure! I'll create a minimized archive and write new tests against it.
Awesome, thanks!
I made a minified archive by deleting JSON files in the `/tweets` directory and manually setting a new tweet total in `payload_details.js`, then tried to run the analysis functions on the resulting dataframe. Unexpectedly, two functions raised exceptions: `test_create_timeline` raises `NotImplementedError: Not supported for type RangeIndex` and `create_hourly_stats` raises `KeyError: 'Weekday'`. This doesn't happen with the full archive (and yes, testing takes far too long to be practical with all tweets). I tried the same tests with a four-month archive covering November 2008 through February 2009, and it raised the same exceptions.
Given that I'm not entirely sure why this is happening in the code and the tests aren't really passing, I'm not making a PR yet, but the code is in a branch on my fork.
Do you have any idea why a minimized archive might raise these exceptions?
I haven't found time to look closely into the data, but I suspect that both `create_hourly_stats` and `create_timeline` might crash due to a lack of GPS data in the 2008/2009 tweets, as they only work for tweets that have `latitude` & `longitude` (needed to figure out the time zone and, later on, to plot points on the map).
Looking at my own data, it seems that geolocation on Twitter only became available around mid-2011, so I probably just made a bad recommendation on which data to pick for the test set.
This would also explain why it doesn't happen for the full data set, as we do have GPS coordinates in the full archive.
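If that's what is going on, a guard along these lines (just an illustration; the column names and helper are assumptions, not the current code) could let the geo-dependent steps be skipped instead of crashing:

```python
import pandas as pd


def has_geo_data(dataframe: pd.DataFrame) -> bool:
    # Hypothetical helper: True only if at least one tweet carries coordinates.
    if "latitude" not in dataframe.columns or "longitude" not in dataframe.columns:
        return False
    return dataframe[["latitude", "longitude"]].notna().all(axis=1).any()
```

`create_timeline` and `create_hourly_stats` could then return early (and the corresponding tests be skipped) whenever `has_geo_data` is False.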
That makes sense. I will make a minimized archive from months after mid-2011 and see how it goes. If I'm understanding it right, these crashes will also happen when loading full archives from people who haven't turned on location data. When I receive my own archive from Twitter I will check it to be sure.
Great, thanks so much!
The old problem :P