Closed computermacgyver closed 5 years ago
Update: Text fields that may contain a | must be enclosed in double quotation marks so that R can read them easily. We will adopt this format.
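As a rough illustration of the quoting rule, here is a minimal sketch using Python's standard `csv` module. The column names and sample rows are made up for the example, and `write_rows` is a hypothetical helper, not part of any existing pipeline. With `QUOTE_MINIMAL`, only fields that contain the delimiter, a quote character, or a line break are wrapped in double quotes.

```python
import csv
import io


def write_rows(rows, handle):
    """Write rows as pipe-delimited text, quoting only the fields that need it."""
    writer = csv.writer(
        handle,
        delimiter="|",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote fields containing |, " or a newline
        lineterminator="\n",
    )
    writer.writerows(rows)


# Hypothetical sample data: a header plus two tweets.
rows = [
    ["id", "location", "text"],
    ["1", "Oxford | UK", "plain tweet text"],
    ["2", "Paris", 'she said "hi" | bye'],
]

buffer = io.StringIO()
write_rows(rows, buffer)
output = buffer.getvalue()

# Reading the text back with the same dialect recovers the original fields.
parsed = list(csv.reader(io.StringIO(output), delimiter="|", quotechar='"'))
```

On the R side, the `read.table` family accepts `sep` and `quote` arguments, so a file written this way should be readable with `sep = "|"` and a double-quote `quote` setting, though that has not been tested here.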
Given the size of the files, we will split them into hourly bins with names in the format YYYY-MM-DD_HH00_WW_Twitter_Spritzer.twt.
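A small sketch of how an hourly bin name could be derived from a tweet timestamp. The function name `hourly_filename` and its default arguments are assumptions for illustration; it expects a timezone-aware datetime and converts it to UTC before formatting.

```python
from datetime import datetime, timezone


def hourly_filename(ts, region="WW", source="Twitter", dataset="Spritzer"):
    """Return the name of the hourly bin that a timezone-aware timestamp falls into."""
    ts_utc = ts.astimezone(timezone.utc)
    # Minutes are always written as 00 because each file covers one full hour.
    stamp = ts_utc.strftime("%Y-%m-%d_%H") + "00"
    return f"{stamp}_{region}_{source}_{dataset}.twt"


example = hourly_filename(datetime(2015, 3, 7, 14, 35, tzinfo=timezone.utc))
```

The date in this usage example is arbitrary; any tweet created between 14:00 and 14:59 UTC on that day would map to the same file.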
For compression see #4
Most data on RedHen appears to be separated with pipes (|).
For Twitter summary data, we can also use a pipe to separate fields, but we need to decide what to do with pipe characters that appear in tweets or free-text location fields. Should these be replaced with another character or escaped (i.e., \|)?
Currently summary files contain the text of tweets, which makes them quite large. Should they be compressed?
To date, I've used gzip compression, but it appears bz2 compression is slightly more efficient and results in smaller files. Any preference on which to use?
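One way to check the size difference on real files is to compress the same bytes with both codecs and compare lengths. The sketch below uses only the standard library; `compressed_sizes` is a hypothetical helper, and the sample data is synthetic and highly repetitive, so its ratios say nothing about which codec does better on actual tweet files.

```python
import bz2
import gzip


def compressed_sizes(data: bytes) -> dict:
    """Return the raw, gzip, and bz2 sizes in bytes for the same input."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Synthetic stand-in for a summary file; replace with the bytes of a real file.
sample = ("1|Oxford|an example tweet about the weather\n" * 5000).encode("utf-8")
sizes = compressed_sizes(sample)
```

Running this over a representative hour of real data, and timing the two calls as well, would give a fairer basis for the choice than file size alone.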
What should the final name of the files be? I'm thinking YYYY-MM-DD_0000_WW_Twitter_Spritzer.twt. This would signify the date/time at which the data starts (roughly midnight UTC), that it is world-wide (WW), from Twitter, and a result of the Spritzer sample. Other datasets would have a unique identifier in place of 'Spritzer'.
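To show how the pieces of such a name could be recovered by downstream scripts, here is a minimal parsing sketch. The regular expression and the `parse_name` helper are assumptions made for this example: it assumes the region and source are alphabetic, the dataset identifier is alphanumeric, and the time component always ends in 00.

```python
import re
from datetime import datetime, timezone

# Assumed layout: DATE_HH00_REGION_SOURCE_DATASET.twt
NAME_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2})_(\d{2})00_([A-Za-z]+)_([A-Za-z]+)_([A-Za-z0-9]+)\.twt$"
)


def parse_name(name):
    """Split a file name into its start time (UTC), region, source, and dataset."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    date, hour, region, source, dataset = match.groups()
    start = datetime.strptime(f"{date} {hour}", "%Y-%m-%d %H").replace(tzinfo=timezone.utc)
    return {"start": start, "region": region, "source": source, "dataset": dataset}


# The date below is arbitrary and only illustrates the format.
info = parse_name("2015-03-07_0000_WW_Twitter_Spritzer.twt")
```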