DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

Rolling over files #496

Open AbirRes opened 3 years ago

AbirRes commented 3 years ago

I am using twarc2 to stream tweets. I plan on keeping the stream open for a couple of weeks. I want to be able to work on the files during the collection process and also to handle any storage issues. Therefore, I am looking for a way for the tweets to start getting recorded in a new file when a certain size (e.g., 2 or 3 GB) is reached or after a specified amount of time, without breaking the stream. @edsu suggested a split command, but that, unfortunately, only works on Mac and Linux and not on Windows. Any suggestions or support would be much appreciated.

igorbrigadir commented 3 years ago

This is a great idea for a new feature.

For your immediate problem (I have not tested whether this works), you could try installing the Windows Subsystem for Linux, as described at https://www.laptopmag.com/articles/use-bash-shell-windows-10, and then running these commands to install Python and twarc:

sudo apt update && sudo apt upgrade
sudo apt install python3 python3-pip ipython3
pip3 install twarc
twarc2 configure

You should now be able to run Linux commands on Windows.

igorbrigadir commented 3 years ago

As for a potential new twarc command for this:

The first thing that comes to mind is an extra command that takes piped input and writes to compressed or uncompressed files. I remember having a simple one here: https://github.com/igorbrigadir/covid19-twitter-stream-tool/blob/master/src/stream/stream.py#L21-L47 (though for twarc, xz is not a good choice because of cross-platform compatibility and issues with xz for archiving).

twarc2 search "foo" | twarc2 output --file-rotate "hourly" --file-name "_example_pattern.jsonl" --compress

something like that maybe. Or adding extra parameters to existing methods?

twarc2 search --archive --split-files "daily" "foo" output.jsonl

something like that?
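
To make the piped idea concrete, here is a minimal Python sketch of time-based rotation. It is not an existing twarc2 command: `rotate_lines`, the `ROTATIONS` table, and the `directory` and `now` parameters are all hypothetical names chosen for illustration. Lines are bucketed by the time they arrive, not by the tweet's own timestamp, and compression is left out.

```python
import os
from datetime import datetime, timezone

# Hypothetical rotation names mapped to strftime patterns for the file name prefix.
ROTATIONS = {"hourly": "%Y-%m-%d_%H", "daily": "%Y-%m-%d"}


def rotate_lines(lines, rotate="daily", file_name="_example_pattern.jsonl",
                 directory=".", now=None):
    """Write each line to a file named after the time bucket in which it arrived."""
    if now is None:
        # Default clock: current UTC time. A fake clock can be passed in for testing.
        now = lambda: datetime.now(timezone.utc)
    pattern = ROTATIONS[rotate]
    current_bucket, out, paths = None, None, []
    try:
        for line in lines:
            bucket = now().strftime(pattern)
            if bucket != current_bucket:
                # The bucket changed: close the old file and start a new one.
                if out is not None:
                    out.close()
                path = os.path.join(directory, bucket + file_name)
                # Plain "utf-8" writes no BOM; newline="\n" avoids \r\n on Windows.
                out = open(path, "a", encoding="utf-8", newline="\n")
                paths.append(path)
                current_bucket = bucket
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()  # keep the file readable while collection continues
    finally:
        if out is not None:
            out.close()
    return paths
```

Called as `rotate_lines(sys.stdin, rotate="hourly")` from a small script at the end of a pipe, this would roll over to a new file each hour without interrupting the upstream command. On Windows it may be necessary to call `sys.stdin.reconfigure(encoding="utf-8")` first so the piped JSON is decoded as UTF-8.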

AbirRes commented 3 years ago

Thank you for the suggestions. I could never have thought of this myself.

edsu commented 3 years ago

I think twarc2 search --archive --split-files "daily" "foo" output.jsonl would be great to see. But I think it could be useful in other contexts such as filter. It might not be hard to reuse the code and command line arguments, but I wonder if it might be better to add it as a separate command that you could pipe to, and also use on previously collected data?

So something like:

twarc2 search blm output.jsonl
twarc2 bin --daily output.jsonl

or:

twarc2 search blm | twarc2 bin --daily

or:

twarc2 stream | twarc2 bin --daily

And maybe binning could work on other things, like filter tags?

twarc2 stream | twarc2 bin --tag

What is missing here is how to indicate what path and filename prefix to use when binning. Maybe there's another old unix utility that has a good interface we can borrow from?
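
One possible answer, sketched in Python rather than as a real twarc2 command: take a filename prefix and an output directory, and name each bin `<prefix>-YYYY-MM-DD.jsonl`. The function `bin_daily` and its parameters are hypothetical. The sketch also assumes the input has one tweet per line with a top-level `created_at` ISO 8601 string, which may not match the raw response format, so treat it as an illustration of the interface only.

```python
import json
import os


def bin_daily(lines, prefix="output", directory="."):
    """Append each tweet line to <directory>/<prefix>-YYYY-MM-DD.jsonl by created_at."""
    handles = {}  # one open file per day seen; fine for weeks of data, not decades
    try:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # Assumed shape: "2021-06-01T12:00:00.000Z" -> "2021-06-01" (UTC date).
            day = tweet["created_at"][:10]
            if day not in handles:
                path = os.path.join(directory, f"{prefix}-{day}.jsonl")
                # Plain "utf-8" writes no BOM; newline="\n" avoids \r\n on Windows.
                handles[day] = open(path, "a", encoding="utf-8", newline="\n")
            handles[day].write(line + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Because it bins on the tweet's own date instead of arrival time, the same function would work on a live pipe (`bin_daily(sys.stdin)`) and on a file collected earlier (`bin_daily(open("output.jsonl", encoding="utf-8"))`).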

If this seemed like a viable way forward I guess we'd want to make sure the approach works on Windows. I would want to make sure that old difficulty with Windows BOM didn't surface again.

The other risk is that people would need to conceptually understand pipes, which expects users of twarc to be a bit of a sysadmin. But they are already at the command line, so expecting a little bit more doesn't seem unfair? A teachable moment perhaps?

mr-devs commented 1 year ago

Would just like to throw my hat into this issue and comment on how extremely useful this would be. I've been searching for an hour to figure out how to do this for a long-running filtered stream.

Actually, this is a major reason why my colleagues and I typically stream tweets with a python package as opposed to from the command line (currently I am trying to migrate all data collection to twarc2). Our pipelines often require that we utilize the data gathered from the last day, which means we would have to break the stream, given twarc2's current form, and potentially miss data (please do let me know if there is another way!).

For what it's worth, my favorite syntax mentioned so far would be this:

twarc2 stream --split-files "daily" "stream.json"

and then output files could be written like "YYYY_mm_dd_stream.json".

But a bin function would probably also be very useful. As @edsu mentioned...

twarc2 search blm output.jsonl
twarc2 bin --daily output.jsonl

edsu commented 1 year ago

@mr-devs you probably already know this but if you need to do this in a pinch on Unix you can use the split command. For example to split the sample stream into files that are 1000 lines each:

twarc2 sample | split -l 1000 - sample.
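
For anyone on Windows without split, roughly the same behaviour can be sketched in a few lines of Python. This is an illustration, not part of twarc: `split_lines` is a hypothetical helper, and it numbers the files (`sample.0000`, `sample.0001`, ...) instead of using split's letter suffixes.

```python
import os


def split_lines(lines, lines_per_file=1000, prefix="sample.", directory="."):
    """Write lines to numbered files, starting a new file every lines_per_file lines."""
    paths, out = [], None
    try:
        for count, line in enumerate(lines):
            if count % lines_per_file == 0:
                # Current file is full (or this is the first line): open the next one.
                if out is not None:
                    out.close()
                path = os.path.join(directory, f"{prefix}{len(paths):04d}")
                # Plain "utf-8" writes no BOM; newline="\n" avoids \r\n on Windows.
                out = open(path, "w", encoding="utf-8", newline="\n")
                paths.append(path)
            out.write(line if line.endswith("\n") else line + "\n")
    finally:
        if out is not None:
            out.close()
    return paths
```

Passing `sys.stdin` as `lines` from a small script would let it sit at the end of a pipe in place of split. Each line is written as it arrives, so an interrupted stream loses at most the line in progress.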

I think this might be easiest to implement as a twarc-split plugin which you could use as part of a pipeline or on pre-existing files:

twarc2 stream | twarc2 split --daily

mr-devs commented 1 year ago

Thanks, @edsu !! I was not aware so this is extremely helpful — sort of learning as I go through the twarc2 migration 🤣😅

edsu commented 1 year ago

Sorry if I misled you there @mr-devs -- the twarc-split plugin doesn't exist yet, but the Unix split command does!

igorbrigadir commented 1 year ago

Gonna link a related discussion here for future reference: https://github.com/DocNow/twarc/discussions/641

twarc-split as a general-purpose plugin is a good potential solution. In the meantime, standard command line tools for splitting by lines may be a good stopgap until we have something better: https://stackoverflow.com/a/2016918/11090908