DocNow / twarc

A command line tool (and Python library) for archiving Twitter JSON
https://twarc-project.readthedocs.io
MIT License

Plugin for ActivityStreams? #412

Open edsu opened 3 years ago

edsu commented 3 years ago

For twarc2 plugins to be able to work with both v1.1 and v2 Twitter JSON data, I wonder if it might be useful to create a plugin that provides a standard JSON representation of a tweet using ActivityStreams? This plugin could then be used by other plugins that want to work with both the v1.1 and v2 JSON representations of a tweet.

We've talked recently about using a SQLite database as an interim representation for plugins to use. I think SQLite definitely has value, especially for querying and processing data as part of the analysis. But many utilities have processed data as a stream, so I think it might make sense to have a single representation of a tweet. In fact a SQLite plugin could use this standardized representation of a tweet in order to work with v1 and v2 Twitter data. I think the conversion to ActivityStreams would potentially be a lossy transformation, but it might be a worthwhile experiment?
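To make the idea concrete, here is a minimal sketch of what such a mapping might look like. It is an illustration under stated assumptions, not an existing twarc API: the function name `tweet_to_note` is hypothetical, the choice of the ActivityStreams `Note` type and of the `twitter.com/i/...` URL patterns for identifiers is an assumption, and only a handful of fields are carried over, which shows how lossy the transformation would be.

```python
from datetime import datetime, timezone

AS_CONTEXT = "https://www.w3.org/ns/activitystreams"


def tweet_to_note(tweet):
    """Map a v1.1 or v2 tweet dict to a minimal ActivityStreams Note.

    Hypothetical helper: field selection and URL patterns are assumptions.
    """
    if "id_str" in tweet and "user" in tweet:
        # v1.1: string id in "id_str", embedded "user" object,
        # dates like "Wed Oct 10 20:19:24 +0000 2018"
        tweet_id = tweet["id_str"]
        text = tweet.get("full_text") or tweet.get("text", "")
        author_id = tweet["user"]["id_str"]
        created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        reply_to = tweet.get("in_reply_to_status_id_str")
    else:
        # v2: string id in "id", author referenced by "author_id",
        # dates like "2018-10-10T20:19:24.000Z"
        tweet_id = tweet["id"]
        text = tweet.get("text", "")
        author_id = tweet["author_id"]
        created = datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%S.%f%z")
        reply_to = next(
            (ref["id"] for ref in tweet.get("referenced_tweets", [])
             if ref.get("type") == "replied_to"),
            None,
        )

    note = {
        "@context": AS_CONTEXT,
        "type": "Note",
        "id": f"https://twitter.com/i/web/status/{tweet_id}",
        "attributedTo": f"https://twitter.com/i/user/{author_id}",
        "content": text,
        "published": created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    if reply_to:
        note["inReplyTo"] = f"https://twitter.com/i/web/status/{reply_to}"
    return note
```

With something like this, a downstream plugin would only need to understand the `Note` shape, at the cost of dropping everything the mapping does not cover (metrics, entities, media and so on).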

edsu commented 3 years ago

If the experiment is successful it might even be worth having in the core twarc package, so plugins could use it from there without an additional dependency.

edsu commented 3 years ago

A potential command line interaction:

twarc2 search auspol > search.jsonl
twarc2 activitystream search.jsonl > search-activitystream.jsonl

SamHames commented 3 years ago

I'm not keen on creating an intermediate format for the sake of an intermediate format: I'd prefer to focus on the original data as the point of interoperability. That being said, if it falls out nicely from some of the other work (like the data ingest process for SQLite) and we have interest from other projects, then it would be a nice thing to have on top of everything else.

edsu commented 3 years ago

Ok, thanks for weighing in @SamHames. I'll pursue this one on my own I think. Maybe after the v2 launch we could pursue two plugins concurrently: twarc-sqlite and twarc-activitystreams? I think it will likely get tedious to have the former work with v1.1 and v2 data without some kind of intermediate mapping, but I might be wrong. I also suspect that there will still be people who want to work with tweets as a stream without having to assemble an intermediate SQLite database, especially people who are importing into another type of database.
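As a rough illustration of the stream-oriented case, the sketch below walks a JSONL stream line by line and yields individual tweets without building any intermediate database. The function name `iter_tweets` is hypothetical. It also assumes that v2 lines are either API response objects with the tweets under a `data` key or already-flattened single tweets, and that v1.1 lines are single tweets carrying `id_str`; real output may need more careful detection.

```python
import json


def iter_tweets(lines):
    """Yield (api_version, tweet) pairs from an iterable of JSONL lines.

    Hypothetical helper: the version detection rules are assumptions.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        record = json.loads(line)
        if "data" in record:
            # assumed v2 response: "data" holds one tweet or a list of tweets
            data = record["data"]
            tweets = data if isinstance(data, list) else [data]
            for tweet in tweets:
                yield "v2", tweet
        elif "id_str" in record:
            # assumed v1.1: one tweet per line
            yield "v1.1", record
        elif "id" in record:
            # assumed flattened v2: one tweet per line
            yield "v2", record
```

A consumer could then loop with `for version, tweet in iter_tweets(sys.stdin)` and hand each tweet to whatever importer or converter it needs, one record at a time.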