digitalmethodsinitiative / zeeschuimer

A browser extension to collect social media data with.

Adding ingest module for Gab #21

Closed Parker-Kasiewicz closed 8 months ago

Parker-Kasiewicz commented 1 year ago

Adding a module that records data while browsing Gab. It can grab posts from search, the explore page, a group, or a user's account page. It can also grab data about groups and users' accounts. For whatever reason, posts found in search have a different format than other posts and seem to include more information about the accompanying user in the post itself, although normal posts still include the user id. Currently only storing the main fields.

Hope this is useful :)

stijn-uva commented 11 months ago

Hi @Parker-Kasiewicz, thanks so much for this contribution. I think Gab is certainly a platform we could add support for. The module as currently implemented does deviate a bit from our vision for the extension, which is that modules more or less collect one kind of item, for example TikTok posts, Instagram posts (but not stories), et cetera.

The code in this PR additionally collects groups and users. I can certainly see a use case for that, but it can be a bit confusing for users to have multiple types of data in a single dataset, and it would also make it more complicated to export the data to 4CAT, which is an important use case for us. 4CAT could just ignore some of the data, but then that would also be confusing as so far it has always been possible to just export the 'whole' dataset.

I think to merge this the module would need to limit itself to just posts (from search or elsewhere). Potentially the rest could be in a separate module, but before we go down that road I think we'd need to make some changes to the interface to e.g. allow multiple modules per platform, and that's not something we've quite figured out yet. So for now Zeeschuimer modules will need to limit themselves to ~one type of data, unfortunately.
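As a rough illustration of the "one type of data per module" constraint described above, here is a minimal sketch that keeps only post-like objects from a mixed capture. The field names (`id`, `content`, `title`, `username`) and the `looks_like_post` heuristic are hypothetical and not taken from Gab's actual responses or from Zeeschuimer's code.

```python
def looks_like_post(obj):
    """Hypothetical heuristic: an object with both an id and a text body is a post."""
    return isinstance(obj, dict) and "id" in obj and "content" in obj


def keep_posts(captured):
    """Return only the post-like objects, leaving each object unmodified."""
    return [obj for obj in captured if looks_like_post(obj)]


# Illustrative mixed capture: one post, one group, one user account.
captured = [
    {"id": "1", "content": "a post"},
    {"id": "g1", "title": "a group"},
    {"id": "u1", "username": "alice"},
]

posts = keep_posts(captured)
```

With a filter like this, the exported dataset contains a single item type, so a downstream tool does not have to silently drop part of it.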

Parker-Kasiewicz commented 11 months ago

Hi @stijn-uva, thanks for the feedback - makes sense! I updated the module so that it only collects posts, and the group, account, and media information is included in the same data type. I think it aligns closely with the other modules, and I've tested it out in 4CAT, where it is able to grab everything (as far as I can tell)!

Making a corresponding PR to 4CAT right now so that you can try it out for yourself! Let me know if there are any issues :)

Parker-Kasiewicz commented 10 months ago

Hi team, just wanted to check in on the status of this PR. I'm working on corresponding modules for Truth Social for Zeeschuimer and 4CAT, so I'd love to get feedback on the ones for Gab before I submit those!

stijn-uva commented 10 months ago

Hi @Parker-Kasiewicz, thanks for the reminder! I thought I had left a new comment on this, but clearly I didn't, apologies for that. The code as is does some work on the captured objects before storing them for export to 4CAT or downloading as NDJSON. For our other modules, we have generally tried to store the data mostly as-is, and do the transformation in 4CAT instead.

This has the advantage that Zeeschuimer keeps working even if the format of the data (Gab's, in this case) changes; in that case we would only need to update 4CAT. We also do not always know what people using Zeeschuimer will be interested in, so by storing the otherwise unprocessed data objects we make sure the data is still there if someone wants it, even if it is an obscure field that we would not currently consider including in a stored data object. For example, 4CAT's TikTok processor has added new columns a few times based on suggestions from researchers using it, and this way the 'new' data is available for older datasets too, because it was always present in the underlying JSON originally provided by TikTok.

My own approach would thus have been to just have Zeeschuimer store post, and to put the code that now transforms it into transformedPost in 4CAT instead. So I would probably request that the code be refactored that way, but perhaps there are reasons to do it this way for Gab specifically - I am not familiar enough with the data to know. In any case, let me know what you think.
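The "store raw, transform later" idea can be sketched as follows. This is an illustration under stated assumptions, not 4CAT's actual processor code: the field names in `RAW_NDJSON`, the `map_item` and `map_dataset` helpers, and the column definitions are all hypothetical. The point it shows is that a column added later can be filled in for an older dataset, because the raw JSON was kept.

```python
import json

# Hypothetical raw objects as stored by the capture side, one JSON document
# per line (NDJSON). Field names are illustrative, not Gab's real schema.
RAW_NDJSON = "\n".join([
    json.dumps({"id": "1", "content": "hello", "account": {"username": "alice"}, "language": "en"}),
    json.dumps({"id": "2", "content": "hoi", "account": {"username": "bob"}, "language": "nl"}),
])


def map_item(raw, columns):
    """Flatten one raw object into a row, following a dotted path per column."""
    row = {}
    for column, path in columns.items():
        value = raw
        for key in path.split("."):
            # A missing key or non-dict value yields None rather than an error.
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
    return row


def map_dataset(ndjson, columns):
    """Apply the column mapping to every non-empty line of an NDJSON string."""
    return [map_item(json.loads(line), columns) for line in ndjson.splitlines() if line.strip()]


# First version of the mapping, and a later version with one extra column.
COLUMNS_V1 = {"id": "id", "body": "content", "author": "account.username"}
COLUMNS_V2 = {**COLUMNS_V1, "language": "language"}

rows_v1 = map_dataset(RAW_NDJSON, COLUMNS_V1)
rows_v2 = map_dataset(RAW_NDJSON, COLUMNS_V2)
```

Here `rows_v2` is produced from exactly the same stored data as `rows_v1`; only the mapping on the analysis side changed.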

Parker-Kasiewicz commented 9 months ago

Hi @stijn-uva, thanks for the feedback! What you said totally makes sense. As such, I've updated the code to eliminate all unnecessary processing. Because of the way Gab stores data for the authors of posts, as well as links & images, I have to do a little bit of processing to add them to their respective posts, but all I'm doing is throwing in the underlying JSON.
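A minimal sketch of the kind of light processing described here: related objects are looked up by id and attached to each post as untouched JSON. The response layout (`statuses`, `accounts`, `media_attachments`) and the id fields (`account_id`, `media_attachment_ids`) are assumptions made for illustration, not Gab's documented format or the module's actual code.

```python
import copy


def attach_related(response):
    """Attach the raw author and media objects to each post, by id.

    Assumes a hypothetical response in which posts reference accounts and
    media by id and the referenced objects are listed separately.
    """
    accounts = {a["id"]: a for a in response.get("accounts", [])}
    media = {m["id"]: m for m in response.get("media_attachments", [])}

    posts = []
    for post in response.get("statuses", []):
        enriched = copy.deepcopy(post)  # do not mutate the captured object
        # The related objects are inserted as-is, with no field selection.
        enriched["account"] = accounts.get(post.get("account_id"))
        enriched["media_attachments"] = [
            media[i] for i in post.get("media_attachment_ids", []) if i in media
        ]
        posts.append(enriched)
    return posts


response = {
    "statuses": [{"id": "p1", "content": "hi", "account_id": "u1", "media_attachment_ids": ["m1", "m9"]}],
    "accounts": [{"id": "u1", "username": "alice", "followers_count": 3}],
    "media_attachments": [{"id": "m1", "url": "https://example.com/a.png"}],
}

posts = attach_related(response)
```

Because the attached objects are copied in whole, any field of the author or media JSON remains available for later mapping.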

I've also updated my corresponding 4CAT pull request to reflect these changes, so the processing all works in 4CAT as well. Let me know what you think!

Thanks, Parker

stijn-uva commented 9 months ago

Thanks for this update, this is more in line with how other modules are set up 👍

I noticed that search results (e.g. https://gab.com/search/top?q=amsterdam) are not captured, is this by design? Search is mentioned as supported in an earlier comment, but perhaps this changed after your recent refactoring?

Parker-Kasiewicz commented 9 months ago

Just an oversight on my end, fixed now! Will adjust the 4CAT code accordingly if this is the final version for Zeeschuimer, so let me know @stijn-uva

stijn-uva commented 8 months ago

Thanks! I'll merge this now, the 4CAT counterpart looks mostly ready apart from a few small notes that you can find over there.