fabiogiglietto / CooRnet

Given a set of URLs, this package detects coordinated link sharing behavior on social media and outputs the network of entities that performed such behaviour.
MIT License

Difference between original dataset, ctshares and ct_shares_marked? #27

Closed PiyushKyushu closed 2 years ago

PiyushKyushu commented 2 years ago

Hi,

I have a dataset of 13,622 rows and 41 columns, which I collected using CrowdTangle's historical data option. I used this dataset to detect coordinated link sharing behaviour with CooRnet.

The output includes ctshares with 98,731 rows and 35 columns and ct_shares_marked with 13,109 rows and 37 columns.

I don't understand how and why these three datasets (my original dataset, ctshares, and ct_shares_marked) differ from each other.

I found an already closed issue, https://github.com/fabiogiglietto/CooRnet/issues/22, that is about ctshares and ct_shares_marked, but I don't understand what is being said there.

It would be very helpful if you could explain the differences between the three datasets.

Thank you in advance!

fabiogiglietto commented 2 years ago

Hi :) Did you use get_urls_from_ct_histdata to extract the URLs from CrowdTangle's CSV list of posts? If so, you should also have a number of URLs that you started from. CooRnet collects all the shares of these URLs and stores them in ct_shares.df. Unlike your original CSV, this is a list of posts that shared your URLs across the entire platform tracked by CrowdTangle. ct_shares_marked is created by get_coord_shares as part of its output. The ct_shares_marked data frame includes two additional fields (is_coordinated and is_orig) and only includes the posts related to links that were shared at least two times (this is the reason ct_shares_marked is smaller than ct_shares).
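To make the size difference concrete, here is a minimal sketch in base R on toy data. It is not CooRnet's actual code: the data frame and its column names (post_id, url) are hypothetical stand-ins for ct_shares.df, and it only illustrates the "shared at least two times" filter described above.

```r
# Toy stand-in for ct_shares.df: one row per post that shared a URL.
# Column names are hypothetical, not CooRnet's actual schema.
ct_shares_toy <- data.frame(
  post_id = 1:6,
  url = c("a.com/1", "a.com/1", "b.com/2", "c.com/3", "c.com/3", "c.com/3"),
  stringsAsFactors = FALSE
)

# Count how many times each URL was shared.
share_counts <- table(ct_shares_toy$url)

# Keep only the posts whose URL was shared at least twice.
urls_shared_twice <- names(share_counts[share_counts >= 2])
marked_toy <- ct_shares_toy[ct_shares_toy$url %in% urls_shared_twice, ]

nrow(ct_shares_toy)  # all shares
nrow(marked_toy)     # shares of URLs that appear at least twice
```

In this toy case the post that shared b.com/2 drops out because that URL was shared only once, so the filtered table has fewer rows than the full one.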

Hope I've answered your questions.

Best, Fabio

PiyushKyushu commented 2 years ago

Hi Fabio,

Thank you for the explanation. It is very helpful and certainly improves my understanding of ctshares and ct_shares_marked. The only thing I am still confused about is the URLs.

I collected 14,656 Facebook posts from CrowdTangle (my_data_original.csv).

Then I used the code below:

urls <- get_urls_from_ct_histdata(ct_histdata_csv = "my_data_original.csv")

This produced a list of 7,448 URLs with dates.

> did you used the get_urls_from_ct_histdata to extract the URLs from CrowdTangle's CSV list of posts? If so, you should also have a number of URLs that you started from.

As you mentioned, this means that the URLs I got (7,448) came from my dataset (my_data_original.csv).

Where does this number come from? What are these URLs? I mean, if I inspect my dataset manually, where (under which field) can I find them?

In other words, from which field does CooRnet extract URLs? Is it URL, Link, or some other field?

I apologise if my question seems naive or insignificant.

Thank you for the cooperation.

fabiogiglietto commented 2 years ago

It's actually a good question about a poorly documented part of the package. get_urls_from_ct_histdata attempts to extract all the links from the CrowdTangle CSV of posts. The process is performed in lines 71 to 76 of the function code (https://github.com/fabiogiglietto/CooRnet/blob/master/R/get_urls_from_ct_histdata.R).

In other terms:

1) It starts from Final Link.
2) If Final Link is empty, it uses whatever is available in Link.
3) If the post is marked as a re-share in Link Text, the Link field is used.
4) If the post is not a Link Type post, the function attempts to extract the link, if any, from the Message and Description fields.
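The fallback order described above could be sketched roughly as follows in base R. This is an illustration on toy data, not the code from get_urls_from_ct_histdata: the is_reshare column is a hypothetical flag standing in for however Link Text marks a re-share, first_url is a made-up helper, and the URL regex is an assumption.

```r
# Toy stand-in for a CrowdTangle CSV of posts. The is_reshare flag is
# hypothetical; in the real CSV the re-share marker lives in Link Text.
posts <- data.frame(
  Type        = c("Link", "Link", "Link", "Photo", "Status"),
  Final.Link  = c("https://a.com/final", NA, "https://c.com/final", NA, NA),
  Link        = c("https://a.com/short", "https://b.com/x", "https://c.com/orig", NA, NA),
  is_reshare  = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  Message     = c("", "", "", "look at https://d.com/pic now", "no link here"),
  Description = c("", "", "", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first http(s) URL found in a string, or NA if none.
first_url <- function(text) {
  if (is.na(text)) return(NA_character_)
  m <- regmatches(text, regexpr("https?://[^[:space:]]+", text))
  if (length(m) == 0) NA_character_ else m
}

# Steps 1 and 2: start from Final Link, fall back to Link when it is empty.
final_missing <- is.na(posts$Final.Link) | posts$Final.Link == ""
url <- ifelse(final_missing, posts$Link, posts$Final.Link)

# Step 3: for re-shares, use the Link field.
url <- ifelse(posts$is_reshare, posts$Link, url)

# Step 4: for non-Link posts, look for a link in Message and Description.
from_text <- vapply(paste(posts$Message, posts$Description), first_url,
                    character(1), USE.NAMES = FALSE)
url <- ifelse(posts$Type != "Link", from_text, url)

# Posts with no recoverable link drop out of the result.
urls <- url[!is.na(url)]
```

In this toy example five posts yield four URLs, because the last post contains no link in any field.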

Please also note that the CrowdTangle API link endpoint may return a post that includes multiple links. This means that you may end up with links referenced in ct_shares.df that are not in your original list. To avoid using these links, set the is_orig parameter to TRUE in get_coord_shares.
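As a rough illustration of the idea (toy data in base R, not CooRnet's implementation), flagging each share by whether its link was in the original list, and then keeping only those shares, might look like the sketch below. The data frame, its column names, and the URLs are all hypothetical.

```r
# URLs you started from (hypothetical examples).
original_urls <- c("https://a.com/1", "https://b.com/2")

# Toy shares table: post 1 contains two links, one of which was never
# in the original list.
shares <- data.frame(
  post_id = c(1, 1, 2, 3),
  url = c("https://a.com/1", "https://x.com/extra",
          "https://b.com/2", "https://y.com/other"),
  stringsAsFactors = FALSE
)

# Flag shares whose link belongs to the original list.
shares$is_orig <- shares$url %in% original_urls

# Keep only the shares of original links.
orig_only <- shares[shares$is_orig, ]
```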

Best, Fabio

PiyushKyushu commented 2 years ago

Thank you so much for the explanation Fabio.