Difference between original dataset, ctshares and ct_shares_marked?

PiyushKyushu commented 2 years ago

Hi,

I have a dataset of 13622 rows and 41 columns, which I collected using CrowdTangle's historical data option. I used this dataset to detect coordinated link sharing behaviour with CooRnet.

The output includes ctshares with 98,731 rows and 35 columns and ct_shares_marked with 13109 rows and 37 columns.

I don't understand how and why these three datasets (my original dataset, ctshares, and ct_shares_marked) differ from each other.

I found an already closed issue, https://github.com/fabiogiglietto/CooRnet/issues/22, that is about ctshares and ct_shares_marked, but I don't understand what is being said there.

It would be very helpful if you could explain the differences between the three datasets.

Thank you in Advance!

fabiogiglietto commented 2 years ago

Hi :) Did you use get_urls_from_ct_histdata to extract the URLs from CrowdTangle's CSV list of posts? If so, you should also have a number of URLs that you started from. CooRnet collects all the shares of these URLs and stores them in ct_shares.df. Unlike your original CSV, this is a list of posts that shared your URLs across the entire platform tracked by CrowdTangle. ct_shares_marked is created by get_coord_shares as part of its output. The ct_shares_marked data frame has two additional fields (is_coordinated and is_orig) and only includes the posts related to links that were shared at least two times (this is why ct_shares_marked is smaller than ct_shares).
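To make the size difference concrete, here is a minimal base R sketch on toy data. It does not call CooRnet; the data frame and the column names post_id, account and url are made up for illustration, and it only mimics the "shared at least two times" filter described above, not the is_coordinated and is_orig fields.

```r
# Toy stand-in for ct_shares.df: one row per post that shared a URL.
# (Hypothetical columns; the real data frame has many more.)
ct_shares <- data.frame(
  post_id = 1:6,
  account = c("A", "B", "C", "A", "D", "E"),
  url     = c("u1", "u1", "u1", "u2", "u3", "u3"),
  stringsAsFactors = FALSE
)

# Count how many times each URL was shared.
share_counts <- table(ct_shares$url)

# Keep only posts whose URL was shared at least twice.
shared_twice <- names(share_counts)[share_counts >= 2]
ct_shares_marked <- ct_shares[ct_shares$url %in% shared_twice, ]

nrow(ct_shares)         # all shares
nrow(ct_shares_marked)  # shares of URLs that appear at least twice
```

In this toy case the post sharing u2 is dropped because no other post shared that URL, so the filtered data frame has fewer rows than the full list of shares.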

Hope I've answered your questions.

Best, Fabio

PiyushKyushu commented 2 years ago

Hi Fabio,

Thank you for the explanation. It is very helpful, and the answer certainly improves my understanding of ctshares and ct_shares_marked. The only thing I am still confused about is the URLs.

I collected Facebook posts from CrowdTangle (my_data_original.csv) = 14,656

Then I used the code below:

urls <- get_urls_from_ct_histdata(ct_histdata_csv = "my_data_original.csv")

This produced a list of URLs with dates = 7448

You wrote: "Did you use get_urls_from_ct_histdata to extract the URLs from CrowdTangle's CSV list of posts? If so, you should also have a number of URLs that you started from."

As you mentioned, this means that the URLs I got (7448) came from my dataset (my_data_original.csv).

Where does this number come from? What are these URLs? I mean, if I inspect my dataset manually, where (under which field) can I find them?

In other words, from which field does CooRnet extract the URLs? Is it URL, Link, or some other field?

I apologise if my question seems naive or insignificant.

Thank you for the cooperation.

fabiogiglietto commented 2 years ago

It's actually a good question about a poorly documented part of the package. get_urls_from_ct_histdata attempts to extract all the links from the CrowdTangle CSV of posts. The process is performed in lines 71 to 76 of the function code (https://github.com/fabiogiglietto/CooRnet/blob/master/R/get_urls_from_ct_histdata.R).

In other words:
1) It starts from Final Link.
2) If Final Link is empty, it takes whatever is available in Link.
3) If the post is marked as a re-share in Link Text, the Link field is used.
4) If the post is not a Link-type post, the function attempts to extract the link, if any, from the Message and Description fields.
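The four steps above can be sketched in base R on a toy data frame. This is not the package's code: the column names (Type, Final.Link, Link, Link.Text, Message, Description), the re-share marker string, the URL regex, and the choice to return NA for non-link posts without a URL in the text are all assumptions made for illustration.

```r
# Toy posts; column names and values are hypothetical.
posts <- data.frame(
  Type        = c("Link", "Link", "Link", "Photo", "Status"),
  Final.Link  = c("https://a.example/1", "", "https://fb.example/post/9", "", ""),
  Link        = c("https://short.example/x", "https://b.example/2",
                  "https://c.example/3", "https://fb.example/photo/7", ""),
  Link.Text   = c("", "", "This is a re-share of a post", "", ""),
  Message     = c("", "", "", "Read https://d.example/4 now", "no link here"),
  Description = c("", "", "", "", ""),
  stringsAsFactors = FALSE
)

reshare_marker <- "re-share"  # assumed marker text, for illustration only

is_blank <- function(x) is.na(x) || x == ""

# Return the first URL found in a single string, or NA if there is none.
extract_first_url <- function(text) {
  m <- regmatches(text, regexpr("https?://[^[:space:]]+", text))
  if (length(m) == 0) NA_character_ else m
}

pick_url <- function(type, final_link, link, link_text, message, description) {
  url <- final_link                                  # 1) start from Final Link
  if (is_blank(url)) url <- link                     # 2) fall back to Link
  if (grepl(reshare_marker, link_text, fixed = TRUE)) {
    url <- link                                      # 3) re-share: use Link
  }
  if (type != "Link") {                              # 4) not a link post:
    url <- extract_first_url(paste(message, description))  # look in the text
  }
  if (is_blank(url)) NA_character_ else url
}

urls <- mapply(pick_url, posts$Type, posts$Final.Link, posts$Link,
               posts$Link.Text, posts$Message, posts$Description,
               USE.NAMES = FALSE)
urls <- urls[!is.na(urls)]
urls
```

Posts that yield no URL drop out of the result, which is one reason the list of URLs can be shorter than the list of posts.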

Please also note that the CrowdTangle API link endpoint may return a post that includes multiple links. This means that you may end up with links referenced in ct_shares.df that are not in your original list. To avoid using these links, set the is_orig parameter to TRUE in get_coord_shares.
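As an illustration of what restricting to the original links means, here is a small base R sketch on toy data. It does not call CooRnet, the column names are made up, and the exact argument name to use in get_coord_shares should be checked against the package documentation.

```r
# URLs extracted from the original CSV (toy values).
original_urls <- c("https://a.example/1", "https://b.example/2")

# Toy shares: post 101 contains two links, only one of which was in the
# original list.
ct_shares <- data.frame(
  post_id = c(101, 101, 102, 103),
  url     = c("https://a.example/1", "https://z.example/extra",
              "https://b.example/2", "https://a.example/1"),
  stringsAsFactors = FALSE
)

# Flag rows whose URL was part of the original list, then keep only those.
ct_shares$is_orig <- ct_shares$url %in% original_urls
orig_only <- ct_shares[ct_shares$is_orig, ]

orig_only
```

The extra link carried by post 101 is flagged FALSE and left out, so the remaining rows refer only to URLs from the starting list.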

Best, Fabio

PiyushKyushu commented 2 years ago

Thank you so much for the explanation Fabio.

fabiogiglietto / CooRnet

Difference between original dataset, ctshares and ct_shares_marked? #27