Common-SenseMakers / sensemakers

Sensemakers infrastructure for developing AI-based tools for semantic annotations of social posts. Cross-poster app to publish your semantic posts on different networks.
GNU General Public License v3.0

[NLP] Filter bad URL parse #39

Closed ShaRefOh closed 4 months ago

ShaRefOh commented 5 months ago

The filter module failed to parse the following three posts:

  1. Good by @kevinroose (gift link):Reddit’s I.P.O. Is a Content Moderation Success Story https://www.nytimes.com/2024/03/21/technology/reddit-ipo-public-content-moderation.html?unlocked_article_code=1.eU0.1L9u.fopauO4TeQxO&smid=tw-share
  2. FYI: I've talked to several journalists over the past couple of days about TikTok (e.g. here's a video, enjoy my dinosaur sweater https://www.reuters.com/world/us/us-house-vote-tiktok-crackdown-fate-uncertain-senate-2024-03-12/) and would be very happy to talk to more.
  3. The proposed TikTok ban working its way through Congress would likely embolden authoritarian censorship abroad, experts warn, and shatter the United States’ reputation as an international champion of free speech https://www.nbcnews.com/tech/tech-news/tiktok-ban-embolden-authoritarian-censorship-experts-warn-rcna143476

ronentk commented 5 months ago

@ShaRefOh can you add the Mastodon URLs for them, please?

ShaRefOh commented 5 months ago

@ronentk I would, but something is wrong with the link. Perhaps it is because the id is an int and gets corrupted somewhere before it reaches wandb. I've seen this before with ArangoDB.

ronentk commented 5 months ago

Hmm, I think the dataset should have the Mastodon URLs as a column. You can do that, right? (e.g. regenerate the dataset with a new column.) Maybe that should be a new issue.

ShaRefOh commented 5 months ago

Yes, I tested it, and it seems the id does change for some reason. It should be stringified during dataset creation.
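
For reference, a minimal sketch of one way a large integer id can change in transit, and how stringifying it at dataset creation avoids that. The thread does not confirm this is the actual cause. The id value, the `make_record` helper, the instance name, and the `https://<instance>/@<user>/<id>` URL layout are all assumptions for illustration, not taken from the notebook.

```python
import json

# Hypothetical Mastodon-style status id (18 digits); not taken from the dataset.
status_id = 112233445566778899

# Any step that stores the id as a 64-bit float (common for numeric columns
# and for JSON numbers in some tools) cannot represent integers above 2**53
# exactly, so the value may come back different.
roundtrip = int(float(status_id))


def make_record(status_id: int, instance: str, username: str) -> dict:
    """Build a dataset row with the id kept as a string (hypothetical helper)."""
    sid = str(status_id)
    # Assumed URL layout; adjust to whatever the dataset creator really uses.
    return {"id": sid, "url": f"https://{instance}/@{username}/{sid}"}


record = make_record(status_id, "mastodon.example", "someuser")

# A string id survives serialization unchanged.
restored = json.loads(json.dumps(record))
```

Here `roundtrip` no longer equals `status_id`, while `restored["id"]` still matches the original digits, so a URL column built from the string id should stay valid downstream.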

ShaRefOh commented 4 months ago

The bad parse was introduced by the Mastodon dataset creator notebook.