Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.87k stars 451 forks

crowdsourcing Metadata #2455

Closed: synctext closed this issue 2 years ago

synctext commented 8 years ago

Allow any user to improve the metadata. Examples of existing approaches:

(three images)

Time slicing is too heavy for Tribler and is out of scope; this issue covers metadata only. (image)

lfdversluis commented 8 years ago

A wiki-like approach? What would be editable within and outside Tribler?

synctext commented 6 years ago

Refocus this issue on torrent channels based on magnet links with rich metadata. The thesis goal is to deploy creation and search of rich metadata. A more ambitious goal would be to integrate voting to obtain trustworthy metadata.

Each channel can only be modified by the channel owner. Users vote on the quality of channels, and quality emerges from the Tribler user collective. It is easy to copy an entire channel, re-use quality content from a channel, and re-mix channels, which leads indirectly to crowdsourcing. We specifically avoid the problems of edit wars, collaborative editing, and undoing changes. This makes for a realistic master's project. Only with an active community in place can we move to the next stage and experiment with more sophisticated crowdsourcing models.

The torrent channel is expanded with a new type: rich metadata channels. Channel owners indicate the content type of each magnet link. We still keep it simple: 1 magnet link has 1 rich metadata description. The first step is to create a simple editor inside Tribler that supports several content types, combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc. (three images) The next step is then to create a new rich metadata search community.

This master thesis puts the foundations in place for rich metadata. In the future we want to integrate collaborative tools. For instance, assuming scientific papers are available in Tribler, we could semi-automatically create survey papers. Still out of scope: (image)

synctext commented 6 years ago

Demonstrates the content type idea (radio button or drop-down): (image) Combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc. A lot of metadata frameworks exist; you could select this very old one. Keep the data model very simple, never more than 4 fields per content type. We see how users like it, then expand based on real-world feedback. Please don't try to get it right the first time!
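
A minimal sketch of what "one description per magnet link, at most 4 fields per content type" could look like in Python. The content types, field names and the `RichMetadata` class are hypothetical and only illustrate the constraint; nothing here is taken from the Tribler code base.

```python
from dataclasses import dataclass, field

MAX_FIELDS = 4  # "never more than 4 fields per content type"

# Hypothetical content types and field names, chosen only for illustration.
CONTENT_TYPES = {
    "music": ("title", "artist", "album", "year"),
    "movie": ("title", "director", "year", "language"),
    "scientific_article": ("title", "authors", "journal", "year"),
}


@dataclass
class RichMetadata:
    """One rich metadata description for exactly one magnet link."""

    magnet_link: str
    content_type: str
    fields: dict = field(default_factory=dict)

    def __post_init__(self):
        allowed = CONTENT_TYPES.get(self.content_type)
        if allowed is None:
            raise ValueError(f"unknown content type: {self.content_type}")
        if len(allowed) > MAX_FIELDS:
            raise ValueError("content type defines too many fields")
        unknown = set(self.fields) - set(allowed)
        if unknown:
            raise ValueError(
                f"unknown fields for {self.content_type}: {sorted(unknown)}"
            )
```

A radio button or drop-down in the editor would then only need to pick a key from `CONTENT_TYPES` and render its (at most four) fields.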

svanschooten commented 6 years ago

@devos50 Do you have time to walk through the stack this afternoon? I have a doctor's appointment in an hour, so I'll be at the lab after lunch. I've seen most of the code and looked into Qt last week, but it is still a bit fuzzy.

synctext commented 6 years ago

He is on vacation this week. But others should be able to help.

svanschooten commented 6 years ago

First step: adding a field containing a single metadata entry (content type in this case, based on the YouTube VideoCategories API), and starting to structure the metadata information. Started implementing a Metadata community; still a bit fuzzy on what is needed there. (image: tribler-content-types)

svanschooten commented 6 years ago

Got my community running, peers can exchange messages (directed and broadcast), next step is to define the behavior of the community and the message types that can be sent.

@synctext should I look at a more branching metadata structure such as here (inspired by YouTube, The Pirate Bay and Dublin Core)? The other option would be to flatten the structure and create a more generic metadata structure that does not follow the classic and modern frameworks. What would you advise?

Next step would be to structure the database and messages sent.

devos50 commented 6 years ago

@svanschooten I would advise keeping it as simple as possible for now and not going wild with many different metadata types or complicated structures yet.

Also, nice to see that you have a basic community up and running!

svanschooten commented 6 years ago

@devos50 welcome back! I agree; that is why I re-researched the desired structure. When I have something solid, I'll implement a data structure that I can store in the database (and also start implementing the distribution mechanics).

Due to a family crisis I have not been able to come to the lab Friday and today, but I have done some reading and thinking on the categorization issue: most content management systems use a tree-based structure to define archetypes, subtypes and properties, though I have come across some interesting work. Twitter has published a content categorization method that looks interesting, though it is not directly applicable to our case.

Based on these articles and papers I have opted to design a more 'flat' category structure, which I have documented on my repository.

synctext commented 6 years ago

1150 is about to start soon. First finish this quick MAX 4-week prototype, then think about how to build on top of scalable channels. When they're hopefully ready!

Write rich metadata on Trustchain? (It would then hold barter records, voting for channels, trading honesty, and metadata enrichment.) Then we have 4 contexts of reputation to merge somewhat. The next step is to remove all non-blockchain data sync mechanisms in Tribler... Remove all storage in Dispersy #2778, all communities, and replace it with IPv8-based Trustchain storage.

Keep it simple-and-get-it-running-first-you-stupid model: only channel owner can do metadata enrichment :-)

svanschooten commented 6 years ago

Fixed my packing issue, added category-based payload types and added them to the community communication. First working UI is done; now to couple the UI to the community. This includes basic parsing of datatypes based on regex (simplest method for now). Major overhaul to the ContentType and Category models to make them easily adaptable.

(image: metadata_v1) The fields are added dynamically, each with its accompanying parsing method, field name and label. These are based on the fields defined in the Category models.
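
A minimal sketch of regex-based field parsing, where each field carries its own parsing method, name and label. The `FieldSpec` tuple, `regex_parser` helper and `MUSIC_FIELDS` list are hypothetical; the actual ContentType and Category models are not shown in this thread.

```python
import re
from typing import Callable, NamedTuple


class FieldSpec(NamedTuple):
    name: str                       # key used in the metadata record
    label: str                      # text shown next to the input widget
    parse: Callable[[str], object]  # turns raw user input into a value


def regex_parser(pattern: str, convert: Callable[[str], object] = str):
    """Build a parser that accepts input only if it fully matches `pattern`."""
    compiled = re.compile(pattern)

    def parse(raw: str):
        text = raw.strip()
        if compiled.fullmatch(text) is None:
            raise ValueError(f"{text!r} does not match {pattern!r}")
        return convert(text)

    return parse


# Hypothetical category definition, for illustration only.
MUSIC_FIELDS = [
    FieldSpec("title", "Title", regex_parser(r".{1,30}")),
    FieldSpec("year", "Year", regex_parser(r"\d{4}", int)),
]


def parse_form(specs, raw_values):
    """Parse raw form input (name -> string) using each field's own parser."""
    return {spec.name: spec.parse(raw_values[spec.name]) for spec in specs}
```

A UI could loop over the spec list to create one labelled input per field, then call `parse_form` on submit.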

The only problem is that the community now can't discover the other peers, so the test script for the community does not receive anything... The test script checks every second whether the peer list is non-empty, but it stays empty. edit: stupid me used a loop in the reactor thread, derp...
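
The reactor-thread mistake is easy to reproduce in miniature. The sketch below uses `asyncio` from the standard library as a stand-in for the reactor (the actual test script is not shown in this thread, so `run`, `discover` and both check functions are hypothetical). A polling loop that never yields starves every other task on the same thread, so the peer list can never fill up while the loop is running.

```python
import asyncio
import time


async def run(check_peers, duration=0.3):
    """Run a discovery task next to a peer-list check; return what the check saw."""
    peers = []
    seen = []

    async def discover():
        # Stand-in for peer discovery: one peer shows up after 50 ms.
        await asyncio.sleep(0.05)
        peers.append("peer-1")

    task = asyncio.ensure_future(discover())
    await check_peers(peers, seen, duration)
    task.cancel()
    return seen


async def blocking_check(peers, seen, duration):
    # Bug pattern: time.sleep never yields to the event loop, so the
    # discovery task gets no chance to run while we are polling.
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        seen.append(len(peers))
        time.sleep(0.02)


async def cooperative_check(peers, seen, duration):
    # Fix: yield between polls so other tasks can make progress.
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        seen.append(len(peers))
        await asyncio.sleep(0.02)
```

With `blocking_check` every recorded peer count stays at zero; with `cooperative_check` the peer appears after the first few polls. In a Twisted-style reactor the equivalent fix is to schedule the periodic check on the reactor instead of looping inside it.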

svanschooten commented 6 years ago

Next (@synctext ??): writing metadata to persistence layer, more UI screens (only on torrent add for now) or better metadata models?

svanschooten commented 6 years ago

Looking at most metadata models, they approach it from an unstructured data angle. They usually have a (semi-)fixed tree structure for fields, but no simple and straightforward approach to storing it in a relational database:
- This paper uses a generic field implementation with a mapping algorithm.
- I also found this paper using old-school RDF.
- A patent that I cannot understand...
- An RFC-ish description of how Dublin Core was designed.
- These guys developed an XML storing mechanism.
- This pretty decent explanation of how you should see and organise metadata.

I do not want to introduce more dependencies, but I think a NoSQL storage method would be easiest? Or maybe something generic like:

MetadataTable:
- (id) ID
- (string) infohash
- (string) title
- (string) category
- (string) content type

FieldsTable:
- (id) ID
- (id) metadata ID
- (string) name
- (string) value
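
A minimal sketch of the two generic tables above using Python's built-in `sqlite3` module, so no extra dependency is needed. The SQL table and column names are lower-cased versions of the ones listed; the `store` and `load_fields` helpers are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE metadata (
    id INTEGER PRIMARY KEY,
    infohash TEXT NOT NULL,
    title TEXT,
    category TEXT,
    content_type TEXT
);
CREATE TABLE fields (
    id INTEGER PRIMARY KEY,
    metadata_id INTEGER NOT NULL REFERENCES metadata(id),
    name TEXT NOT NULL,
    value TEXT
);
"""


def store(conn, infohash, title, category, content_type, fields):
    """Insert one metadata row plus one `fields` row per extra name/value pair."""
    cur = conn.execute(
        "INSERT INTO metadata (infohash, title, category, content_type) "
        "VALUES (?, ?, ?, ?)",
        (infohash, title, category, content_type),
    )
    conn.executemany(
        "INSERT INTO fields (metadata_id, name, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, name, value) for name, value in fields.items()],
    )
    return cur.lastrowid


def load_fields(conn, metadata_id):
    """Return the extra fields of one metadata row as a dict."""
    rows = conn.execute(
        "SELECT name, value FROM fields WHERE metadata_id = ?", (metadata_id,)
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
row_id = store(conn, "aa" * 20, "Example", "audio", "music", {"artist": "Someone"})
```

The trade-off of this generic name/value layout is that every value is stored as a string and typed queries need extra work.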

synctext commented 6 years ago

MetadataTable:

  • (id) ID
  • (string) infohash

Why do you want to make the infohash more unique with an ID? :-)
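
A minimal sketch of that point using `sqlite3`: the infohash already identifies the torrent, so it can serve as the primary key directly, which also enforces "1 magnet link has 1 rich metadata description". The `put` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metadata (
        infohash TEXT PRIMARY KEY,  -- no separate surrogate ID column
        title TEXT,
        category TEXT,
        content_type TEXT
    )
    """
)


def put(conn, infohash, title, category, content_type):
    """Insert or overwrite the single description for this infohash."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
        (infohash, title, category, content_type),
    )


put(conn, "aa" * 20, "First title", "audio", "music")
put(conn, "aa" * 20, "Corrected title", "audio", "music")
```

After the second call the table still holds a single row for that infohash, now with the corrected title.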

Consider adopting the distribution of scientific works as the test community for your entire thesis (or something else additionally; http://bt.etree.org). Or create a tool and test how many hours it takes to put stuff like 400k scientific journals into your rich metadata, a.k.a. the Giga-Scraper idea. Next step: finish the prototype and create a .pdf seeding channel.

Just make a music table, movie, clip, series, vlog, ebook, adult entertainment, other images table etc. Keep it simple for your 4-week prototype. Try to remove the content and subtype construct: just 1 category level. Think ID3-simple, for instance: no fancy XML, NoSQL, or RDF. Just a fixed structure please, probably per content type. In 1996 Eric Kemp created ID3, the de facto framework for audio metadata. Strings are either space- or zero-padded. Unset string entries are filled using an empty string. ID3v1 is 128 bytes long. The table of fields is copied from Wikipedia:

| Field | Length | Description |
|---|---|---|
| header | 3 | "TAG" |
| title | 30 | 30 characters of the title |
| artist | 30 | 30 characters of the artist name |
| album | 30 | 30 characters of the album name |
| year | 4 | A four-digit year |
| comment | 28 or 30 | The comment. |
| zero-byte | 1 | If a track number is stored, this byte contains a binary 0. |
| track | 1 | The number of the track on the album, or 0. Invalid, if previous byte is not a binary 0. |
| genre | 1 | Index in a list of genres, or 255 |

ID3v1 pre-defines a set of genres denoted by numerical codes. Keeps it trivial...
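
To show how trivial such a fixed structure is to handle, here is a minimal sketch that parses the 128-byte layout from the table with Python's `struct` module. Decoding text as Latin-1 is an assumption (ID3v1 does not declare an encoding), and `parse_id3v1` is a hypothetical helper name.

```python
import struct

# header, title, artist, album, year, comment (incl. zero-byte + track), genre
ID3V1 = struct.Struct("3s30s30s30s4s30sB")  # 128 bytes in total


def _text(raw: bytes) -> str:
    # Strings are space- or zero-padded: cut at the first NUL, then trim spaces.
    return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()


def parse_id3v1(tag: bytes) -> dict:
    """Parse a 128-byte ID3v1 tag into a flat dict."""
    if len(tag) != ID3V1.size:
        raise ValueError("an ID3v1 tag is exactly 128 bytes")
    header, title, artist, album, year, comment, genre = ID3V1.unpack(tag)
    if header != b"TAG":
        raise ValueError("missing 'TAG' header")
    track = None
    if comment[28] == 0 and comment[29] != 0:
        # Zero-byte followed by a track number: the comment is only 28 bytes.
        track = comment[29]
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,  # index into the predefined genre list, or 255
    }
```

In an MP3 file the tag would be read from the last 128 bytes; a per-content-type table in Tribler could follow the same fixed-width, fixed-field idea.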

Future: #3484 After this 4-week prototype is completed, explore more advanced architecture. We are prototyping using our Trustchain idea as the only storage paradigm in Tribler. It would contain: bandwidth barter transactions, voting for channels, trading of bandwidth coins #3326. Additionally, possibly rich metadata of channels; this thesis. Warning: this idea for yet another Tribler overhaul would take years to complete and get stable!

svanschooten commented 6 years ago

The underlying data model has been simplified and abstracted more, to provide generic reading and setting handles. It is completely flat now. A basic database and repository implementation is also done, with both an in-memory and a persistent layer.
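
A minimal sketch of what a flat repository with generic get/set handles and both an in-memory and a persistent layer could look like. The `MetadataRepository` class is hypothetical and uses `sqlite3` for both layers; the actual implementation is not shown in this thread.

```python
import sqlite3


class MetadataRepository:
    """Flat name/value metadata per infohash, backed by SQLite."""

    def __init__(self, path=":memory:"):
        # ":memory:" gives the in-memory layer; a file path gives the persistent one.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata ("
            "infohash TEXT, name TEXT, value TEXT, "
            "PRIMARY KEY (infohash, name))"
        )

    def set(self, infohash, name, value):
        """Generic setter: store or overwrite one field."""
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                (infohash, name, value),
            )

    def get(self, infohash, name, default=None):
        """Generic getter: read one field, or `default` if it is unset."""
        row = self._conn.execute(
            "SELECT value FROM metadata WHERE infohash = ? AND name = ?",
            (infohash, name),
        ).fetchone()
        return default if row is None else row[0]
```

Tests can use `MetadataRepository()` for speed, while the client would pass a file path to keep the data across restarts.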

TODO:

svanschooten commented 6 years ago

Refined thesis subject: Searching in enriched metadata using deduplicated tag clouds.
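
A minimal sketch of one possible reading of "deduplicated tag clouds": spelling variants of a tag are mapped to one canonical key before counting. The normalization rules (case folding, accent stripping, punctuation to spaces) and both function names are assumptions made for illustration, not the thesis method.

```python
import re
import unicodedata
from collections import Counter


def normalize(tag: str) -> str:
    """Map spelling variants of a tag to one canonical key."""
    text = unicodedata.normalize("NFKD", tag)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def tag_cloud(tags):
    """Count tags after deduplication; the count would drive the cloud's font size."""
    counts = Counter()
    for tag in tags:
        key = normalize(tag)
        if key:  # skip tags that normalize to nothing
            counts[key] += 1
    return counts
```

For example, "Sci-Fi", "sci fi" and "SCI_FI" would all be counted under the single key "sci fi".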

image

devos50 commented 6 years ago

It might be helpful for you to sync with @xoriole; your ideas seem to overlap somewhat.

xoriole commented 6 years ago

@devos50 It was quite a long discussion. Lots of ideas floating. We'll see how the design materializes.

ichorid commented 3 years ago

related to #6217

synctext commented 2 years ago

Related: tagging and RSS historical analysis plus failings