synctext closed this issue 2 years ago
A wiki-like approach? What would be editable within and outside Tribler?
Refocus this issue on torrent channels based on magnet links with rich metadata. The thesis goal is to deploy creation and search of rich metadata. A more ambitious goal would be to integrate voting to obtain trustworthy metadata.
Each channel can only be modified by the channel owner. Users vote on the quality of channels and quality emerges from the Tribler user collective. It is easy to copy an entire channel, re-use quality content from a channel, and re-mix channels. This leads indirectly to crowdsourcing. We specifically avoid the problems of edit wars, collaborative editing, and undoing changes. This leads to a realistic master project. Only with an active community in place can we move to the next stage and experiment with more sophisticated crowdsourcing models.
The torrent channel is expanded with a new type: rich metadata channels. Channel owners indicate the content type of each magnet link. We still keep it simple: 1 magnet link has 1 rich metadata description. The first step is to create a simple editor inside Tribler that supports several content types, combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc. The next step is then creating a new rich metadata search community.
This master thesis puts the foundations in place for rich metadata. In the future we want to integrate collaborative tools. For instance, we assume scientific papers are available in Tribler and we can semi-automatically create survey papers. Still out of scope:
Demonstrates the content type idea (radio button or drop-down), combining enrichment of music, podcasts, movies, vloggers, scientific articles, etc. A lot of metadata frameworks exist; you could select this very old one. Keep the data model very simple, never more than 4 fields per content type. We see how users like it, then expand based on real-world feedback. Please don't try to get it right the first time!
@devos50 Do you have time to walk through the stack this afternoon? I have a doctor's appointment in an hour, so I'll be at the lab after lunch. I've seen most of the code, and looked into Qt last week, but it is still a bit fuzzy.
He is on vacation this week. But others should be able to help.
First step: adding a field containing a single metadata entry (content type in this case, based on the YouTube VideoCategories API), and starting to structure the metadata information. Started implementing a Metadata community, still a bit fuzzy on what is needed there.
Got my community running, peers can exchange messages (directed and broadcast), next step is to define the behavior of the community and the message types that can be sent.
@synctext should I look at a more branching metadata structure such as here (inspired by YouTube, Pirate Bay and Dublin Core)? The other option would be to flatten the structure and create a more generic metadata structure that does not follow the classic and modern frameworks. What would you advise?
Next step would be to structure the database and messages sent.
@svanschooten I would advise to keep it as simple as possible for now and not go wild with many different metadata types/complicated structures yet.
Also, nice to see that you have a basic community up and running!
@devos50 welcome back! I agree, that is why I re-researched the desired structure, when I have something solid I'll implement a data structure which I can store in the database (also start implementing the distribution mechanics).
Due to a family crisis I have not been able to come to the lab Friday and today, but I have done some reading and thinking on the categorization issue: most content management systems use a tree based structure to define archetypes, subtypes and properties, though I have come across some interesting work. Twitter has published a content categorization method that looks interesting, though it is not directly applicable to our case.
Based on these articles and papers I have opted to design a more 'flat' category structure, which I have documented on my repository.
Write rich metadata on Trustchain? (e.g. barter records, voting for channels, trading honesty, and metadata enrichment). Then we have 4 contexts of reputation to merge somewhat. Next step is to remove all non-blockchain data sync mechanisms in Tribler... Remove all storage in Dispersy #2778, all communities, and replace it with IPv8-based Trustchain storage.
Keep it simple-and-get-it-running-first-you-stupid model: only channel owner can do metadata enrichment :-)
Fixed my packing issue, added category based payload types and added them to the community communication. First working UI is done, now to couple the UI to the community: This includes basic parsing of datatypes based on regex (simplest method for now). Major overhaul to the ContentType and Category models to make them easily adaptable.
The fields are dynamically added with the accessory parsing method, field name and label. These are based on the fields defined in the Category models.
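A minimal sketch of how such category-driven fields with regex-based parsing could look. Everything here is hypothetical and not taken from the actual code: the `Field` class, the `CATEGORIES` mapping, the `parse_form` helper, and the example `music` fields and patterns are assumptions for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical field definition: name, UI label, and a regex-based parser."""
    name: str
    label: str
    pattern: str
    convert: Callable[[str], object] = str

    def parse(self, raw: str):
        # Validate the trimmed input against the regex, then convert it.
        text = raw.strip()
        if re.fullmatch(self.pattern, text) is None:
            raise ValueError(f"{self.label}: invalid value {raw!r}")
        return self.convert(text)


# Assumed example category; the real Category models may differ.
CATEGORIES: Dict[str, List[Field]] = {
    "music": [
        Field("title", "Title", r".{1,30}"),
        Field("artist", "Artist", r".{1,30}"),
        Field("year", "Year", r"\d{4}", int),
    ],
}


def parse_form(category: str, raw_values: Dict[str, str]) -> Dict[str, object]:
    """Parse raw UI input for every field defined by the category."""
    return {
        field.name: field.parse(raw_values.get(field.name, ""))
        for field in CATEGORIES[category]
    }
```

With this layout the UI only needs to iterate over `CATEGORIES[category]` to build its input widgets, so adding a field to a category would not require any UI changes.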
Only problem is that the community now can't discover the other peers, so the test script for the community does not receive anything... The test script checks every second whether the peer list is non-empty, but it stays empty. edit: stupid me used a loop in the reactor thread, derp...
Next (@synctext ??): writing metadata to persistence layer, more UI screens (only on torrent add for now) or better metadata models?
Looking at most metadata models, they approach it from an unstructured data angle. They usually have a (semi-)fixed tree structure for fields, but no simple and straightforward approach to storing it in a relational database: This paper uses a generic field implementation with a mapping algorithm. I also found this paper using old-school RDF. A patent that I cannot understand.... RFC-ish description of how Dublin Core was designed. These guys developed an XML storing mechanism. This pretty decent explanation on how you should see and organise metadata.
I do not want to introduce more dependencies, but I think a noSQL storage method would be easiest? Or maybe something generic like:
MetadataTable:
- (id) ID
- (string) infohash
- (string) title
- (string) category
- (string) content type
FieldsTable:
- (id) ID
- (id) metadata ID
- (string) name
- (string) value
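As a rough illustration, the two tables proposed above could be expressed with Python's built-in `sqlite3` module, which adds no new dependency. This is a sketch under assumptions: the snake_case table and column names, the `UNIQUE` constraint on `infohash`, and the helper functions `open_store`, `add_metadata` and `get_fields` are all hypothetical.

```python
import sqlite3

# Assumed schema mirroring the proposed MetadataTable and FieldsTable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metadata (
    id INTEGER PRIMARY KEY,
    infohash TEXT NOT NULL UNIQUE,  -- uniqueness is an assumption
    title TEXT,
    category TEXT,
    content_type TEXT
);
CREATE TABLE IF NOT EXISTS fields (
    id INTEGER PRIMARY KEY,
    metadata_id INTEGER NOT NULL REFERENCES metadata(id),
    name TEXT NOT NULL,
    value TEXT
);
"""


def open_store(path=":memory:"):
    """Open (or create) the store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_metadata(conn, infohash, title, category, content_type, fields):
    """Insert one metadata row plus its generic name/value fields."""
    cur = conn.execute(
        "INSERT INTO metadata (infohash, title, category, content_type) "
        "VALUES (?, ?, ?, ?)",
        (infohash, title, category, content_type),
    )
    metadata_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO fields (metadata_id, name, value) VALUES (?, ?, ?)",
        [(metadata_id, name, value) for name, value in fields.items()],
    )
    conn.commit()
    return metadata_id


def get_fields(conn, infohash):
    """Return the name/value fields stored for an infohash as a dict."""
    rows = conn.execute(
        "SELECT f.name, f.value FROM fields f "
        "JOIN metadata m ON m.id = f.metadata_id WHERE m.infohash = ?",
        (infohash,),
    )
    return dict(rows.fetchall())
```

The generic `fields` table keeps the schema stable when categories change, at the cost of storing every value as text.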
MetadataTable:
- (id) ID
- (string) infohash
Why do you want to make the infohash more unique with an ID? :-)
Consider adopting the distribution of scientific works as your test community for your entire thesis {or something else additionally; http://bt.etree.org}. Or create a tool and test how many hours it takes to put stuff like 400k scientific journals in your rich metadata, a.k.a. the Giga-Scraper idea. Next step: finish prototype and create a .pdf seeding channel.
Just make a music table, movie, clip, series, vlog, ebook, adult entertainment, other images table etc. Keep it simple for your 4-week prototype. Try to remove the content and subtype construct: just 1 category level. Simple like ID3, for instance: no fancy XML, NoSQL, or RDF. Just a fixed structure please, probably per content type. In 1996 Eric Kemp created ID3, the de facto framework for audio metadata. Strings are either space- or zero-padded. Unset string entries are filled using an empty string. ID3v1 is 128 bytes long. The table with fields is copied from Wikipedia.
Field | Length | Description |
---|---|---|
header | 3 | "TAG" |
title | 30 | 30 characters of the title |
artist | 30 | 30 characters of the artist name |
album | 30 | 30 characters of the album name |
year | 4 | A four-digit year |
comment | 28 or 30 | The comment. |
zero-byte | 1 | If a track number is stored, this byte contains a binary 0. |
track | 1 | The number of the track on the album, or 0. Invalid, if previous byte is not a binary 0. |
genre | 1 | Index in a list of genres, or 255 |
ID3v1 pre-defines a set of genres denoted by numerical codes. Keeps it trivial...
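To show how little code such a fixed structure needs, here is a minimal sketch that reads the 128-byte layout from the table above with Python's standard `struct` module. The names `ID3V1`, `_text` and `parse_id3v1` are hypothetical, and decoding as Latin-1 is an assumption, since ID3v1 does not specify a text encoding.

```python
import struct

# Layout from the table: "TAG", title, artist, album, year, comment, genre.
# 3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def _text(raw: bytes) -> str:
    """Drop zero or space padding from a fixed-width string field."""
    # Latin-1 is an assumption; the format does not define an encoding.
    return raw.split(b"\x00", 1)[0].decode("latin-1").rstrip()


def parse_id3v1(block: bytes) -> dict:
    """Parse one 128-byte ID3v1 tag into a plain dict."""
    if len(block) != ID3V1.size:
        raise ValueError("an ID3v1 tag is exactly 128 bytes")
    header, title, artist, album, year, comment, genre = ID3V1.unpack(block)
    if header != b"TAG":
        raise ValueError("missing 'TAG' header")
    # A zero byte at offset 28 of the comment means the next byte is the track.
    track = comment[29] if comment[28] == 0 and comment[29] != 0 else None
    if track is not None:
        comment = comment[:28]
    return {
        "title": _text(title),
        "artist": _text(artist),
        "album": _text(album),
        "year": _text(year),
        "comment": _text(comment),
        "track": track,
        "genre": genre,  # index into the predefined genre list, or 255
    }
```

In an audio file the tag occupies the last 128 bytes, so a caller would typically pass something like `data[-128:]`.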
Future: #3484 After this 4-week prototype is completed, explore more advanced architecture. We are prototyping using our Trustchain idea as the only storage paradigm in Tribler. It would contain: bandwidth barter transactions, voting for channels, trading of bandwidth coins #3326. Additionally, possibly rich metadata of channels; this thesis. Warning: this idea for yet another Tribler overhaul would take years to complete and get stable!
The underlying data model has been simplified and abstracted further, to provide generic handles for reading and setting values. It is completely flat now. A basic database and repository implementation is also done, with both an in-memory and a persistent layer.
TODO:
Refined thesis subject: Searching in enriched metadata using deduplicated tag clouds.
It might be helpful for you to sync with @xoriole, your ideas seem to overlap somewhat.
@devos50 It was quite a long discussion. Lots of ideas floating. We'll see how the design materializes.
related to #6217
Allow any user to improve the metadata. Examples of existing approaches:
Time slicing is too heavy for Tribler, out of scope. Just metadata.