Open kouloumos opened 2 weeks ago
@kouloumos for mapping terminology, let's say for Mailing Lists (bitcoin dev):
- Source would be like Bitcoin Dev
- Resource would be the first email (the beginning of the email thread)
- Item would be all the emails in that particular email thread (all emails under a common title/subject).

Right?
Context and Current Terminology
We currently use inconsistent terminology across different parts of our data infrastructure, especially in how we describe content from various sources (forums, mailing lists, transcripts, etc.). This inconsistency complicates understanding the system, maintaining the codebase, and onboarding new team members.
Issues with Current Terminology:
Proposal for Standardized Terminology
To reduce confusion and create a clearer, more maintainable system, we can use the following standardized terms for the structure of our data infrastructure.
Proposed Terminology:
Source:
Resource:
Item (Optional Chunking):
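To make the three proposed terms concrete, here is a minimal sketch of how they might nest. It assumes a Source contains Resources and that a Resource can optionally be split into Items. The class fields (`name`, `resource_id`, `title`, `item_id`, `body`) and the sample values are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """An optional chunk of a Resource (assumed here: one email or post)."""
    item_id: str
    body: str


@dataclass
class Resource:
    """A whole entity, such as a full thread, that can be referenced as a unit."""
    resource_id: str
    title: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Source:
    """Where content comes from, such as a mailing list or a forum."""
    name: str
    resources: List[Resource] = field(default_factory=list)


# Hypothetical sample data, not a real thread.
thread = Resource(
    resource_id="example-thread-1",
    title="Example subject",
    items=[
        Item(item_id="example-thread-1#1", body="first email"),
        Item(item_id="example-thread-1#2", body="a reply"),
    ],
)
mailing_list = Source(name="bitcoin-dev", resources=[thread])
```

With this shape, code that needs "the thread" can hold a `Resource` directly, and chunk-level code can iterate over `thread.items`.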
Why This Matters
Mapping the Terminology
To further illustrate the proposal, here’s how different sources will be structured under this new terminology:
Addressing Metadata and Resource Reference Issues
Metadata Inconsistencies:
We have inconsistencies in the way metadata is defined across scrapers, particularly in the `type` and `thread_url` fields:
- `type` field:
  - `type="topic"` if message_number == #1, else `type="post"`
  - `type="topic"` for documents in `_topics/en`, `type="post"` for documents in `_posts/en`
  - `type="answer"` or `type="question"`
  - `type="original_post"` or `type="reply"`
  - `type` used
- `thread_url` field:

To resolve this, we will adopt consistent metadata field definitions across all sources, ensuring that a `type` field and a consistent way to reference resources are applied uniformly.
Resource Reference Problem:
One key issue is the lack of a clear way to refer to a Resource. Currently, we treat the first post or element of a resource (e.g., the first post in a thread) as a reference point, but this leads to problems. For instance:
By defining a Resource as the main reference point, we create a clear structure for referring to the full entity (e.g., the thread as a whole, rather than just its first post). This approach allows us to handle ranking algorithms, mappings across resources, and summaries more effectively, without the added complexity of treating parts of a resource as a substitute for the whole.
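A minimal sketch of both ideas, under stated assumptions: the `type` values seen across scrapers are mapped to one shared vocabulary, and documents are grouped under a resource-level key instead of being represented by their first post. The canonical labels, the function names, and the choice of `thread_url` as the resource key are all hypothetical. The same raw label may mean different things in different sources, so a real mapping might need to be defined per source.

```python
from collections import defaultdict

# Hypothetical mapping from the raw type values listed earlier to one vocabulary.
CANONICAL_TYPE = {
    "topic": "original_post",
    "question": "original_post",
    "original_post": "original_post",
    "post": "reply",
    "answer": "reply",
    "reply": "reply",
}


def normalize_type(raw_type):
    """Return the canonical type, or None when the scraper set no type."""
    if raw_type is None:
        return None
    # A KeyError here flags a type value that the mapping does not cover yet.
    return CANONICAL_TYPE[raw_type]


def group_by_resource(documents):
    """Group documents by a resource-level key (assumed here: thread_url)."""
    resources = defaultdict(list)
    for doc in documents:
        normalized = dict(doc, type=normalize_type(doc.get("type")))
        resources[doc["thread_url"]].append(normalized)
    return dict(resources)


# Hypothetical documents; the URLs are placeholders.
docs = [
    {"thread_url": "https://example.invalid/t/1", "type": "topic"},
    {"thread_url": "https://example.invalid/t/1", "type": "post"},
    {"thread_url": "https://example.invalid/t/2", "type": "question"},
]
grouped = group_by_resource(docs)
```

Ranking or summarization code could then take one entry of `grouped` as its input, so the thread as a whole is the unit being referenced.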
Next Steps
`type` and `thread_url`.