buckket / twtxt

Decentralised, minimalist microblogging service for hackers.
http://twtxt.readthedocs.org/en/stable/
MIT License

How to store metadata about a feed #48

Open Benaiah opened 8 years ago

Benaiah commented 8 years ago

A number of different issues and ideas have made clear the need for a place to specify metadata about a twtxt.txt feed. For instance, essentially every idea for notifications so far needs to know where the notifications should go (technical details vary based on the proposal). The question then is how to store metadata.

Discussion in #22 has suggested a general comment character, thus allowing clients to handle individually how the metadata would be stored. I suggest building on this, allowing for general comments, but making the following format specifically for metadata:

# this is a regular comment

# the next line is a metadata entry
# nick = benaiah

This echoes the .ini format of the twtxt config file, which I think gives it a nice consistency.

The other main suggestion for metadata is to have another file. I dislike this approach because it complicates the protocol, significantly increases how much twtxt has to hit the network, and requires either a second URL for each person (for the metadata file), switching twtxt.txt to hold metadata and having another file hold the feed, or putting a metadata entry in twtxt.txt that points to the metadata file.
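To make the `# key = value` proposal above concrete, here is a minimal sketch of how a client might read it. The function name `parse_feed` and the rule that a metadata key must be a single word are assumptions made for illustration; nothing in the thread fixes them.

```python
def parse_feed(text):
    """Split a twtxt feed into metadata, plain comments, and tweet lines."""
    metadata, comments, tweets = {}, [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            body = stripped[1:].strip()
            key, sep, value = body.partition("=")
            key = key.strip()
            # Assumption: "# key = value" is metadata only if the key is one word
            if sep and key and " " not in key:
                metadata[key] = value.strip()
            else:
                comments.append(body)
        else:
            tweets.append(line)
    return metadata, comments, tweets


feed = """# this is a regular comment
# the next line is a metadata entry
# nick = benaiah
2016-03-06T23:23:23Z\tHello twtxt
"""
metadata, comments, tweets = parse_feed(feed)
```

A client that knows nothing about metadata can still skip every line starting with `#`, which is what keeps this variant compatible with the general comment idea from #22.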

tedder commented 8 years ago

pros of second file:

One file has the advantage of showing metadata that changes, for instance "added new profile pic on date" or "followed @ on date", if we use syntax that is similar to the non-commented version:

# [date] \t key = value

reednj commented 8 years ago

I like the idea of doing this in comments at the top of the file. I think the advantages of having everything in the same file outweigh any added complexity when swapping out the files if they get too big or whatever.

However, I think we would quickly hit limitations with a simple key-value system - how would you easily store a list of follows with this, for example?

A good format could be yaml, I think. It's human-readable and writable, and widely supported - we would just need to strip out the comment character at the start of each line before parsing it.

I imagine the header for twtxt would then look something like this:

# the three dashes indicate the start of the data block, so we know where
# to start converting to yaml
# ---
# username: reednj
# following: 
#  - buckket http://buckket.org/twtxt.txt
#  - xena https://xena.greedo.xeserv.us/files/xena.txt
#  - whatever http://whatever.com/twtxt.txt

Edit: somehow forgot to add the urls to the user list...
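The "strip out the comment character" step described above could look roughly like this. It is only a sketch: the name `extract_yaml_header` is made up, and it assumes the header is one unbroken run of `#` lines at the very top of the file, with `# ---` marking where the YAML starts.

```python
def extract_yaml_header(text):
    """Return the YAML text hidden in the leading comment block, or "" if none."""
    yaml_lines = []
    in_block = False
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        # Drop the "#" and at most one following space, keeping YAML indentation
        body = line[1:]
        if body.startswith(" "):
            body = body[1:]
        if not in_block:
            in_block = body.strip() == "---"
            continue
        yaml_lines.append(body)
    return "\n".join(yaml_lines)


feed = """# the three dashes indicate the start of the data block
# ---
# username: reednj
# following:
#  - buckket http://buckket.org/twtxt.txt
2016-03-06T23:23:23Z\tfirst post
"""
header = extract_yaml_header(feed)
```

The returned string can then be handed to any YAML parser (for example `yaml.safe_load` from PyYAML, if that dependency is acceptable), which is the part a line-by-line client cannot do on its own.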

erlehmann commented 8 years ago

@tedder consider that if you use a separate file for metadata and it also supports including messages you quickly obsolete the twtxt format as the syndication format of choice. Every client will just use the format that provides more data. Thus, the original twtxt format would be mainly useful as input (like Markdown or ReStructured Text) for scripts generating feeds.

tedder commented 8 years ago

@erlehmann I said nothing about including messages in a second file.

@reednj I like the idea of metadata at the top, instead of happening anywhere in twtxt. I (personally) like yml, it's extensible in cases like this.

erlehmann commented 8 years ago

@tedder to demonstrate: Yeah, you do not have to include messages. But any format that is powerful enough to include the metadata can be utilized for that and then you are back at using a single file. I have written a small shell script that converts a twtxt feed to the format described in RFC 4287, which describes how to convey author name/email, contributor name/email, the time of publication and the last update for a document. Since RFC 4287 also describes how to include messages, I just included them!

Here is the input file: http://daten.dieweltistgarnichtso.net/tmp/docs/twtxt.txt Here is the output file: http://daten.dieweltistgarnichtso.net/tmp/docs/twtxt.xml
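For readers who want a feel for such a conversion without fetching the files, here is a small Python sketch. It is not the shell script mentioned above; the function name, the feed title text, and the way entry ids are built from the feed URL plus the timestamp are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def twtxt_to_atom(text, feed_url, nick):
    """Build a minimal Atom document from tab-separated twtxt lines."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")

    def add(parent, tag, value=None):
        child = ET.SubElement(parent, f"{{{ATOM_NS}}}{tag}")
        child.text = value
        return child

    entries = []
    for line in text.splitlines():
        timestamp, sep, message = line.partition("\t")
        if sep and not line.startswith("#"):
            entries.append((timestamp, message))

    add(feed, "id", feed_url)
    add(feed, "title", f"twtxt feed of {nick}")
    # Assumption: all timestamps share one UTC format, so they compare as strings
    newest = max((t for t, _ in entries), default="1970-01-01T00:00:00Z")
    add(feed, "updated", newest)
    add(add(feed, "author"), "name", nick)

    for timestamp, message in entries:
        entry = add(feed, "entry")
        add(entry, "id", f"{feed_url}#{timestamp}")  # hypothetical id scheme
        add(entry, "title", message)
        add(entry, "updated", timestamp)
    return ET.tostring(feed, encoding="unicode")


xml = twtxt_to_atom(
    "2016-03-06T23:23:23Z\thello\n2016-03-07T10:00:00Z\tworld\n",
    "http://example.org/twtxt.txt",
    "example",
)
```

Note that the author name and feed URL have to be passed in from outside, which illustrates the point of this issue: the plain twtxt file has no place to carry them.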

erlehmann commented 8 years ago

@reednj RFC 5005 describes a mechanism to link together several physical documents that form one logical document. It is not that hard it seems, as long as the first document contains the metadata about the aggregate.

erlehmann commented 8 years ago

@reednj I see a problem with your example as it does not give URLs in the source, only nicknames. In reality, you would need the URL.

erlehmann commented 8 years ago

@reednj I am not familiar with yaml. How can you do namespaces in yaml? As far as I see, you would need namespacing for forwards compatibility.

reednj commented 8 years ago

So sounds like commented YAML could be the way to go? I wonder if @buckket has an opinion?

Also, please no namespaces, that is the very definition of YAGNI

erlehmann commented 8 years ago

@reednj could you explain how a format can be extensible if you do not have namespaces without basically ignoring everything in the file that is not in the default namespace? Or is the metadata format you envision a fixed format without any additional semantics, ever?

otherjoel commented 8 years ago

Personally I would love to see twtxt either commit to a truly minimalist “no metadata” stance, or simply use Atom as the default format in a single file. Atom has everything you need. It is not the most terse file format; the existing twtxt format is the most terse if that’s what you’re shooting for. But as soon as we start trying to approximate feature-parity with Twitter, it’s likely we’ll just end up reinventing Atom/RSS poorly. Atom is human-readable, it’s a truly well-made and well-defined standard, there’s widespread support for it.

reednj commented 8 years ago

You can have meta data about the user at the top of the file, without having any meta data about the messages, which is basically what I'm pushing for.

I don't think we can or should or need to compete with twitter. The appeal of twtxt is its simplicity, and xml is the opposite of that in every way.

mkody commented 8 years ago

I second @reednj.

twtxt is a decentralised, minimalist microblogging service for hackers.

The minimalist part here needs to stay. The fact that we can use only one (or two soon?) lines for each tweet makes it simple and clear to use.

Benaiah commented 8 years ago

You can have meta data about the user at the top of the file, without having any meta data about the messages, which is basically what I'm pushing for.

I agree - we need user data for any sort of network propagation, but the messages themselves should remain as ephemeral and simple as they are currently. I think you hit the nail on the head.

erlehmann commented 8 years ago

@mkody as I said, twtxt can be an input format for an already existing representation, like Markdown. Try http://news.dieweltistgarnichtso.net/bin/twtxt2atom out and you might see what I am proposing.

@Benaiah what is “network propagation” ?

otherjoel commented 8 years ago

the messages themselves should remain as ephemeral and simple as they are currently

So to be clear, official support for things like replies to chain messages together in conversations is absolutely off the table? If so, then that feels consistent and I can dig it.

mkody commented 8 years ago

@erlehmann So you mean that we could keep the twtxt file and make an atom feed from it? For the atom to have some sort of metadata, it means that our input (the twtxt file) should have them somewhere too. That feels redundant to use two files for the same purpose. And convert the file every time.

DracoBlue commented 8 years ago

I like the way @reednj posted!

Advantages:

I really like atom and especially atom sync protocol, but twtxt's simplicity and posting to your feed as simple as TIMESTAMP\tmessage is what makes it a very nice format to host on whatever webspace and post it with whatever client you have.

Everything we add with # like I suggested in #22 is an extra and should not be mandatory. Even though having yaml in twtxt like @reednj posted, could make the config file nearly unnecessary ;).

mdom commented 8 years ago

After thinking about this topic for a few days, I'm sure benaiah's first suggestion would be a very good fit for twtxt. If we just use comments like

# follow david http://example.org/david.txt
# unfollow http://example.org/user.txt
# nick mdom
# twturl http://example.org/user.txt 

somewhere in the file, it would be very easy even for the most simple client to read and write metadata in the feed. Whereas with things like yaml or ini you couldn't just read the file line by line and you probably need a parser to do the work. And this format would also allow you to record who you once followed or your old twturl if somebody needs that. And for the argument about needing to parse the whole twtfile just to get the metadata: We currently are parsing the complete file every time to build the timeline so i'm not sure if this is even an issue.

I have the strong feeling we should just use the easiest and most minimal solution one can think of. I mean, that's what twtxt is all about, right? :)
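The log-style reading described above fits in a few lines. This is a sketch under assumptions: the function name `replay_metadata` and the shape of the returned dictionary are invented here, and unknown `#` lines are simply treated as ordinary comments.

```python
def replay_metadata(text):
    """Replay '# command args' lines in file order; later lines win."""
    state = {"nick": None, "twturl": None, "following": {}}
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        parts = line[1:].split()
        if not parts:
            continue
        command, args = parts[0], parts[1:]
        if command == "follow" and len(args) == 2:
            nick, url = args
            state["following"][url] = nick
        elif command == "unfollow" and len(args) == 1:
            state["following"].pop(args[0], None)
        elif command in ("nick", "twturl") and len(args) == 1:
            state[command] = args[0]
        # anything else is an ordinary comment and is ignored
    return state


feed = """# follow david http://example.org/david.txt
# follow user http://example.org/user.txt
# unfollow http://example.org/user.txt
# nick mdom
# twturl http://example.org/user.txt
"""
state = replay_metadata(feed)
```

Because the lines are replayed top to bottom, the file keeps the full history (who was once followed, old nicks) while the final state only reflects the latest entries.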

archusr commented 8 years ago

mdom's suggestion sounds very reasonable. I also like the log style approach therein.

mdom commented 8 years ago

We talked a little about it on irc, and we would also propose to add a timestamp to the comment, so the client can reorder metadata as it sees fit. Some would leave it interspersed in the file and others could move metadata to the top of the file.

archusr commented 8 years ago

to still allow for simple sorting by timestamps, irc style commands could be an alternative to # comments:

# 2016-03-06T23:23:23Z  follow user https://example.org/user/twtxt.txt
2016-03-06T23:23:23Z    /follow user https://example.org/user/twtxt.txt

Lymkwi commented 8 years ago

to still allow for simple sorting by timestamps, irc style commands could be an alternative to # comments

Then tweets cannot start with a '/' (0x2F) character anymore. I don't think it's that much of a bother compared to what metadata storage can do, and I assume it's easier to parse than having to determine that the first character is a '#' and parse date and metadata altogether. Here you can just parse things naturally using the existing methods, and if the first character of the message is a '/', then store that line as metadata, not a tweet. I was wondering when I started thinking of storing metadata: where would we store them once they're downloaded? Of course I thought of the Cache, but it isn't very generic, it was designed to store tweets, and adding metadata management to it requires some twisting of its current methods...
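A sketch of the parsing step described above: split on the tab as usual, then route anything whose message starts with '/' into a separate list. The function name `split_feed` is hypothetical, and keeping metadata in its own list (rather than in the tweet cache) is just one possible answer to the storage question.

```python
def split_feed(text):
    """Return (tweets, metadata) as lists of (timestamp, payload) pairs."""
    tweets, metadata = [], []
    for line in text.splitlines():
        timestamp, sep, message = line.partition("\t")
        if not sep:
            continue  # not a "TIMESTAMP<TAB>message" line
        if message.startswith("/"):
            # Store the command without its leading slash
            metadata.append((timestamp, message[1:]))
        else:
            tweets.append((timestamp, message))
    return tweets, metadata


feed = (
    "2016-03-06T23:23:23Z\t/follow user https://example.org/user/twtxt.txt\n"
    "2016-03-06T23:24:00Z\tjust a normal tweet\n"
)
tweets, metadata = split_feed(feed)
```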

mdom commented 8 years ago

Though i still prefer the lines starting with comments, this would also be a fine choice. It's a good point that you wouldn't have to add special syntax. But i wonder how often users want to start tweets with /me or path names and then you need some kind of escaping mechanism... :/

otherjoel commented 8 years ago

If this is the approach it would be better to use some uncommon unicode character (e.g. or ) instead of a slash.

Benaiah commented 8 years ago

Maybe a vertical tab would work :P

On Mon, Mar 7, 2016 at 1:57 PM -0800, "Joel Dueck" <notifications@github.com> wrote:

If this is the approach it would be better to use some uncommon unicode character (e.g. ? or ?http://www.fileformat.info/info/unicode/char/261e/index.htm) instead of a slash.


mdom commented 8 years ago

Maybe we can use C99 oneline comment syntax. Using // would be visibly distinctive, shouldn't be that common in normal tweets and it feels like a rather nice fit for a service for hackers.

DracoBlue commented 8 years ago

One could also use a twtxt tweet (but autogenerated):

/me is following @<dracoblue https://dracoblue.net/twtxt.txt>

and parse this on client side.

But for general meta, like the preferred nickname, real meta data without a timestamp would be more useful.

DracoBlue commented 8 years ago

After rereading the entire issue:

I think:

TIMESTAMP\t/ACTION parameters

where ACTION is something like "follow", "unfollow" or whatever, is the best way. And it is up to the creator of the twtxt to keep the "important" meta data (like nick) within the file, if older tweets are removed.

The nice thing about this is: clients can implement /follow dracoblue https://dracoblue.net/twtxt.txt as a normal command and can format it when it gets printed (e.g. "is following @dracoblue" or "changed nickname to @dracoblue").

And the best: it is 100% backwards compatible.
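A rough sketch of how a client could format such lines when printing a timeline. The function name `render`, the action name `nick`, and the exact output strings are assumptions; only `follow` and the example phrases come from the comment above.

```python
def render(owner, line):
    """Format one 'TIMESTAMP<TAB>message' line for display in a timeline."""
    timestamp, _, message = line.partition("\t")
    if message.startswith("/"):
        words = message[1:].split()
        if words:
            action, params = words[0], words[1:]
            if action == "follow" and len(params) == 2:
                return f"{timestamp} @{owner} is following @{params[0]}"
            if action == "nick" and len(params) == 1:
                return f"{timestamp} @{owner} changed nickname to @{params[0]}"
    # Plain tweets (and anything not recognised) are shown unchanged
    return f"{timestamp} @{owner}: {message}"


shown = render("example", "2016-03-10T12:00:00Z\t/follow dracoblue https://dracoblue.net/twtxt.txt")
plain = render("example", "2016-03-10T12:01:00Z\thello world")
```

The last `return` is where the backwards compatibility comes from: a client without any action support takes that path for every line and simply prints `/follow ...` as a normal tweet.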

archusr commented 8 years ago

thanks for picking that up, these three variants are equally appealing to me:

timestamp     /action parameters
timestamp     // action parameters
timestamp     # action parameters

DracoBlue commented 8 years ago

Having # is difficult if you still want to allow #hashtags at the beginning of a tweet.

mdom commented 8 years ago

I wouldn't separate metadata into two different categories. This makes client side parsing just harder. There's no reason to hide that you changed your nick and even when you can just delete your old nick statement (although everybody else would still know). If we really want to use # we could just double it. But after a few days of using // it feels very natural to use it and to visually skip it if i see it in a feed. But i'm fine with any syntax as long as a decision is made... :)

quite commented 8 years ago

Also, double slash // doesn't collide with people casually using irc-style /me, and such...

DracoBlue commented 8 years ago

I think also /me is possible, if a client only interprets those actions, which it is aware of, but prints those, which it can't handle.

I would also like IRC style in twtxt clients for /me:

@dracoblue likes this discussion

as a result of /me likes this discussion.

PS: same thing for /dnd and /away.

mdom commented 8 years ago

On Sun, Mar 13, 2016 at 07:18:22AM -0700, DracoBlue wrote:

I think also /me is possible, if a client only interprets those actions, which it is aware of, but prints those, which it can't handle.

Wouldn't that be weird for clients that process no metadata? They have to show all metadata and clutter their users' timeline. If we have a clear syntax that distinguishes comments from normal tweets they can just hide all metadata. I know, we won't be able to find a syntax that will never clash with normal text messages, but i think we should try to find one that's less likely to occur.

I would also like IRC style in twtxt clients for /me: @dracoblue likes this discussion as a result of /me likes this discussion.

But i like this! I'll try to implement this in txtnix tonight. Sounds like a fun option.

DracoBlue commented 8 years ago

They have to show all metadata and clutter their users' timeline.

I think there won't be that much "meta spam". Nick or twtxt url changes happen very seldom, same for following/unfollowing other twtxteers ;). If you compare that to twtxtlist's or directory's updates, it's VERY seldom ;).

adiabatic commented 8 years ago

I could see myself or another person writing a post that simply copies a particularly interesting code comment, with the // or # left in, à la:

// drunk — fix later

You are not expected to understand this.

"You can't start your posts a certain way" sounds like a needless source of user confusion, both for new users and old users who can't remember the rules for something they do infrequently (like start a post with // or #).

archusr commented 8 years ago

We could define one reserved word, as in:

timestamp     /twtxt action parameters

DracoBlue commented 8 years ago

If we take IRC, you cannot start your text with a slash, too.

If we need the date of the action, putting it into a normal message and prefixing it with / will work (with the drawbacks mentioned).

If we don't need the timestamp, there is no real reason to integrate it as some kind of special message. So we are at:

#nick dracoblue
TIMESTAMP\tmy post

again ;).

Since I really want to have metadata in the twtxt, to finish the persistent storage for https://web.twtxt.org - it would be good to have a decision on this. /cc @buckket

mdom commented 8 years ago

I would really like to have a defined order of metadata. For example it would be really useful for follow/unfollow commands, or you can define multiple twturls and the last should be used for fetching but the other urls could still be used for collapsing mentions etc. # timestamp nick dracoblue again? But i feel we now have iterated through all possible ways to define metadata multiple times ... :)
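To illustrate why a defined order matters for the twturl case above, here is a small sketch. The tuple layout `(timestamp, command, argument)` and the function name `resolve_twturls` are assumptions; how the metadata lines get parsed into such tuples depends on whichever syntax is chosen.

```python
def resolve_twturls(entries):
    """entries: (timestamp, command, argument) tuples taken from a feed's metadata."""
    # Assumption: all timestamps share one UTC format, so they sort correctly as strings
    ordered = sorted(entries, key=lambda entry: entry[0])
    urls = [arg for _, command, arg in ordered if command == "twturl"]
    if not urls:
        return None, set()
    fetch_url = urls[-1]               # the most recent twturl is used for fetching
    aliases = set(urls) - {fetch_url}  # older ones still identify the same person
    return fetch_url, aliases


entries = [
    ("2016-03-10T00:00:00Z", "twturl", "http://example.org/new.txt"),
    ("2016-01-01T00:00:00Z", "twturl", "http://example.org/old.txt"),
    ("2016-02-01T00:00:00Z", "nick", "mdom"),
]
fetch_url, aliases = resolve_twturls(entries)
```

Without timestamps (or at least a guaranteed file order) there would be no way to tell which of the two URLs is the current one.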

adiabatic commented 8 years ago

If we take IRC, you cannot start your text with a slash, too.

Most mature IRC clients have a way of sending something that starts with a slash to a channel, whether by making the user write two slashes, press control-enter, or write /msg #twtxt /me is the command we're using.

What about

TIMESTAMP action

vs.

TIMESTAMP\tpost

to distinguish actions from posts? Namely, actions and metadata start with a space, while posts start with a tab.

adiabatic commented 8 years ago

More ideas on TIMESTAMP action (as opposed to TIMESTAMP\tpost):

For a belt-and-suspenders approach, one could do

2016-03-17T21:16:56Z /PREFERREDNICK katabatic

That is, posts match "{}\t{}" whereas actions match "{} /{}" (in Python str.format() minilanguage)
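The two patterns above translate into regular expressions fairly directly. This is a sketch only; the names `POST`, `ACTION` and `classify` are invented, and it assumes the timestamp itself never contains whitespace.

```python
import re

# Posts: TIMESTAMP, one tab, then the message
POST = re.compile(r"^(\S+)\t(.*)$")
# Actions: TIMESTAMP, one space, a slash, the action name, optional parameters
ACTION = re.compile(r"^(\S+) /(\S+)(?: (.*))?$")


def classify(line):
    """Return a tuple tagged 'post', 'action', or 'unknown'."""
    match = POST.match(line)
    if match:
        return ("post", match.group(1), match.group(2))
    match = ACTION.match(line)
    if match:
        return ("action", match.group(1), match.group(2), match.group(3) or "")
    return ("unknown", line)


as_action = classify("2016-03-17T21:16:56Z /PREFERREDNICK katabatic")
as_post = classify("2016-03-17T21:16:56Z\t/PREFERREDNICK katabatic")
```

The second call shows the property being argued for: a post that happens to begin with a slash stays a post, because only the separator decides.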

archusr commented 8 years ago

In the above comments are examples of lines to be parsed as ...

(0) timestamp /action parameters
(1) # timestamp     action parameters
(2) timestamp       /action parameters
(3) timestamp       // action parameters
(4) timestamp       # action parameters
(5) timestamp       /twtxt action parameters
(6) timestamp       /me likes this discussion
(7) timestamp       // drunk — fix later
(8) timestamp       # You are not expected to understand this.
(9) timestamp#action parameters

Looking at these, it seems we could/should identify metadata as 0, 2 or 5, with 5 being most strict? // edited to add 9
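To compare the three candidates (0, 2 and 5) side by side, a sketch like the following may help. The function name `is_metadata` and the `style` parameter are made up; the numbers refer to the list above.

```python
def is_metadata(line, style):
    """Check a line against option 0, 2, or 5 from the list above."""
    if style == 0:
        # (0) a space, not a tab, separates the timestamp from "/action"
        timestamp, sep, rest = line.partition(" ")
        return bool(sep) and "\t" not in timestamp and rest.startswith("/")
    timestamp, sep, rest = line.partition("\t")
    if not sep:
        return False
    if style == 2:
        # (2) any tab-separated message starting with a slash
        return rest.startswith("/")
    if style == 5:
        # (5) only messages starting with the reserved word "/twtxt"
        return rest.startswith("/twtxt ")
    raise ValueError("style must be 0, 2 or 5")


me_line = "2016-03-13T14:18:22Z\t/me likes this discussion"
me_under_2 = is_metadata(me_line, 2)
me_under_5 = is_metadata(me_line, 5)
```

Under option 2 an IRC-style `/me` line is swallowed as metadata, while under option 5 it stays an ordinary tweet, which is what "most strict" means in practice.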

DracoBlue commented 8 years ago

@archusr thanks for summarizing!

I think (2) and (5) are good ways, too.

I implemented (2) in https://web.twtxt.org (and changed my https://dracoblue.net/twtxt.txt accordingly) but it is not a big deal to change it to (5).

@buckket what do you think?

mdom commented 8 years ago

If we're leaning to option two or five, i would prefer 5 as we wouldn't have to code special cases to prevent /me from disappearing. I'll change txtnix accordingly. @quite, @DracoBlue would you change your clients too? Can maybe somebody with more python chops add it to twtxt and send a PR?

adiabatic commented 8 years ago

@DracoBlue What do you like about 2 and 5 that you don't like about 0? Because it uses a space instead of a tab, there's no way for a user to accidentally make an action that was supposed to be a post — and I like that.

mdom commented 8 years ago

Overloading of whitespace is fragile. Look at make. I would even argue, that twtxt shouldn't care what kind and what amount of whitespace is between timestamp and text. Think about all the editors that are autoconverting tabs to spaces. But that's probably an issue for another time... :)
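A whitespace-tolerant line parser of the kind suggested above could be as small as this. It is a sketch, not how any existing client is known to behave; `parse_line` is a hypothetical name.

```python
def parse_line(line):
    """Split on the first run of whitespace, whatever kind or length it is."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        return None  # a timestamp without text, or an empty line
    return parts[0], parts[1]


with_tab = parse_line("2016-03-17T21:16:56Z\thello world")
with_spaces = parse_line("2016-03-17T21:16:56Z    hello world")
```

Both calls give the same result, so a feed survives an editor that converts tabs to spaces. The price is that the separator can no longer carry meaning, which rules out telling actions from posts by space versus tab.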

DracoBlue commented 8 years ago

@mdom Yep!

TIMESTAMP#action param

would be more explicit.

Actually 2+5 would be compatible with current clients.

So we implement

TIMESTAMP\t/twtxt action param1

In the alternative clients and somebody with python skills adds it with a PR to the official client?

adiabatic commented 8 years ago

@mdom Makes sense. If you hate

TIMESTAMP /… …

then I'd suggest

#TIMESTAMP\taction

because there's still no way to accidentally make an action.

We could, of course, have one before-the-timestamp marker for actions and another before-the-timestamp marker for comments.

adiabatic commented 8 years ago

TIMESTAMP#action

would be great. Are we sure we want to standardize on 2 or 5 for the backwards-compatibility concerns of three clients and six users, all of which can probably be updated in two hours total?