Problem with dates with Reeder / greader API ?

Problem with dates with Reeder / greader API ? #2759

Closed javerous closed 4 years ago

javerous commented 4 years ago

Hello.

So I noticed something weird with dates shown in my feed in Reeder (macOS).

For example, for this feed https://www.macg.co/news/feed, there is this article (item):

    <item>
        <title>Mozilla corrige une faille critique dans Firefox 72</title>
        <link>https://www.macg.co/logiciels/2020/01/mozilla-corrige-une-faille-critique-dans-firefox-72-111243</link>
        <description>
Gueule de bois en ce début d'année pour Mozilla. Firefox 72, dont la version finale est disponible depuis le 7 janvier, doit être mis à jour de toute urgence en raison de la découverte d'une faille de sécurité « 0 Day » qui a déjà été exploitée par des malandrins. Tous les utilisateurs ayant installé la version 72 du navigateur libre doivent télécharger toutes affaires cessantes la mise à jour fraichement mise en ligne.

L'affaire est manifestement de la plus haute importance : la CISA, l'agence sur la sécurité informatique américaine, a très officiellement alerté les internautes de cette vulnérabilité qui permet à un forban de prendre le contrôle de l'ordinateur infecté. Les versions 72.0.1 de Firefox et 68.4.1 de Firefox ESR (la mouture « grandes organisations » du navigateur) corrigent la faille.

À noter que la version iOS de Firefox n'est pas concerné en raison de l'utilisation du moteur WebKit.
</description>
        <pubDate>Fri, 10 Jan 2020 17:00:29 +0100</pubDate>
        <dc:creator>Mickaël Bazoge</dc:creator>
        <guid isPermaLink="true">https://www.macg.co/logiciels/2020/01/mozilla-corrige-une-faille-critique-dans-firefox-72-111243</guid>
    </item>

So the date seems to be ok (17h00 for Paris time, i.e. +1). It's displayed the right way in the Web UI (my PHP is configured with the right Timezone).

Then, if I check the response of https://[redacted]/freshrss/api/greader.php/reader/api/0/stream/items/contents for this article, I see this content:

{
    "id": "user/-/state/com.google/reading-list",
    "updated": 1578673145,
    "items": [{
        "id": "tag:google.com,2005:reader/item/00059bcb73fa3a68",
        "crawlTimeMsec": "1578673009998",
        "timestampUsec": "1578673009998440",
        "published": 1578672000,
        "title": "Mozilla corrige une faille critique dans Firefox 72",
        "summary": {
            "content": "Gueule de bois en ce début d'année pour Mozilla. Firefox 72, dont la version finale est disponible depuis le 7 janvier, doit être mis à jour de toute urgence en raison de la découverte d'une faille de sécurité « 0 Day » qui a déjà été exploitée par des malandrins. Tous les utilisateurs ayant installé la version 72 du navigateur libre doivent télécharger toutes affaires cessantes la mise à jour fraichement mise en ligne.\n\n\n\nL'affaire est manifestement de la plus haute importance : la CISA, l'agence sur la sécurité informatique américaine, a très officiellement alerté les internautes de cette vulnérabilité qui permet à un forban de prendre le contrôle de l'ordinateur infecté. Les versions 72.0.1 de Firefox et 68.4.1 de Firefox ESR (la mouture « grandes organisations » du navigateur) corrigent la faille.\n\nÀ noter que la version iOS de Firefox n'est pas concerné en raison de l'utilisation du moteur WebKit."
        },
        "alternate": [{
            "href": "https://www.macg.co/logiciels/2020/01/mozilla-corrige-une-faille-critique-dans-firefox-72-111243"
        }],
        "categories": ["user/-/state/com.google/reading-list", "user/-/label/Informatique"],
        "origin": {
            "streamId": "feed/51",
            "title": "MacGeneration"
        },
        "author": "Mickaël Bazoge"
    }]
}

"updated": 1578673145 → 10/01/2020 - 17:19:05
"published": 1578672000 → 10/01/2020 - 17:00:00

So the timestamp seems to be right too.
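
For anyone who wants to double-check these conversions, here is a minimal Python sketch. It assumes a fixed +01:00 offset (the one given in the feed's pubDate) instead of a real time zone database, and the helper name `to_paris` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed +01:00 offset, as in the feed's pubDate (no DST handling).
CET = timezone(timedelta(hours=1))

def to_paris(epoch_seconds: int) -> str:
    """Format a Unix timestamp (in seconds) as local time at +01:00."""
    local = datetime.fromtimestamp(epoch_seconds, tz=CET)
    return local.strftime("%d/%m/%Y - %H:%M:%S")

print(to_paris(1578673145))  # the "updated" value
print(to_paris(1578672000))  # the "published" value
```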

But then Reeder shows the article date as 17:16, which is exactly when the https://[redacted]/freshrss/api/greader.php/reader/api/0/stream/items/contents request was made (so when it fetched the article content).

So do you think it's a bug in Reeder, or is it something invalid in the greader protocol that confuses Reeder ?

Thank you.

Note: I put the whole investigation, so if it appears it's indeed a Reeder bug, I can point this ticket to Reeder dev.

javerous commented 4 years ago

Okay, I dug a bit into the Reeder assembly, and I found the problem. Here is a quick summary.

Reeder uses only these keys at the root of the item entry:

"title"
"id"
"origin"
"alternate"
"content"
"summary"
"author"
"timestampUsec"
"html_content"
"html_title"

i.e. they don't use the "published" entry: they are using timestampUsec.

Then I noticed what had been in front of me since the beginning: 1578673009998440 = 1578673009.998440 seconds = 10/01/2020 - 17:16:49 + 998 msec (i.e. what Reeder shows me). So it seems consistent.

And after reading greader.php sources, it appears it's not even a "real" timestamp, but it's the id of the entry (which is probably the timestamp of when the entry was created in the database (?)).
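
The decoding above can be reproduced with a short Python sketch. It assumes a fixed +01:00 offset for display, and `split_usec` is a hypothetical helper name, not something taken from FreshRSS or Reeder.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1))  # assumption: fixed +01:00 offset

def split_usec(timestamp_usec: str) -> tuple[int, int]:
    """Split a microsecond timestamp string into (seconds, microseconds)."""
    return divmod(int(timestamp_usec), 1_000_000)

entry_id = "1578673009998440"          # the value sent as timestampUsec
seconds, micros = split_usec(entry_id)
when = datetime.fromtimestamp(seconds, tz=CET).replace(microsecond=micros)
print(when.isoformat())

# crawlTimeMsec is the same value with the last three digits dropped.
crawl_time_msec = entry_id[:-3]
print(crawl_time_msec)
```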

So, I changed

'crawlTimeMsec' => '' . substr($entry->id(), 0, -3),
'timestampUsec' => '' . $entry->id(),    //EasyRSS
'published' => $entry->date(true),

to

'crawlTimeMsec' => '' . ($entry->date(true) * 1000),
'timestampUsec' => '' . ($entry->date(true) * 1000000),
'published' => $entry->date(true),

on my instance, and it works fine now (the right dates are displayed in Reeder). Note: I'm not a PHP developer, so there is perhaps a better way to do that.
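
To make the difference between the two variants concrete, here is a Python sketch that mirrors the two PHP mappings. The function names are hypothetical, and this is only an illustration of the field arithmetic, not FreshRSS code.

```python
def fields_from_id(entry_id: str, published: int) -> dict:
    """Original behaviour: both crawl fields are derived from the entry id."""
    return {
        "crawlTimeMsec": entry_id[:-3],   # like substr($id, 0, -3) in PHP
        "timestampUsec": entry_id,
        "published": published,
    }

def fields_from_date(published: int) -> dict:
    """Patched behaviour: both crawl fields are derived from the publication date."""
    return {
        "crawlTimeMsec": str(published * 1000),
        "timestampUsec": str(published * 1_000_000),
        "published": published,
    }

print(fields_from_id("1578673009998440", 1578672000))
print(fields_from_date(1578672000))
```

One side effect worth keeping in mind: with the date-based variant, two articles published during the same second get an identical timestampUsec, since the publication date only has second resolution.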

So, who is wrong, who is right, there ? Is it Reeder which is wrong to use this value to date articles ? Or FreshRSS which is not giving a valid timestamp ?

If there is no real right answer (if I understand correctly, the protocol was reverse-engineered, and there is no official doc), what can we do ?

Try to convince Reeder dev to use "published" field instead ?
Use a correct value in greader.php (but if I understand the comment, it can break compatibility with EasyRSS ?) ?
Add a FreshRSS system config, like an "API compatibility", which would let the admin choose what should go in this field ?
Detect the User-Agent, and return the right value accordingly ?

javerous commented 4 years ago

For the last point, just in case, here is the User-Agent I see in the HTTPS requests for the very latest version of Reeder:

On macOS : Reeder/4020.29.01 CFNetwork/978.2 Darwin/18.7.0 (x86_64)
On iOS: Reeder/4020.29.03 CFNetwork/1121.2.2 Darwin/19.2.0
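
As a rough idea of what User-Agent detection could look like, here is a Python sketch. It only assumes that Reeder keeps sending a User-Agent starting with the `Reeder/` product token, as in the two strings above; `is_reeder` is a made-up name.

```python
def is_reeder(user_agent: str) -> bool:
    """Return True if the User-Agent starts with the 'Reeder/' product token."""
    return user_agent.startswith("Reeder/")

mac = "Reeder/4020.29.01 CFNetwork/978.2 Darwin/18.7.0 (x86_64)"
ios = "Reeder/4020.29.03 CFNetwork/1121.2.2 Darwin/19.2.0"
print(is_reeder(mac), is_reeder(ios), is_reeder("Mozilla/5.0"))
```

This kind of check is fragile by nature: it breaks as soon as the client changes its User-Agent format.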

Frenzie commented 4 years ago

Judging by a quick peek at the EasyRSS source the comment simply means that it uses timestampUsec, not necessarily that it needs to be the specific value it currently is.

Also see https://github.com/FreshRSS/FreshRSS/commit/00774f5a0bf2eacbb1825ccbf07e3fbc7b114b4d

javerous commented 4 years ago

Hmm yes.

I don't know if I'm looking at the right place, but I see

} else if ("timestampUsec".equals(name)) {
   item.setTimestamp(Long.valueOf(parser.getText()));
}

in their code, so they are not using it for uniquely identifying the field (i.e. as an ID), but really using it for timestamping. So I guess the change would benefit this app too…

@Alkarex any thoughts ?

Frenzie commented 4 years ago

I'd never noticed in all the years I've used it, but you're right that the same discrepancy exists as described in the OP. It feels more like a feature than a bug though, in the sense that display by date fetched is much more relevant to me than the claimed date published.

javerous commented 4 years ago

Well, I didn't really notice at first, because my instance is configured to fetch feeds very frequently, so the fetching dates were almost the same as the publishing dates.

But then it became problematic (at least for me) when I added news feeds with a long history to my instance: all the articles were set to almost the same date in Reeder (when fetched), even old articles still part of the feed. It's not very practical, for example for stories which are split across different days or weeks: you have to read them in order, and the date can be important.

By the way, the Web UI shows the publishing date (at least by default), and not the fetching date.

But if you feel that some people would be more interested by date of fetch, then perhaps it can be something configurable ?

Frenzie commented 4 years ago

But then it became problematic (at least for me) when I was adding news feed with long history on my instance: all the articles were set to almost the same date in Reeder (when fetched), even old articles still part of the feed. It's not very practical, for example for stories which are split to different days or weeks: you have to read them in order, and the date can be important.

This is the same in EasyRSS. It's actually why I refer to this feeling as a feature within the context of EasyRSS.

If FreshRSS didn't lie to EasyRSS, you'd presumably get a scenario where all the new stuff would be randomly mixed in with the old stuff so you don't know it's new. By contrast, by lying about it you get the new entries nicely grouped together in the correct order. (Ergo, I don't quite see how your problem could be a problem. They have the same superficial date, but different microseconds so they sort correctly.)

But if you feel that some people would be more interested by date of fetch, then perhaps it can be something configurable ?

That could potentially be an interesting idea; might be a bit niche though.

Alkarex commented 4 years ago

Hum, I believe FreshRSS is doing exactly what it should, and that some clients have picked the wrong field.

crawlTimeMsec is the same as timestampUsec, which just has a better precision (I believe timestampUsec might have obsoleted crawlTimeMsec, which was just kept for backward compatibility). timestampUsec is used by the stream API in filters such as ot and it when a client would like to move through articles by time (similar to what the FreshRSS Web interface does). Using the published time there would not only break the API definition but also be buggy for all articles which have a published date significantly different from the crawl date (potentially far in the future or in the past, with time zone errors, or for instance, an article declaring just "2020" as date - which is common for some scientific publications - e.g. https://github.com/FreshRSS/FreshRSS/issues/2154 ). It would also be buggy when new articles are added / discovered between two stream API requests (some articles would be skipped). Indeed, only the first crawl dates provide a robust, monotonic series of timestamps that can be used for further streaming needs, while the publication date is purely informative and unreliable.

So, I am afraid that changing this part would make our Google Reader API implementation less compliant as well as less robust.
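
The "some articles would be skipped" scenario can be illustrated with a small Python simulation. Everything here is made up for the illustration (the data, the `poll` and `sync` helpers, and the simplified "newer than the last value seen" cursor); it is not how FreshRSS or any real client is implemented.

```python
def poll(items, key, last_seen):
    """Return items whose `key` timestamp is strictly newer than `last_seen`."""
    return [it for it in items if it[key] > last_seen]

# Made-up data: "crawled" is when the server first saw the item.
first_batch = [
    {"title": "A", "published": 100, "crawled": 1000},
    {"title": "B", "published": 200, "crawled": 1001},
]
# Discovered later by the server, but it declares an old publication date.
late_item = {"title": "C", "published": 150, "crawled": 2000}

def sync(key):
    """Two successive polls using `key` as the paging cursor."""
    seen = poll(first_batch, key, last_seen=0)
    cursor = max(it[key] for it in seen)
    seen += poll(first_batch + [late_item], key, last_seen=cursor)
    return [it["title"] for it in seen]

print(sync("crawled"))    # ['A', 'B', 'C']
print(sync("published"))  # ['A', 'B'], C is never delivered
```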

For reference:

http://web.archive.org/web/20130708105542/http://undoc.in/stream.html#contents
https://github.com/arowser/google-reader-api/blob/master/source/glossary.rst

time in microseconds since the epoch that the item appeared in the direct stream that it was in.
https://www.inoreader.com/developers/stream-contents (see also the provided example to see the difference between crawlTimeMsec and published)

crawlTimeMsec and timestampUsec are the same date, the first with milisecond, the second with microsecond resolution. Use timestampUsec whenever possible, because we need microsecond resultion.
From the best client implementation I know: https://github.com/noinnion/newsplus/blob/2ce0002c8f4cea594fe1208922e1f4f184e98eb2/extensions/GoogleReaderCloneExtension/src/com/noinnion/android/newsplus/extension/google_reader/GoogleReaderClient.java#L747-L748

Here is an output from the original Google Reader (check the different time fields):

{
    "isReadStateLocked" : true,
    "commentInfo" : {
      "user/10311923250980613279/state/com.google/broadcast" : {
        "permalinkUrl" : "https://plus.google.com/109269993425247359567/posts/A9LxpLSKG1g",
        "commentState" : "VIEWER_CANNOT_COMMENT"
      }
    },
    "crawlTimeMsec" : "1320065162882",
    "timestampUsec" : "1320065162882657",
    "id" : "tag:google.com,2005:reader/item/c13b84553b7c7bb3",
    "categories" : [ "user/10311923250980613279/state/com.google/broadcast", "user/10311923250980613279/state/com.google/like", "user/10311923250980613279/state/com.google/read" ],
    "title" : "10/31/11 PHD comic: 'Division of Labor'",
    "published" : 1320045130,
    "updated" : 1320045130,
    "alternate" : [ {
      "href" : "http://www.phdcomics.com/comics.php?f=1449",
      "type" : "text/html"
    } ],
    "summary" : {
      "direction" : "ltr",
      "content" : "<center>\n  <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">        \n    <tr>\n      <td><b><font face=\"Arial, Helvetica, sans-serif\" size=\"+1\">Piled Higher\n        &amp; Deeper</font><font face=\"Arial, Helvetica, sans-serif\"> <i> by Jorge\n        Cham</i></font></b></td>\n      <td> </td>\n      <td>\n        <div align=\"right\"><b><font face=\"Arial, Helvetica, \nsans-serif\">www.phdcomics.com</font></b></div>\n      </td>\n    </tr>\n    <tr align=\"center\">\n      <td colspan=\"3\"><font face=\"Arial, Helvetica, sans-serif\"><img alt=\"Click on the title below to read the comic\" src=\"http://www.phdcomics.com/comics/archive/phd103111s.gif\" border=\"0\" align=\"top\"></font></td>\n    </tr>\n    <tr>\n      <td colspan=\"3\">\n        <div align=\"center\"><font size=\"-2\" face=\"Arial, Helvetica, sans-serif\">title:\n          &quot;<a href=\"http://www.phdcomics.com/comics.php?f=1449\">Division of Labor</a>&quot; - originally published \n10/31/2011  \n        </font><p><font face=\"arial\">For the latest news in PHD Comics, <a href=\"http://www.phdcomics.com/comics.php\">CLICK HERE!</a></font></p>\n \n</div>\n      </td>\n    </tr>\n  </table>\n</center>"
    },
    "comments" : [ ],
    "annotations" : [ ],
    "origin" : {
      "streamId" : "feed/http://www.phdcomics.com/gradfeed.php",
      "title" : "PHD Comics",
      "htmlUrl" : "http://www.phdcomics.com"
    }
  }
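
A quick way to compare the time fields of this Google Reader output is sketched below in Python, using only the three values shown above.

```python
crawl_time_msec = int("1320065162882")
timestamp_usec = int("1320065162882657")
published = 1320045130

# The two crawl fields are the same instant at different resolutions.
assert timestamp_usec // 1000 == crawl_time_msec

# Gap between the declared publication date and the crawl, in whole seconds.
gap = crawl_time_msec // 1000 - published
hours, rest = divmod(gap, 3600)
minutes, seconds = divmod(rest, 60)
print(f"crawled {hours}h {minutes}m {seconds}s after publication")
```

So in the original service too, the crawl fields and "published" were clearly different instants.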

Frenzie commented 4 years ago

@Alkarex Thanks for investigating. :+1:

javerous commented 4 years ago

@Alkarex I (somehow) understand, but there was never official API for that, right ? I mean it's based on Google API, and they never documented it, so it was mostly documented by reverse engineering, right ? I'm wondering why we should be so strict about what was mainly guessing…

Another point: it seems that EasyRSS is also using timestampUsec to timestamp its articles (I don't see any references to "published"). I'm not sure how many GReader clients there are out there, but it seems it was the norm, no ? Shouldn't the usage (even if unfortunate) define the API behavior ?

As I said, when you subscribe to a feed for the first time, I think most people (at least I do) expect the articles to be ordered by their publication date, and not by their fetching date. And if an article only gives a year, it will not change much compared to the actual mess you get when you subscribe, when they all appear "randomly" at the same date… At worst it will not be worse.

Anyway, what about adding an option somewhere, disabled by default, to be able to change this behavior ? You can see it as an option "not following the specs, but enhancing compatibility with what actual clients are doing"...

javerous commented 4 years ago

but also be buggy for all articles which have a published date significantly different than the crawl date (potentially far in the future or in the past, with time zone errors

I'm not sure I understand this point: what is the problem with having a big gap between crawl date and published date ? An RSS client wouldn't even be aware of that, as it would only receive the same date (with different multipliers). And again, for an RSS client, the crawling date is not very interesting. And for some reason, they are all using that to date articles, anyway.

It would also be buggy when new articles are added / discovered between two stream API requests (some articles would be skipped). Indeed, only the first crawl dates provide a robust, monotonous series of timestamps that can be used for further streaming needs, while the publication date is purely informative and unreliable.

Requests to the server are done by giving crawling date intervals and not by entries ID ? This API is weird, I would have expected that requests be something like "give me all articles published since this entry ID" and not "give me articles since this fetch timestamp" 🤔

Okay, I can understand this point, then, but an option to use the publish dates would be great anyway. As said, all my feeds use valid (reliable) publish dates, so I guess it's not that common to have unreliable dates, and so it can make sense for some people to use that.

EasyRSS is open source, so it's something which can be easily changed, but Reeder is not, and I'm pretty sure it will not be fixed. I have been using this fix for a week on my instance, and I haven't had any problem. I would find it annoying to have to patch my instance at each update :s

Frenzie commented 4 years ago

But the EasyRSS usage isn't unfortunate at all, it's exactly as it should be. It may or may not have picked the wrong field for display but it's definitely picked the right field for sorting.

javerous commented 4 years ago

Well yes, if the whole API is based on timestamps for fetching articles, I understand. It's an unfortunate choice for an API from my point of view, but it is what it is…

I'm not sure I understand the "sorting" point. Is it something needed by the API ? Or do you mean sorting in the UI ? If it's for the UI, I don't understand who would want to sort by fetching date, especially at the moment you subscribe to a new feed…

For EasyRSS, as they never fetch the "published" field, I guess they use this "timestampUsec" field for display too. And when I tested on an Android device, this fix made the app display the published date, like in the FreshRSS UI.

Anyway, I think I'm not going to convince you, especially if the API was badly designed by Google (still from my PoV ^^), but considering an option to be able to change this behavior (between "spec" and "actual usage") would be nice, at least ;)

javerous commented 4 years ago

In fact, it can be a global configuration, or by feed configuration.

If you know that a specific feed is not using good publishing date (or not accurate, like article giving just a year of publication), you can configure it to use the date of fetch (so even if your server hosting FreshRSS has a problem and is not able to update feeds, making it out-of-sync with real publication date, you don't lose anything, as the publication date was inaccurate anyway).

And if you know that a specific feed is using good publishing date, then you can configure it to use the publication date for fetching date. In this case you would have accurate dating for RSS clients which use the fetching date to date articles, and you wouldn't be out-of-sync if the server hosting FreshRSS has a problem.

The default would be to use fetch date, and it would be up to the user to change that, if wanted.

Be open-minded, it looks like a good compromise, no ? ;p

I can even implement it.

Frenzie commented 4 years ago

I'm not sure I understand the "sorting" point. Is it something needed by the API ? Or do you mean sorting in the UI ? If it's for the UI, I don't understand who would want to sort by fetching date, especially at the moment you subscribe to a new feed…

I'm sure the API is absolutely terrible, but I definitely don't want to sort by date published. RSS is like email for websites. When you subscribe to a feed, the website "sends" you its latest updates. Any other order makes no sense to me. Then what's brand new for me would randomly be sorted among old content just because it was published 20 years ago.

Be open-minded, it look like a good compromise, no ? ;p

I can even implement it.

I never objected to that. ;-) I was going to implement it myself when I mistakenly thought there was a bug in the API implementation. By "niche" I only meant I don't quite know if it should be part of the core, but the core should facilitate it either way.

Frenzie commented 4 years ago

Basically I mean adding something relatively trivial to the core vs. adding a potentially slightly less trivial but much more versatile extension hook. Assuming one doesn't exist yet.

javerous commented 4 years ago

I'm sure the API is absolutely terrible, but I definitely don't want to sort by date published. RSS is like email for websites. When you subscribe to a feed, the website "sends" you its latest updates. Any other order makes no sense to me.

I'm so astonished by your point of view that I'm to the point I'm not sure I understood well :p… We are perhaps deviating from the original subject, but I want to understand (and it's interesting anyway). Sorry if it creates noise in this ticket :s

When you say "by date published", you mean from UI perspective, right ?

You are comparing with e-mail, but everyone reads their e-mail sorted by sent date (i.e. "published", to stay in the comparison). I even checked some of my e-mails to verify that the client indeed uses the date sent, and not the received dates added by servers. And I checked a bunch of e-mail clients: none of them offer to sort your list by date received. Do you know of any e-mail client which keeps the order in which it receives the e-mails from the server, except by using telnet, I mean…

I'm not sure in which order the server sends you new e-mails, but you need all of them to have a proper order in your list, without gaps.

For e-mails, you may have to follow a conversation, which needs sequentiality. It can happen with RSS articles as well, where they can need sequentiality by publishing date, in the order the author writes and publishes them (for a series), or in the order events happen for news, etc. It's the case for most of my feeds, at least.

If your clients (e-mail or FreshRSS) fetch your e-mails or your RSS feeds frequently, like every 10 minutes, then everything is going to be in the right order (i.e. in the sequential order of their writing), and the publishing date is going to be (almost) the same as the fetching date, in sync, and it's (almost) okay. But as soon as you quit your mail client, upgrade the server hosting FreshRSS, or have a network problem which lasts for a week or more, then you are not going to be in sync anymore, and then the sequentiality would be completely broken.

By the way, the FreshRSS UI sorts articles by publishing date, not by receiving date. It's not even possible to sort by receive date.

Then what's brand new for me would randomly be sorted among old content just because it was published 20 years ago.

Why random ? If they have a proper publication date / sent date, they would be sorted by these dates. It wouldn't be random. It becomes random if you sort by receiving date. If you fetch all of them at the same moment, what is their relative order going to be ? If your server is doing parallel fetching, the order can even depend on some opaque things like the thread scheduler, network delay, etc. It's close to the definition of random. Publishing date is the definition of "not random".

By "niche" I only meant I don't quite know if it should be part of the core, but the core should facilitate it either way.

Well, I don't know how many GReader API clients there are out there, but the best-known compatible ones (EasyRSS and Reeder, apparently) are using this field to date (and order) articles in the UI, so… And as said, I think most people want publishing order (okay, we can open a poll on that point, perhaps :p).

I suspect that EasyRSS and Reeder used this receive date on purpose because of @Alkarex's point, i.e. some feeds were not using proper publishing dates, so it was, in some (rare ?) cases, creating ordering problems.

Giving the ability to configure this behavior by feed seems a good thing, to me, to work around those corner cases.

Basically I mean adding something relatively trivial to the core vs. adding a potentially slightly less trivial but much more versatile extension hook. Assuming one doesn't exist yet.

I will take a look at the extension hooks, but even if one already exists, it needs to be called in the API anyway, which is not the case for now (the fields are currently based on the id, so we can't change them in an extension; it should be something dedicated to updating the "display receive date"). And if none exists, I totally support you in adding one (I can do it too, it doesn't seem that complex).

And I'm totally open to writing an extension, even if I keep it only for myself, if an extension hook is put in place.

@Alkarex would you be open for such hook ? If yes (🤞) what would be its name ? "receive_date_before_display" ?

javerous commented 4 years ago

In fact, it's something which can be handled by entry_before_display hook if we add a new property to FreshRSS_Entry which would store the receive date, which would contain the id by default, and which would be used in greader for those 2 fields.

An extension would just have to exchange it with $this->date() if needed.
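
The idea can be sketched in Python as follows. All names here (`Entry`, `date_received`, `use_published_date`, `to_greader`) are hypothetical and only model the proposal; they are not FreshRSS's actual classes or hook API.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    id: str                   # microsecond timestamp assigned at insertion
    date: int                 # publication date, in seconds
    date_received: str = ""   # hypothetical new property

    def __post_init__(self):
        # By default the receive date is the id, as in the current behaviour.
        if not self.date_received:
            self.date_received = self.id

def use_published_date(entry: Entry) -> Entry:
    """What an extension hook could do: swap in the publication date."""
    entry.date_received = str(entry.date * 1_000_000)
    return entry

def to_greader(entry: Entry, hooks=()) -> dict:
    """Run the hooks, then build the time fields from date_received."""
    for hook in hooks:
        entry = hook(entry)
    return {
        "crawlTimeMsec": entry.date_received[:-3],
        "timestampUsec": entry.date_received,
        "published": entry.date,
    }

default = to_greader(Entry("1578673009998440", 1578672000))
patched = to_greader(Entry("1578673009998440", 1578672000), [use_published_date])
print(default["timestampUsec"], patched["timestampUsec"])
```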

Frenzie commented 4 years ago

You are comparing with e-mail, but everyone read its e-mail sorted by sent date (i.e. "published", to stay in the comparison). I even checked in some of my e-mail to check it indeed use the date sent, and not receive dates added by servers. And I checked a bunch of e-mail clients, none of them propose to sort your list by date received. Do you know such e-mail client which keep order at which it receive the e-mails from the server, except by using telnet, I mean…

You're talking about technicalities, while I'm talking about experience. Many clients will offer you a choice between both, but there's no practical distinction between the two 99.9 % of the time. If an e-mail suddenly showed up two days ago, do you think people would consider that a feature or a glitch? I'm sure you can guess my answer. They might not even notice it unless they specifically filtered for unread email! And if they're the type of people who dismiss by subject line without deleting, maybe not even then.

Do you know such e-mail client which keep order at which it receive the e-mails from the server, except by using telnet, I mean…

There's this barely used niche little program called Microsoft Outlook… ;-) Besides that, I've recently heard of this new little web-based thingy called Gmail, I'm sure it won't catch on, but it seems that it uses date received too.

Sarcasm aside, I believe Thunderbird uses date sent by default, but it can use date received if you want, or you can use a simple numerical received order counter for sorting while keeping date sent for display. Outlook can likewise use date sent.

then you are not going to be in-sync anymore, and then the sequentiality would be completely broken.

Exactly! ;-)

By the way, FreshRSS UI sort articles by publishing date, not by receiving date. It's not even possible to sort by receive date.

It's the other way around, for better or worse. Note that sorting by date received naturally entails sorting by date sent as well. This is true for both newsfeeds and email.

Why random ? If they have proper publication date / sent date, they would be sorted by this dates. It wouldn't be random. I became random if you sort by receiving date. If you fetch all of them at the same moment, what is going their relative order ? If your server is doing parallel fetching, the order can even depend of some opaque things like thread scheduler, network delay, etc. It's near the definition of random. Publishing date is the definition of "not random".

Again, experience, not technicalities, as well as a colloquial use of the word random. It doesn't matter if that unopened envelope you stuck in my archive is technically completely predictably sorted by date. The fact that it's in the archive in the first place is what's "random."

Some people like to use silly acronyms like FIFO to refer to our everyday world. The acronym may be silly, as well as the assumption that it's some kind of revelation when everyone works that way just without having an acronym for it, but the principle isn't. It describes what you do when you come home with groceries: you put the new stuff at the back. It describes how you deal with receiving and reading mail. It may not describe how you deal with archiving mail. You're basically talking about an archive.

This is true even if your daily newspaper wasn't delivered. You'll call up the company, and they'll send a new one. If tomorrow's newspaper and the missed newspaper arrive simultaneously, you read them in order. This is what happens when sorting by date received. But if tomorrow's newspaper arrives first, you're not going to sit on it until the missing one finally arrives.

Giving the ability to configure this behavior by feed seems a good thing, to me, to work around those corner cases.

That may be an argument for putting it in the core, but I'm definitely talking about global behavior. I don't have any problematic feeds, or at least none that are problematic in that way. ^_^

javerous commented 4 years ago

It's the other way around, for better or worse. Note that sorting by date received naturally entails sorting by date sent as well. This is true for both newsfeeds and email.

It's true only if you are continuously receiving the e-mails on your computer. If you turn off your computer for a week, then when turning-on again, the received date will be +1 week compared to sending date, with an order probably somehow random. It's also true for intermediate servers which act as client & server, like FreshRSS (even if it's more rare for them to be offline for a whole week).

Anyway, I didn't know it would be possible to be so much in disagreement in such subject :D

I don't think it's worth continuing this specific discussion… It doesn't mean I'm right, but we have all put our arguments on the table, and it looks similar to arguing about "pain au chocolat" vs. "chocolatine", or arguing about whether lemon tart is better with meringue or not. We are not going to conclude ;p

That may be an argument for putting it in the core, but I'm definitely talking about global behavior. I don't have any problematic feeds, or at least none that are problematic in that way. ^_^

Okay. So what next ? I would like this ability (option, ext, etc.) to change this behavior not being dropped in /dev/null, so I don't have to re-patch my FreshRSS instance on each update ;)

I'm not a PHP developer, but if the thing is not too complex, I can participate to the dev effort. @Alkarex do you have any advice on your side ?

Frenzie commented 4 years ago

It's true only if you are continuously receiving the e-mails on your computer. If you turn off your computer for a week, then when turning-on again, the received date will be +1 week compared to sending date, with an order probably somehow random. It's also true for intermediate servers which act as client & server, like FreshRSS (even if it's more rare for them to be offline for a whole week).

No, this is always true even if you turn it off for ten years. See my newspaper analogy.

Frenzie commented 4 years ago

The very fact that you noticed this highly desired feature/annoying problem proves the point.

it became problematic (at least for me) when I was adding news feed with long history on my instance:

Those new articles are all ordered by date received and date published.

javerous commented 4 years ago

No, this is always true even if you turn it off for ten years. See my newspaper analogy.

Yes, and obviously I don't agree with this newspaper point. But it was not really my point. It was not about missing e-mail between range of e-mail received. It was just considering sorting only by receiving date. When you turn on again your computer, the receiving date is going to be, well the moment you open your e-mail client, and start receiving the e-mails since you turned off your computer.

Those new articles are all ordered by date received and date published.

Without this "fix"? No, they are not. At least not in my RSS client. They all appeared with the same date (as they were all received at almost the same time, when I subscribed to the feed), in "random" order (which was not the publishing order at all). It was a real mess. It sorts them by using this timestampUsec key, which is the received date of the entries in FreshRSS. You can't have a guarantee that the order will be the same as date published.

I mean:

For RSS, there is no guarantee that your RSS feed entries, as sent by the original server to which you subscribe, will be ordered by publishing date, and I guess that FreshRSS creates entries in the same order as in the received feed. And this is even more true when you subscribe to multiple feeds.
For e-mail, there is no guarantee that your IMAP server will give you e-mails in the sending order. It really depends on how it stores its things.

It's like expecting that ls on a directory will list your files by creation date or by file name if you don't force the order: it depends on how the FS stores the entries, and how it walks through them.

By the way, how is it possible to order something by using two keys, especially if they are conflicting? I mean, you can prioritize one key over another, for the case where two items have the same date, but without that…
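
For what it's worth, ordering by two keys usually means exactly that kind of prioritization: the first key decides, and the second one only breaks ties. A minimal PHP sketch, where the field names `received` and `published` and all the values are made up for illustration and are not taken from FreshRSS:

```php
<?php
// Sort by 'received' first; 'published' is only consulted when 'received' ties.
// Field names and values are hypothetical.
function sortByReceivedThenPublished(array $entries): array {
    usort($entries, function (array $a, array $b): int {
        // Arrays of equal length are compared element by element, in order.
        return [$a['received'], $a['published']] <=> [$b['received'], $b['published']];
    });
    return $entries;
}

$entries = [
    ['title' => 'C', 'received' => 200, 'published' => 150],
    ['title' => 'B', 'received' => 100, 'published' => 90],
    ['title' => 'A', 'received' => 100, 'published' => 80],
];

$sorted = sortByReceivedThenPublished($entries);
echo implode(', ', array_column($sorted, 'title')), "\n"; // A, B, C
```

If the two keys really conflict (one says A before B, the other says the opposite), the primary key simply wins.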

javerous commented 4 years ago

@Alkarex Not to make a point, just for technical information: I took a look at this weird ot stream field.

If I understood correctly, the client queries the IDs of entries crawled by the server since a specific date, and then fetches the actual articles it needs by requesting a list of IDs from the server.

Apparently, when Reeder fetches unread articles with this ot field (when it fetches starred or read articles, it doesn't use ot at all, just a limit via n), it doesn't use the last crawling date it received or the last article date at all. It just uses some internal delay (for me it was 1578047252, i.e. 03/01/2020 - 11:27:32, just now when I launched it, so it looks like 2 weeks before the current date).

Not sure what EasyRSS is doing, but for Reeder, having the crawling date based on the publish date would never make it miss articles, as it doesn't base its query on it. The only way to miss articles is probably to not run Reeder for more than 2 weeks (which makes this choice a bit weird, btw).
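
To make the ot value concrete, here is a small PHP sketch. Both helper names are hypothetical, and the "crawled at or after ot" comparison is only an assumption about how a server might filter; it is not copied from the FreshRSS code:

```php
<?php
// Hypothetical helper: show an `ot` epoch value (seconds) in a given timezone.
function formatOt(int $ot, string $tz = 'Europe/Paris'): string {
    $date = (new DateTimeImmutable('@' . $ot))->setTimezone(new DateTimeZone($tz));
    return $date->format('d/m/Y - H:i:s');
}

// Hypothetical server-side filter: keep the IDs of entries crawled at or after `ot`.
// $crawlTimes maps an entry ID to its crawl time in seconds.
function idsCrawledSince(array $crawlTimes, int $ot): array {
    $kept = array_filter($crawlTimes, function (int $crawled) use ($ot): bool {
        return $crawled >= $ot;
    });
    return array_keys($kept);
}

echo formatOt(1578047252), "\n"; // 03/01/2020 - 11:27:32

// Made-up crawl times: only 'b' and 'c' fall inside the window.
$crawlTimes = ['a' => 1578047000, 'b' => 1578047252, 'c' => 1579000000];
echo implode(', ', idsCrawledSince($crawlTimes, 1578047252)), "\n"; // b, c
```

With a fixed two-week window like this, the crawl date attached to each entry does not influence which IDs the client asks for, only whether an entry falls inside the window.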

Frenzie commented 4 years ago

Without this "fix"? No, they are not. At least not in my RSS client.

That would be an issue with Reeder then, and a rather annoying one to boot. I'm afraid I misunderstood your problem. In FreshRSS and EasyRSS, adding a new feed looks like this:

Screenshot_2020-01-17_11-10-02

This way your newly added stuff doesn't get buried down who knows where.

You may disagree that it's the most sensible option, but it's definitely sorted by date received and date published.

You can't have a guarantee that the order will be the same as date published.

It's guaranteed by some minor trickery with the date received. In reality, all articles in a new feed are received all at once. But for sorting purposes, they're entered incrementally by date published. Although in a way, that reflects how they're actually received… ;-)
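
A rough PHP sketch of that kind of trickery, not the actual FreshRSS code: the function name, the field names and the base timestamp are all hypothetical, and it assumes 64-bit PHP integers:

```php
<?php
// Assign strictly increasing "received" IDs (in microseconds) in publication order,
// so that sorting by ID later reproduces the publication order.
function assignReceivedIds(array $articles, int $baseUsec): array {
    usort($articles, function (array $a, array $b): int {
        return $a['published'] <=> $b['published'];
    });
    foreach ($articles as $i => $article) {
        $articles[$i]['id'] = $baseUsec + $i; // one microsecond apart
    }
    return $articles;
}

// A new feed delivers its items newest first, all at the same moment.
$articles = [
    ['title' => 'newest', 'published' => 1578672029],
    ['title' => 'oldest', 'published' => 1578500000],
    ['title' => 'middle', 'published' => 1578600000],
];

foreach (assignReceivedIds($articles, 1579255802000000) as $article) {
    echo $article['id'], ' ', $article['title'], "\n";
}
// 1579255802000000 oldest
// 1579255802000001 middle
// 1579255802000002 newest
```

All three IDs fall within the same second, so the ordering only survives in clients that keep the full microsecond precision.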

For RSS, there is no guarantee that your RSS feed entries, as sent by the original server to which you subscribe, will be ordered by publishing date, and I guess that FreshRSS creates entries in the same order as in the received feed. And this is even more true when you subscribe to multiple feeds.

That's irrelevant. They're sorted by date published, and if they weren't that'd be a bug.

For e-mail, there is no guarantee that your IMAP server will give you e-mails in the sending order. It really depends on how it stores its things.

Analogies can and do break down, although if it does it would be on the SMTP side of things. Date received is date received on the server, not date received in your client.

It's like expecting that ls on a directory will list your files by creation date or by file name if you don't force the order: it depends on how the FS stores the entries, and how it walks through them.

I very much expect ls to list my files one of those ways, preferably alphabetically (or more realistically and a lot more unfortunately, probably ASCII-betallically). Not doing so would be an absurdly odd default.

Frenzie commented 4 years ago

Not doing so would be an absurdly odd default.

Although a lot of these Unix-y tools do have absurdly odd defaults, so… it's all up in the air. ;-)

javerous commented 4 years ago

Okay, so I think we perhaps agree in fact (?).

If the entries are inserted incrementally by date published, then if you order by IDs (i.e. date fetched), obviously, even if these fetching dates are not the published dates, the order will be the publishing order!

It's a point I didn't know, so it probably explains my whole misunderstanding.

For the WebUI, it's what I said in my first message:

So the date seems to be ok (17h00 for Paris time, i.e. +1). It's displayed the right way in the Web UI (my PHP is configured with the right Timezone).

But it was not the case in Reeder. And then, according to what you are saying just now, my guess is that Reeder is perhaps just truncating the timestamps to keep only the seconds. So they would all end up with almost the same "second" timestamp, which can explain the mess.
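
To illustrate that guess (it is only a guess about what Reeder does), a short PHP sketch with made-up timestampUsec values, assuming 64-bit integers:

```php
<?php
// Drop the microsecond part of a timestampUsec-style string.
function usecToSeconds(string $timestampUsec): int {
    return intdiv((int) $timestampUsec, 1000000);
}

// Three hypothetical entries stored one microsecond apart, within the same second.
$timestampsUsec = ['1579255802000000', '1579255802000001', '1579255802000002'];
$seconds = array_map('usecToSeconds', $timestampsUsec);

echo count(array_unique($timestampsUsec)), "\n"; // 3 distinct sort keys
echo count(array_unique($seconds)), "\n";        // 1 sort key left after truncation
```

Once every entry has the same key, the order depends on whatever the client falls back to, which would look random.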

For EasyRSS, perhaps it keeps the whole microsecond precision for ordering (even if it's weird to order by using fetch timestamp, except if this publishing ordering is documented for greader API), but for me there is still a problem: it uses timestampUsec only, which means that, even if the order is right, the date shown will be the fetched date, and not the publishing date. And I don't agree that it's something desirable (back to my e-mail example, where the client shows the sending date, and not the fetching date).

And again, FreshRSS (in the WebUI) uses the right publishing order (which is perhaps the "fetching" order: as you explained, it should be the same, because of the FreshRSS trick), and shows the publishing date in the WebUI (I tested, it was what I explained in my first message: it was showing 17:00 in the UI, i.e. publishing date, and not 17:19, i.e. fetching date).
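
What I would expect from a client, sketched in PHP: order by timestampUsec, but display the published field. The item below only mimics the shape of a greader payload; the published value matches the pubDate of the article from my first message, while the exact timestampUsec value is an assumption (17:19:00, since I only know the minute):

```php
<?php
// Format an epoch value (seconds) as a local time of day.
function localTime(int $epochSeconds, string $tz = 'Europe/Paris'): string {
    return (new DateTimeImmutable('@' . $epochSeconds))
        ->setTimezone(new DateTimeZone($tz))
        ->format('H:i');
}

$item = [
    'published'     => 1578672029,         // Fri, 10 Jan 2020 17:00:29 +0100
    'timestampUsec' => '1578673140000000', // assumed fetch time, 17:19:00 +0100
];

$orderKey = $item['timestampUsec']; // used for ordering only, never shown

echo localTime($item['published']), "\n";                            // 17:00
echo localTime(intdiv((int) $item['timestampUsec'], 1000000)), "\n"; // 17:19
```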

javerous commented 4 years ago

I very much expect ls to list my files one of those ways, preferably alphabetically (or more realistically and a lot more unfortunately, probably ASCII-betallically). Not doing so would be an absurdly odd default.

Yes, exactly! That's why I was thinking that your point was absurd!

But I understand better, now that I know the relative order of fetching follows the publishing date.

Frenzie commented 4 years ago

To be clear, in actual practice sorting by date published isn't a huge issue or anything. You can just view the individual feed and it's fine that way. But to me the date received column was always largely just visual noise in clients like QuiteRSS. It's a bit of a UI/UX issue, wanting to sort by a property without necessarily caring all that much about what it actually says. (In Thunderbird you can drag it into invisibility, but that doesn't really work either.)

Screenshot_2020-01-17_12-11-57

even if it's weird to order by using fetch timestamp, except if this publishing ordering is documented for greader API

It's done this way on many aggregators. The Old Reader does the same thing, for example. (Iirc so do OwnCloud News and many others but I only double checked The Old Reader.) It's possible that it's primarily a default on those that are more inspired by Google Reader.

for me there is still a problem [in EasyRSS]

For me too, albeit in a fairly theoretical way. ;-) I might fix it up in EasyRSS.

javerous commented 4 years ago

Okay, good, I think we are somehow okay.

But still, about Reeder, and according to what I read here and there, its author is not going to change the code anytime soon (i.e. use the published field to show article dates).

So it would be nice to add the ability to change what FreshRSS gives to the client, as a compatibility layer.

And I think that what I suggested is an elegant (?) and quick way to do it.

For example, it would be a matter of changing Entry.php:

...
    private $date;
    private $date_added = 0;
... 
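// Setting the ID also initialises date_added, unless it was already overridden.
public function _id($value) {
    $this->id = $value;
    if ($this->date_added == 0) {
        $this->date_added = $value;
    }
}

// Lets an extension override the date reported to API clients.
public function _dateAdded($value) {
    $this->date_added = $value;
}

// $scrape is the number of trailing digits to drop from date_added:
// 0 keeps the full value, 3 gives milliseconds, 6 gives seconds
// (assuming date_added holds a microsecond timestamp).
public function dateAdded($raw = false, $scrape = 6) {
    // Clamp to 0..6 so a negative value cannot reverse the meaning of substr().
    $scrape = max(0, min(6, $scrape));
    if ($scrape == 0) {
        $date = $this->date_added;
    } else {
        $date = intval(substr($this->date_added, 0, -$scrape));
    }
    if ($raw) {
        return $date;
    } else {
        // Only meaningful with $scrape = 6, as timestamptodate() expects seconds.
        return timestamptodate($date);
    }
}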
…

And then in greader:

'crawlTimeMsec' => '' . $entry->dateAdded(true, 3),
'timestampUsec' => '' . $entry->dateAdded(true, 0), //EasyRSS + Reeder

And finally, an extension hooking entry_before_display would just have to call $entry->_dateAdded(xxx); if needed. For example, in this case, $entry->_dateAdded($entry->date(true));

And if we should not change the existing dateAdded() (I see it's used in a bunch of other places), it can be another property.

What do you think?
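
To show what the digit dropping would produce, here is a standalone PHP sketch. It assumes, as the substr() call suggests, that date_added holds a microsecond timestamp string; dropDigits() and secondsToUsec() are hypothetical stand-ins, not FreshRSS functions. Under that assumption, a publication date expressed in seconds would have to be scaled to microseconds before being stored:

```php
<?php
// Stand-in for the raw branch of the proposed dateAdded(): drop $scrape trailing digits.
function dropDigits(string $timestampUsec, int $scrape): string {
    $scrape = max(0, min(6, $scrape));
    return $scrape === 0 ? $timestampUsec : substr($timestampUsec, 0, -$scrape);
}

// Scale a seconds value (such as a publication date) to microseconds.
function secondsToUsec(int $seconds): string {
    return $seconds . '000000';
}

$published = 1578672029; // Fri, 10 Jan 2020 17:00:29 +0100, in seconds
$usec = secondsToUsec($published);

echo dropDigits($usec, 0), "\n"; // 1578672029000000 (timestampUsec)
echo dropDigits($usec, 3), "\n"; // 1578672029000    (crawlTimeMsec)
echo dropDigits($usec, 6), "\n"; // 1578672029       (seconds)

// Without scaling, real digits are lost instead of padding zeros.
echo dropDigits((string) $published, 3), "\n"; // 1578672
```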

Frenzie commented 4 years ago

I don't know about the names, but it looks fine at a glance. I'm in Windows atm so I can't quickly search properly what it currently looks like.

Btw, here's Outlook (default settings): 2020-01-17 17_54_15-Inbox - Outlook

javerous commented 4 years ago

@Frenzie And this received date is the actual date when Outlook received those messages on the end computer (and not a mix between send & receive dates)?

It makes no sense to me to 1/ show that, 2/ order messages by that, 3/ have that as the default.

And so, for example, each morning when you turn on this computer, you see all the e-mails that people sent to you yesterday evening (or during the night) showing the same hour (i.e. when you opened Outlook)?

No, I'm not going to say that it's probably why I don't use Windows and GMail. #preterition

I will prepare a PR with my proposal, and we will discuss from there. I really want this feature ^^'

Frenzie commented 4 years ago

And so, for example, each morning when you turn on this computer, you see all the e-mails that people sent to you yesterday evening (or during the night) showing the same hour (i.e. when you opened Outlook)?

No, of course not. ;-) It's when it arrived on your own server. Afaik it's the same for POP3, IMAP and MS Exchange. But I haven't used POP3 in, well, I can actually almost say decades.