foxmask / django-th

:snake: Trigger Happy - The bus :bus: for your internet services
https://foxmask.org/tag/triggerhappy.html
BSD 3-Clause "New" or "Revised" License

Redis - all rss feeds are not in redis :( #125

Closed: nsteinmetz closed this issue 8 years ago

nsteinmetz commented 8 years ago

Hi,

I have three RSS feeds. When I look into Redis, I see 4 keys related to RSS, but each time it's twice the same content.

Whereas in the logs:

[2015-11-18 20:59:00,015: INFO/MainProcess] Scheduler: Sending due task read-data (django_th.tasks.read_data)
[2015-11-18 20:59:00,019: INFO/MainProcess] Received task: django_th.tasks.read_data[1f2b1a52-d494-48fa-b0bd-e0f6c8451280]
[2015-11-18 20:59:00,043: INFO/MainProcess] Received task: django_th.tasks.put_in_cache[84f7a9dd-96f3-4669-98cd-3a96ad982cb4]
[2015-11-18 20:59:00,046: INFO/MainProcess] Received task: django_th.tasks.put_in_cache[18aa0c44-92c2-4d3b-aa0a-a39a5220bcfa]
[2015-11-18 20:59:00,048: INFO/MainProcess] Received task: django_th.tasks.put_in_cache[25eb2ff5-9829-48de-822d-29023fa9e0a4]
[2015-11-18 20:59:00,051: INFO/MainProcess] Received task: django_th.tasks.put_in_cache[64768793-1f32-4fab-bb50-1472e86239ae]
[2015-11-18 20:59:00,052: INFO/MainProcess] Task django_th.tasks.read_data[1f2b1a52-d494-48fa-b0bd-e0f6c8451280] succeeded in 0.03180823099683039s: None
INFO user: nsteinmetz - provider: ServiceRss - Tweet blog posts - 20 data put in cache
2015-11-18 20:59:00,363 INFO tasks 20287 user: nsteinmetz - provider: ServiceRss - Tweet blog posts - 20 data put in cache
[2015-11-18 20:59:00,363: INFO/Worker-1] user: nsteinmetz - provider: ServiceRss - Tweet blog posts - 20 data put in cache
[2015-11-18 20:59:00,367: INFO/MainProcess] Task django_th.tasks.put_in_cache[84f7a9dd-96f3-4669-98cd-3a96ad982cb4] succeeded in 0.3201201710035093s: None
INFO user: nsteinmetz - provider: ServiceRss - Tweet shared feed from tt-rss nothing new
2015-11-18 20:59:00,651 INFO tasks 20288 user: nsteinmetz - provider: ServiceRss - Tweet shared feed from tt-rss nothing new
[2015-11-18 20:59:00,651: INFO/Worker-2] user: nsteinmetz - provider: ServiceRss - Tweet shared feed from tt-rss nothing new
[2015-11-18 20:59:00,654: INFO/MainProcess] Task django_th.tasks.put_in_cache[18aa0c44-92c2-4d3b-aa0a-a39a5220bcfa] succeeded in 0.6013300749764312s: None
INFO user: nsteinmetz - provider: ServiceRss - Tweet Web Enthusiasts blog posts - 20 data put in cache
2015-11-18 20:59:00,918 INFO tasks 20288 user: nsteinmetz - provider: ServiceRss - Tweet Web Enthusiasts blog posts - 20 data put in cache
[2015-11-18 20:59:00,918: INFO/Worker-2] user: nsteinmetz - provider: ServiceRss - Tweet Web Enthusiasts blog posts - 20 data put in cache
[2015-11-18 20:59:00,921: INFO/MainProcess] Task django_th.tasks.put_in_cache[64768793-1f32-4fab-bb50-1472e86239ae] succeeded in 0.2657051499991212s: None
INFO user: nsteinmetz - provider: ServiceRss - RSS To Twitter nothing new
2015-11-18 20:59:00,923 INFO tasks 20287 user: nsteinmetz - provider: ServiceRss - RSS To Twitter nothing new
[2015-11-18 20:59:00,923: INFO/Worker-1] user: nsteinmetz - provider: ServiceRss - RSS To Twitter nothing new
[2015-11-18 20:59:00,924: INFO/MainProcess] Task django_th.tasks.put_in_cache[25eb2ff5-9829-48de-822d-29023fa9e0a4] succeeded in 0.5566169520025142s: None

Feeds 3 & 4 are "RSS To Twitter" and "Tweet shared feed from tt-rss", which are the same feed, by the way.
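
For context, the flow those logs show is a scheduled read_data task fanning out one put_in_cache task per trigger (hence four tasks for four triggers). A rough sketch, where only the two task names come from the logs and everything else is assumed:

```python
from celery import shared_task

from django_th.models import TriggerService  # assumed model name


@shared_task
def read_data():
    # fan out one put_in_cache task per enabled trigger
    for trigger in TriggerService.objects.filter(status=True):
        put_in_cache.delay(trigger.id)


@shared_task
def put_in_cache(trigger_id):
    # fetch the provider's data for this trigger and store it in Redis
    ...
```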

foxmask commented 8 years ago

The feeds are put in Redis when there is something "new" (new = published between the last time the trigger ran and "today"), and it's twice in cache because I put them in once with:

'th_rss_' + str(trigger_id)

and once with:

'th_rss_uuid' + str(trigger_id)

The second key allows TriggerHappy to serve a feed from a UUID. This is useful when you want to track tweets from a given hashtag: you can then fetch that information from the feed built by TriggerHappy.

So I think it's not a bug.
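
A minimal sketch of that double write, assuming Django's low-level cache API (the helper name and data argument are illustrative):

```python
from django.core.cache import cache


def put_feed_in_cache(trigger_id, data):
    # First entry: keyed by trigger id, read back when publishing
    cache.set('th_rss_' + str(trigger_id), data)
    # Second entry: lets TriggerHappy expose the same data as a feed
    # looked up by UUID (e.g. to follow tweets for a given hashtag)
    cache.set('th_rss_uuid' + str(trigger_id), data)
```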

nsteinmetz commented 8 years ago

OK, so there is an issue with my Dotclear Atom feeds, as they were put in cache each time on the old instance.

On the new one, in fact, I don't see anything in Redis.

Let's close this one for now. I need to understand better what happens...

foxmask commented 8 years ago

If you show me the RSS feed, I could analyze it. I have seen so many feeds that were not well formed; I might find the same thing here too.

nsteinmetz commented 8 years ago

Here's one: https://nicolas.steinmetz.fr/blog/feed/atom or https://nicolas.steinmetz.fr/web-enthusiasts/feed/atom

foxmask commented 8 years ago

I'll keep it open.

I may have found the beginning of an explanation with issue #129.

I made 2 triggers with your Atom feed to create a note in Evernote.

I'm now digging into what goes into the cache but does not come out of it.

It's really crazy...

All of that to improve performance with multiprocessing...

nsteinmetz commented 8 years ago

That's what I used, in fact, but it did not change anything in my case.

If I were to implement a clone of th, I think I would:

And as you did, I would have 2 collector / publisher actions.

foxmask commented 8 years ago

In fact, the data from your feeds were not published because of the database: I had set a date that was not old enough to let the feed be published. Otherwise it's working.

foxmask commented 8 years ago

About your suggestions: I was expecting Celery/Redis to help with that. Maybe I should look for a more sophisticated queuing system which would trigger the tasks when an entry limit is reached. But I don't know if it would be satisfactory with a big quantity of data to handle.

nsteinmetz commented 8 years ago

Do you really have that quantity of data or those performance requirements?

The documents could be simple arrays, hstore, or JSON documents in Postgres; one could suggest MongoDB, but I would prefer not.

At work we started to use Kafka in a Hadoop context, but it seems overkill for such a need (having a ZooKeeper ensemble + Kafka nodes). I was also thinking of Elasticsearch, but it seems too far from the basic use cases.

And the Hadoop ecosystem has all the features, but it's so overkill :-D

If you stay with Redis/Celery, maybe you implemented it too fast. Why not start with only Redis, for example, and then add Celery?

I don't know Redis well enough to say if it's a good choice or not. It seems hashes could be seen as documents.

Or maybe RethinkDB too?

foxmask commented 8 years ago

I saw pykafka too, several weeks ago, in a post about IFTTT's architecture. And I liked it :) But as you said, adding several Java processes for such a little project... :=)

Why Redis: because we can use it without thinking about it; with Django we just use two things, cache.set() and cache.get(), to use the cache system. Why Celery: because it's simple to trigger several processes at once, which is what I want instead of a serial process.
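
As an illustration of how little code that involves, a sketch assuming the django-redis package provides the cache backend (the connection settings and key name are placeholders):

```python
# settings.py -- assuming the django-redis cache backend
CACHES = {
    'default': {
        'BACKEND': 'django_redis.cache.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',
    },
}

# elsewhere in the app, caching really is just the two calls mentioned above
from django.core.cache import cache

feed_data = {'title': 'Tweet blog posts', 'entries': []}  # placeholder payload
cache.set('th_rss_1', feed_data)  # store the fetched feed
cached = cache.get('th_rss_1')    # read it back when publishing
```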

nsteinmetz commented 8 years ago

> In fact, the data from your feeds were not published because of the database: I had set a date that was not old enough to let the feed be published. Otherwise it's working.

I'm not sure I understand your concept of the database (Redis vs Postgres) and "old enough"; otherwise, it should have worked on the initial instance. But as I have dropped everything so far... I cannot test it again.

foxmask commented 8 years ago

When I test a feed again and again, I change the date_triggered in the triggerservices table to a date before the date of the feed.
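
For example, from a Django shell, something along these lines (the model name TriggerService is an assumption; date_triggered and the table name come from the comment above):

```python
from datetime import timedelta

from django.utils import timezone
from django_th.models import TriggerService  # assumed model behind triggerservices

# push the last-run date back so the feed entries look "new" again
trigger = TriggerService.objects.get(id=1)  # hypothetical trigger id
trigger.date_triggered = timezone.now() - timedelta(days=30)
trigger.save()
```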

nsteinmetz commented 8 years ago

Regarding the Celery message mentioned in #51: does that message mean Celery does not work? I thought it was only a simple warning which would not prevent it from working. So is it a simple warning, or blocking? As I saw something happening in the logs, I thought it worked.

foxmask commented 8 years ago

Yes, it's working; that's an old comment.

nsteinmetz commented 8 years ago

Ok, so I may have missed something else in the initial app. Never mind.