federicovagnoni closed this issue 4 years ago
Did you also restart the media analysis service, or the preprocessing service?
Both services read the Kafka topics with the kafka-python library, which has a known bug: the last article processed before the service stops is analyzed again as the first article when the service restarts. The consumer resumes from the last committed message instead of from the one after it.
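To illustrate the off-by-one described above, here is a minimal sketch in plain Python with no Kafka client involved. The list `log`, the helper `resume`, and the article names are made up for illustration. It relies only on Kafka's convention that a committed offset names the next message to read, so committing the offset of the last processed message replays that message on restart.

```python
def resume(log, committed_offset):
    """Return the messages a consumer would read when restarting at committed_offset."""
    return log[committed_offset:]


# Hypothetical partition contents; the list index plays the role of the offset.
log = ["article-0", "article-1", "article-2", "article-3"]
last_processed = 1  # offset of the last article handled before shutdown

# Committing the offset of the last processed message replays that message.
replayed = resume(log, last_processed)

# Committing last_processed + 1 (the next offset to read) avoids the replay.
clean = resume(log, last_processed + 1)

print(replayed[0])  # article-1
print(clean[0])     # article-2
```

This only explains a duplicate right after a restart. It would not account for duplicates that appear while the services keep running.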
I only started the crawler. All the services were already online. The duplicates are generated during execution, not at startup. They don't seem to be very frequent, though.
I added a live tail an hour ago and I haven't seen any duplicates. Let's monitor this a bit longer and look at it again next week.
I found two occurrences in the last 10 minutes: 2 | dd8754ea5ed8093a87f3fd044dc2f9bf770a0956448c5f2effae838f8f84d81b67c6d880587a89e00641fc0d20851d9f561c792461280ff2ca1a234dffa92ab6
The first number counts the occurrences in the topic. I'm pretty sure they are consecutive, like I showed you before.
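A count like the one above could be produced along these lines. This is only a sketch that works on a list of identifiers already read from the topic. The helper names and the sample identifiers are made up, and how the identifiers are extracted from the messages is left out.

```python
from collections import Counter
from itertools import groupby


def duplicate_counts(ids):
    """Map each identifier seen more than once to its number of occurrences."""
    return {key: n for key, n in Counter(ids).items() if n > 1}


def consecutive_runs(ids):
    """Return (identifier, run_length) for identifiers repeated back to back."""
    runs = ((key, len(list(group))) for key, group in groupby(ids))
    return [(key, n) for key, n in runs if n > 1]


# Made-up identifiers standing in for the real article hashes.
sample_ids = ["a1", "b2", "b2", "c3"]

for key, n in duplicate_counts(sample_ids).items():
    print(f"{n} | {key}")  # same "count | id" layout as in the comment above
```

`consecutive_runs` checks whether the repeats really are adjacent in the topic, which is what the comment above suspects.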
I just checked the input_raw and input_preprocessed topics. The exact same duplicates are also in input_preprocessed, but not in input_raw.
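The comparison above amounts to walking the topics in pipeline order and stopping at the first one that contains a repeated identifier. A rough sketch, assuming the identifiers of each topic have already been collected into lists. The function name and the sample data are hypothetical.

```python
from collections import Counter


def first_topic_with_duplicates(topics):
    """topics: list of (topic_name, ids) in pipeline order.

    Return the name of the first topic containing a repeated identifier,
    or None if no topic has duplicates.
    """
    for name, ids in topics:
        if any(n > 1 for n in Counter(ids).values()):
            return name
    return None


# Made-up identifiers; the topic names are the ones mentioned in this thread.
pipeline = [
    ("input_raw", ["a1", "b2", "c3"]),
    ("input_preprocessed", ["a1", "b2", "b2", "c3"]),
]

culprit = first_topic_with_duplicates(pipeline)
print(culprit)  # input_preprocessed
```

The service that writes to the returned topic is then the first place to look.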
So the problem seems to originate in the preprocessing service. @dmgutierrez I am assigning this to you.
Analyzed media processes duplicates originating earlier in the queue (added issue #63).
We started the crawler again this morning and it seems that certh_media_service is producing duplicates.
Reading from the analyzed_media topic in Kafka, we can see an example: