Duplicates on analyzed_media topic

fandangOrg / fandango

FAke News discovery and propagation from big Data ANalysis and artificial intelliGence Operations

Duplicates on analyzed_media topic #41

Closed federicovagnoni closed 4 years ago

federicovagnoni commented 5 years ago

We started the crawler again this morning and it seems that certh_media_service is producing duplicates.

Reading from the analyzed_media topic in Kafka, we can see an example: [image]

pstalidis commented 5 years ago

Did you also restart the media analysis service, or the preprocessing service?

Both services read the Kafka topics with the kafka-python library, which has a known bug: the last article processed before the service stops is analyzed again as the first article when the service starts. This happens because consumption resumes from the last committed message rather than after it.
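A minimal sketch of the offset behaviour described above, using a plain Python list in place of a real Kafka partition so that it runs without a broker. In Kafka, the committed offset is the position of the next message to read, so a consumer that commits the offset of the last processed message (instead of that offset plus one) reads that message again after a restart. The names `resume_from` and `run_once` are hypothetical and are not part of kafka-python or of the FANDANGO services.

```python
def resume_from(log, committed_offset):
    """Return the (offset, message) pairs a consumer reads after restarting."""
    return list(enumerate(log))[committed_offset:]


def run_once(log, committed_offset, commit_next):
    """Process everything available; return (processed, new_committed_offset).

    commit_next=True  commits last_offset + 1 (the next message to read).
    commit_next=False commits last_offset itself, which causes a re-read.
    """
    processed = []
    for offset, message in resume_from(log, committed_offset):
        processed.append(message)
        committed_offset = offset + 1 if commit_next else offset
    return processed, committed_offset


# Hypothetical messages standing in for articles on a topic.
log = ["article-a", "article-b", "article-c"]

# Committing the last processed offset: the last article is replayed.
first, committed = run_once(log, 0, commit_next=False)
replayed, _ = run_once(log, committed, commit_next=False)

# Committing last offset + 1: nothing is replayed after the restart.
first_ok, committed_ok = run_once(log, 0, commit_next=True)
replayed_ok, _ = run_once(log, committed_ok, commit_next=True)

print("replayed with off-by-one commit:", replayed)
print("replayed with correct commit:", replayed_ok)
```

This only models a restart. It would not by itself explain duplicates that appear while the services keep running.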

federicovagnoni commented 5 years ago

I only started the crawler. All the services were already online. The duplicates, however, are generated during execution. They do not seem to be very frequent, anyway.

pstalidis commented 5 years ago

I added a live tail an hour ago and I haven't seen any duplicates. Let's monitor this a bit longer and look at it again next week.

federicovagnoni commented 5 years ago

I found two occurrences in the last 10 minutes: 2 | dd8754ea5ed8093a87f3fd044dc2f9bf770a0956448c5f2effae838f8f84d81b67c6d880587a89e00641fc0d20851d9f561c792461280ff2ca1a234dffa92ab6

The first number counts the occurrences in the topic. I'm pretty sure they are consecutive, like the ones I showed you before.
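A count of this kind could be produced along the following lines. This is only a sketch: it assumes the message identifiers have already been read from the topic into a list, in arrival order, and the sample identifiers and function names are hypothetical. It also checks whether repeats are back to back.

```python
from collections import Counter
from itertools import groupby


def count_duplicates(identifiers):
    """Return {identifier: count} for identifiers seen more than once."""
    return {key: n for key, n in Counter(identifiers).items() if n > 1}


def consecutive_runs(identifiers):
    """Return {identifier: longest run length} for back-to-back repeats."""
    runs = {}
    for key, group in groupby(identifiers):
        length = sum(1 for _ in group)
        if length > 1:
            runs[key] = max(length, runs.get(key, 0))
    return runs


# Hypothetical identifiers in the order they were read from the topic.
seen = ["id-1", "id-2", "id-2", "id-3", "id-1"]

# Print one "count | identifier" line per duplicated identifier.
for key, n in count_duplicates(seen).items():
    print(f"{n} | {key}")

print("consecutive repeats:", consecutive_runs(seen))
```

In the sample data both `id-1` and `id-2` occur twice, but only `id-2` is repeated consecutively.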

[image]

pstalidis commented 5 years ago

I just checked the input_raw and input_preprocessed topics. The exact same duplicates are also in input_preprocessed, but not in input_raw.
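The comparison above can be sketched as a walk over the topics in pipeline order, reporting the first one that contains repeated identifiers. The order input_raw, input_preprocessed, analyzed_media is an assumption based on this thread, and the sample identifiers and function names are hypothetical.

```python
def has_duplicates(identifiers):
    """True if any identifier appears more than once."""
    return len(identifiers) != len(set(identifiers))


def first_topic_with_duplicates(topics):
    """topics: list of (topic_name, identifiers) in pipeline order.

    Returns the name of the first topic that contains duplicates,
    or None if every topic is clean.
    """
    for name, identifiers in topics:
        if has_duplicates(identifiers):
            return name
    return None


# Hypothetical snapshot of identifiers read from each topic.
topics = [
    ("input_raw", ["id-1", "id-2", "id-3"]),
    ("input_preprocessed", ["id-1", "id-2", "id-2", "id-3"]),
    ("analyzed_media", ["id-1", "id-2", "id-2", "id-3"]),
]

print(first_topic_with_duplicates(topics))
```

The service that writes to the first affected topic is the one to inspect first.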

So the problem seems to originate in the preprocessing service. @dmgutierrez, I am assigning this to you.

pstalidis commented 4 years ago

The duplicates on analyzed_media come from messages that were already duplicated earlier in the queue (added issue #63).