Fix `url` and `thread_url` in Combined Summary Documents

kouloumos commented 1 week ago

The "Push Combined Summary From XML Files to ES INDEX" cron job is currently pushing combined summary documents to the Elasticsearch index with incorrect URLs. Specifically, the url and thread_url fields in each summary document are being set to the link of the last reply in the resource, rather than the correct, original link.

Background

The URLs in question are produced in the read_xml_file method as part of the XML processing workflow. The issue itself originates upstream, in the "XML Generation" cron job, which writes the incorrect url into these XML files.

Proposed Solution

Review the URL generation logic in the XML generation cron job to ensure that link points to the original resource rather than to the last reply in that resource. The link we want here is actually the thread_url.
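As a rough illustration of the intended behaviour, the sketch below picks the link of the earliest post in a thread instead of the latest one. It is not taken from the summarizer code: `pick_thread_url`, the `posts` list, and its `url` / `created_at` keys are hypothetical names chosen for this example, and the real XML generation script may structure its data differently.

```python
from datetime import datetime


def pick_thread_url(posts):
    """Return the URL of the earliest post in a thread (hypothetical helper)."""
    if not posts:
        raise ValueError("a thread needs at least one post")
    # The original post is assumed to be the one with the earliest timestamp.
    first_post = min(posts, key=lambda post: post["created_at"])
    return first_post["url"]


# Hypothetical thread data; only the field names matter for the sketch.
posts = [
    {"url": "https://example.com/thread/1#reply-2", "created_at": datetime(2024, 1, 3)},
    {"url": "https://example.com/thread/1", "created_at": datetime(2024, 1, 1)},
    {"url": "https://example.com/thread/1#reply-1", "created_at": datetime(2024, 1, 2)},
]

feed_data = {
    "title": "Example thread",
    # Use the first/main post link rather than the link of the last reply.
    "url": pick_thread_url(posts),
}
```

Selecting by timestamp rather than by list position avoids depending on the order in which replies happen to be collected.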

urvishp80 commented 4 days ago

@kouloumos based on your proposed solution for the script here, I'm thinking of setting the value of the url key to the link of the main/first post for the given title.

feed_data = {
    ...
    'url': <The link of the first/main post (instead of the latest post)>,
    ...
}

kouloumos commented 4 days ago

> I'm thinking of setting the value of the url key to the link of the main/first post for the given title.
>
> feed_data = {
>     ...
>     'url': <The link of the first/main post (instead of the latest post)>,
>     ...
> }

Will that result in the thread_url of the combined summary matching the thread_url of the individual documents in the thread? If so, proceed with that.
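One way to verify this before pushing to the index is a small consistency check like the sketch below. It assumes, purely for illustration, that documents are plain dicts with `id` and `thread_url` keys; `mismatched_thread_urls` is a hypothetical helper, not part of the existing scripts.

```python
def mismatched_thread_urls(combined_summary, thread_documents):
    """Return ids of documents whose thread_url differs from the combined summary's."""
    expected = combined_summary["thread_url"]
    return [
        doc["id"]
        for doc in thread_documents
        if doc.get("thread_url") != expected
    ]


# Hypothetical documents for a single thread.
combined_summary = {"id": "combined-1", "thread_url": "https://example.com/thread/1"}
thread_documents = [
    {"id": "post-1", "thread_url": "https://example.com/thread/1"},
    {"id": "post-2", "thread_url": "https://example.com/thread/1"},
    {"id": "post-3", "thread_url": "https://example.com/thread/1#reply-2"},
]

mismatches = mismatched_thread_urls(combined_summary, thread_documents)
```

An empty result would indicate that the combined summary and the individual documents agree on the thread_url.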

bitcoinsearch / summarizer

Fix `url` and `thread_url` in Combined Summary Documents #64
