IATI / refresher

A Python application which has the responsibility of tracking IATI data from around the Web and refreshing the core IATI software's data stores
GNU Affero General Public License v3.0

Possible off-by-one or de-duplication error in transaction explosion #266

Closed akmiller01 closed 10 months ago

akmiller01 commented 1 year ago

Brief Description The Unified Platform transaction records for a particular activity are missing one transaction. IATI identifier is XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048. Examination of the underlying XML shows there should be two identical transactions of 60,000 EUR:

[Screenshot: the activity XML, showing the two identical 60,000 EUR transactions]

But an API query to /datastore/transaction/select?q=iati_identifier:"XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048"&fl=transaction_transaction_date_iso_date,transaction_value shows only one.

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":3,
    "params":{
      "q":"iati_identifier:\"XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048\"",
      "fl":"transaction_transaction_date_iso_date,transaction_value"}},
  "response":{"numFound":8,"start":0,"numFoundExact":true,"docs":[
      {
        "transaction_value":[980000.0],
        "transaction_transaction_date_iso_date":["2018-10-08T00:00:00Z"]},
      {
        "transaction_value":[300000.0],
        "transaction_transaction_date_iso_date":["2018-10-08T00:00:00Z"]},
      {
        "transaction_value":[300000.0],
        "transaction_transaction_date_iso_date":["2018-10-08T00:00:00Z"]},
      {
        "transaction_value":[544000.0],
        "transaction_transaction_date_iso_date":["2018-10-26T00:00:00Z"]},
      {
        "transaction_value":[240000.0],
        "transaction_transaction_date_iso_date":["2018-12-14T00:00:00Z"]},
      {
        "transaction_value":[240000.0],
        "transaction_transaction_date_iso_date":["2020-05-13T00:00:00Z"]},
      {
        "transaction_value":[136000.0],
        "transaction_transaction_date_iso_date":["2022-12-22T00:00:00Z"]},
      {
        "transaction_value":[60000.0],
        "transaction_transaction_date_iso_date":["2022-12-22T00:00:00Z"]}]
  }}

The data is correct at the activity level, where both 60,000 EUR values appear: /datastore/activity/select?q=iati_identifier:"XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048"&fl=transaction_transaction_date_iso_date,transaction_value

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":2,
    "params":{
      "q":"iati_identifier:\"XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048\"",
      "fl":"transaction_transaction_date_iso_date,transaction_value"}},
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "transaction_value":[980000.0,
          300000.0,
          300000.0,
          544000.0,
          240000.0,
          240000.0,
          136000.0,
          60000.0,
          60000.0],
        "transaction_transaction_date_iso_date":["2018-10-08T00:00:00Z",
          "2018-10-08T00:00:00Z",
          "2018-10-08T00:00:00Z",
          "2018-10-26T00:00:00Z",
          "2018-12-14T00:00:00Z",
          "2020-05-13T00:00:00Z",
          "2022-12-22T00:00:00Z",
          "2022-12-22T00:00:00Z",
          "2022-12-22T00:00:00Z"]}]
  }}
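The discrepancy between the two responses can be checked mechanically. The following is a minimal sketch that compares only the `transaction_value` lists copied from the two responses above; the variable names are illustrative, and a fuller check would pair each value with its date.

```python
from collections import Counter

# Values copied from the activity core response (one document, nine values).
activity_values = [
    980000.0, 300000.0, 300000.0, 544000.0, 240000.0,
    240000.0, 136000.0, 60000.0, 60000.0,
]

# Values copied from the transaction core response (eight documents).
transaction_values = [
    980000.0, 300000.0, 300000.0, 544000.0, 240000.0,
    240000.0, 136000.0, 60000.0,
]

# Counter subtraction keeps only values that occur more often in the
# activity core than in the transaction core.
missing = Counter(activity_values) - Counter(transaction_values)
print(dict(missing))  # {60000.0: 1}
```

Note that the duplicated 300,000 and 240,000 values survive in the transaction core; only the 60,000 pair is collapsed.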

Severity High

Issue Location /datastore/transaction/select?q=iati_identifier:"XI-IATI-EC_ECHO-ECHO/-AF/BUD/2018/92048"&fl=transaction_transaction_date_iso_date,transaction_value

Steps to Reproduce

  1. Visit Datastore API or Datastore Search.
  2. Query transaction SOLR core
  3. See missing transaction

Expected Results/Behaviour Two identical transaction rows for 60,000 EUR each.

Actual Results/Behaviour One transaction row for 60,000 EUR.

odscjames commented 1 year ago

Think this is the culprit: https://github.com/IATI/refresher/blob/develop/src/library/solrize.py#L311

         doc['id'] = utils.get_hash_for_identifier(json.dumps(doc))

The document in the lake still has both transactions (id 83d4df71ad40ab2b40e08fcfa7d96c9c86c00c2d).

However, when the final solrize stage runs, that line generates the SOLR id by hashing only the document's data, so two transactions that are exactly the same get the same SOLR id, and the second overwrites the first in the index.

Note that the very same activity has 2 transactions for 300,000, but because they are slightly different (they have different receiver-org details) you can see 300,000 twice in the transaction query results above.
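To illustrate the collision, here is a minimal sketch. It assumes `utils.get_hash_for_identifier` behaves like a SHA-1 hex digest of its input (the `hash_identifier` stand-in below is hypothetical, not the refresher's code), and it models SOLR's replace-on-same-id behaviour with a plain dict. Mixing the transaction's position into the hash is one possible way to keep identical transactions apart; it is not necessarily the fix that was applied.

```python
import hashlib
import json


def hash_identifier(value: str) -> str:
    # Hypothetical stand-in for utils.get_hash_for_identifier (assumed SHA-1).
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


def explode_content_only(transactions):
    # Mirrors the reported behaviour: the id depends only on the data.
    docs = []
    for tx in transactions:
        doc = dict(tx)
        doc["id"] = hash_identifier(json.dumps(doc))
        docs.append(doc)
    return docs


def explode_with_position(transactions):
    # Possible alternative: include the transaction's index in the hash input.
    docs = []
    for index, tx in enumerate(transactions):
        doc = dict(tx)
        doc["id"] = hash_identifier(json.dumps([index, doc]))
        docs.append(doc)
    return docs


def index_by_id(docs):
    # Models the index: a later document with the same id replaces the earlier one.
    return {doc["id"]: doc for doc in docs}


date = ["2022-12-22T00:00:00Z"]
transactions = [
    {"transaction_value": [136000.0], "transaction_transaction_date_iso_date": date},
    {"transaction_value": [60000.0], "transaction_transaction_date_iso_date": date},
    {"transaction_value": [60000.0], "transaction_transaction_date_iso_date": date},
]

print(len(index_by_id(explode_content_only(transactions))))   # 2
print(len(index_by_id(explode_with_position(transactions))))  # 3
```

With content-only hashing the two 60,000 transactions collapse into one indexed document; with the position included, all three are kept.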

odscjames commented 11 months ago

Now on develop for testing.

odscjames commented 10 months ago

The example in this bug report is now correct in the production Datastore.