clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

Actionability curations have missing genetic condition associations #30

Closed tnavatar closed 3 years ago

tnavatar commented 4 years ago

This issue presented as a disparity in data between dev and stage systems, with the stage system missing data that was included in dev. On inspection, it appears that this is the result of bad data being sent over the streaming service from the ACI recently.

Genes for which disparities were identified: PTEN (HGNC:9588), CYP21A2 (HGNC:2600), STK11 (HGNC:11389), BRCA1 (HGNC:1100), HNF1A (HGNC:11621), APC (HGNC:583).

sgoehringer commented 4 years ago

@tnavatar I was checking on the status for the missing data so we can continue testing and then review with stakeholders.

tnavatar commented 4 years ago

I'm currently investigating the root cause of this. When I rebuild the database in dev I also observe missing data. Will push a fix to stage once I have an update; if the cause is unit-testable will add tests to the code to prevent regressions.

tnavatar commented 4 years ago

Many actionability curations have been pushed to the streaming service with bad data recently. As an example, here's the one for BRCA1, which is one of the genes we're having trouble with. I think these were present on my dev box and not in stage because I stopped listening to new actionability updates on my dev machine, while stage has been keeping up with the new updates.

{
  "jsonMessageVersion" : "AV1",
  "statusPublishFlag" : "Publish",
  "type" : "actionability",
  "affiliations" : [
    {
      "id" : "Adult AWG",
      "name" : "Adult Actionability Working Group"
    }
  ],

  "iri" : "https://actionability.clinicalgenome.org/ac/Adult/api/sepio/doc/AC133",
  "title" : "Hereditary Breast and Ovarian Cancer - ",
  "curationVersion" : "1.1.1",
  "statusFlag" : "Released",
  "dateISO8601" : "2019-10-08T00:00:00+00:00",
  "releaseNotes" : "",
  "surveyDetails" : "https://actionability.clinicalgenome.org/ac/ui/stg1RuleOutRpt?doc=AC133",
  "earlyRuleOutStatus" : "Complete",
  "scoreDetails" : "https://actionability.clinicalgenome.org/ac/Adult/ui/stg2SummaryRpt?doc=AC133",
  "genes" : [

  ],
  "conditions" : [

  ],
  "scores" : [
    {
      "Outcome" : "Breast Cancer (BRCA1)",
      "Severity" : "",
      "Likelihood" : "",
      "Interventions" : [

      ]
    },
    {
      "Outcome" : "Ovarian Cancer (BRCA1)",
      "Severity" : "",
      "Likelihood" : "",
      "Interventions" : [

      ]
    },
    {
      "Outcome" : "Breast Cancer (BRCA2)",
      "Severity" : "",
      "Likelihood" : "",
      "Interventions" : [

      ]
    },
    {
      "Outcome" : "Ovarian Cancer (BRCA2)",
      "Severity" : "",
      "Likelihood" : "",
      "Interventions" : [

      ]
    }
  ],
  "searchDates" : [
    "2014-01-28T00:00:00Z"
  ]
}

tnavatar commented 4 years ago

Followed up with Sai on Slack, referenced the comment with the data in this issue.

tnavatar commented 4 years ago

This problem points to the need to have SHACL shapes for incoming data which perform validation and do not import data that fails validation. I believe that this feature would address the above issue, as existing (valid) data would not be overwritten by incoming invalid data, as it appears to be the case in this instance. Will create a separate ticket for this feature.
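A minimal sketch of the validate-before-import idea, written in plain Python as a stand-in rather than actual SHACL. The names `is_valid_curation` and `import_if_valid` and the dictionary-backed store are hypothetical, and the rule itself (a curation needs an IRI, at least one gene, and at least one condition) is an assumption based on the empty `genes` and `conditions` arrays in the BRCA1 message above.

```python
def is_valid_curation(message: dict) -> bool:
    """Assumed rule: a curation must carry an IRI, a gene, and a condition."""
    return (
        bool(message.get("iri"))
        and bool(message.get("genes"))
        and bool(message.get("conditions"))
    )


def import_if_valid(store: dict, message: dict) -> bool:
    """Store the message under its IRI only if it validates.

    Returns True when the message was imported. An invalid message is
    dropped, so any earlier valid entry for the same IRI is left intact.
    """
    if not is_valid_curation(message):
        return False
    store[message["iri"]] = message
    return True
```

With a gate like this, a later message with empty `genes` and `conditions` for the same IRI would be skipped instead of overwriting the earlier valid curation.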

tnavatar commented 4 years ago

Issue #33 opened to address this concern.

tnavatar commented 4 years ago

I've implemented filtering by SHACL constraints, and there are currently 124 curations in the database. The above genes all now show valid data from GraphQL on the staging instance (ds-stage.clingen.info). According to the list on actionability.clinicalgenome.org, there are 121 curations. @sgoehringer had a larger number (207); please let me know whether you believe that number to be correct or whether the lower figure is accurate. If the lower figure is accurate, I believe the extra 3 curations may come from a previous bug in the ACI where curations were published to the production topic with a 'localhost' IRI as opposed to the correct ACI.
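One way to pick out the suspected 'localhost' curations would be to check the host part of each message's `iri` field. This is only a sketch: the helper name `has_localhost_iri` is hypothetical, and treating `127.0.0.1` as equivalent to `localhost` is an assumption.

```python
from urllib.parse import urlparse


def has_localhost_iri(message: dict) -> bool:
    """Return True if the message's IRI points at a local host."""
    host = urlparse(message.get("iri", "")).hostname
    return host in ("localhost", "127.0.0.1")
```

Counting the messages for which this returns True should show whether they account for the difference between 124 and 121.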

tnavatar commented 4 years ago

For reference, the list on the ACI: https://actionability.clinicalgenome.org/ac/

sgoehringer commented 4 years ago

I checked the ACI and the correct number is 121. The 207 was a mistake. Do we know which three curations were mistakenly published with the localhost IRI, or did they share them?

tnavatar commented 4 years ago

They shared those curations earlier; they’re at the beginning of the queue. I might be able to trim them using some clever topic expiration rules in Kafka; I think they don’t affect the data users can see, but will verify.

sgoehringer commented 4 years ago

Two other questions. 1 - Would it be possible to see when the data issue started? I ask because it would help to confirm that anything after that date was a message for an update or a removal. This will speed up the spot-check process. 2 - Can you see if the data they are sending is still having issues? I believe they released their update and I wasn't sure if they corrected the issue. (Granted, they released the update Monday, so new curations may not have been done yet.)

tnavatar commented 4 years ago

1) The first message that fails validation was sent 2019-10-03T14:12:01 2) Short answer is yes, they are still having issues--although there are some messages that have passed validation in-between when this started and now. Two of the last three messages were bad, for example.
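For anyone repeating this check, a rough sketch of finding the earliest failing message follows. It assumes the messages are available as `(timestamp, message)` pairs with ISO 8601 timestamps and that a validity check is passed in as `is_valid`; both the function name `first_failure` and that input shape are hypothetical.

```python
from datetime import datetime
from typing import Callable, Iterable, Optional, Tuple


def first_failure(
    messages: Iterable[Tuple[str, dict]],
    is_valid: Callable[[dict], bool],
) -> Optional[datetime]:
    """Return the earliest timestamp among messages that fail validation.

    Returns None when every message passes or the input is empty.
    """
    failures = [
        datetime.fromisoformat(timestamp)
        for timestamp, message in messages
        if not is_valid(message)
    ]
    return min(failures, default=None)
```

Because valid and invalid messages are interleaved, the date this returns marks only the start of the problem; messages after it still need to be checked one by one.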