Variant search fails when there are transcripts without geneIds

ShifaSZ commented 1 year ago

Describe the bug A clear and concise description of what the bug is, and what you instead expect to be happening When searching variants on the family BEG_916 of project CMG_Beggs_WGS with a location chr20:45824623-45827445 filter, the search fails and shows an error message of '<' not supported between instances of 'NoneType' and 'str'. The error can be reproduced with the dev branch. I can narrow down the cause for the error because the function create_json_response() can't handle the None key in the transcripts of the variants in response['searchedVariants']. I believe the None keys are caused by missing geneId for the transcript.

Link to page(s) where bug is occurring https://seqr.broadinstitute.org/variant_search/results/23900a15986795797a094246a88bc4d0?page=1&sort=pathogenicity with a location filter chr20:45824623-45827445.

Scope of the bug Are you experiencing this issue in all projects? Within specific families within one project? I didn't check other projects.

Screenshots If applicable, add screenshots to help explain your problem.

ShifaSZ commented 1 year ago

I ran a search below and got 567 hits.

GET r0485_cmg_beggs_wgs__wgs__grch38__sv__v02__vcfv04__20230615/_search
{
  "query": {
    "nested":{
      "path":"sortedTranscriptConsequences",
      "query": {
        "bool": {
          "must_not": {
            "exists" : {
              "field" : "sortedTranscriptConsequences.gene_id"
            }
          }
        }
      }
    }
  },
  "_source": ["contig","start", "sortedTranscriptConsequences"]
}

Some typical hits are as the followings.

        "_source" : {
          "sortedTranscriptConsequences" : [
            {
              "gene_symbol" : "H3-2",
              "gene_id" : null,
              "major_consequence" : "NEAREST_TSS"
            }
          ],
          "start" : 143192770,
          "contig" : "1"
        }
        ...
        "_source" : {
          "sortedTranscriptConsequences" : [
            {
              "gene_symbol" : "ABC7-42404400C24.1",
              "gene_id" : null,
              "major_consequence" : "NEAREST_TSS"
            },
            {
              "gene_symbol" : "NUTM2A",
              "gene_id" : "ENSG00000184923",
              "major_consequence" : "NEAREST_TSS"
            }
          ],
          "start" : 47733233,
          "contig" : "10"
        }

The index r0485_cmg_beggs_wgs__wgs__grch38__sv__v02__vcfv04__20230615 is for structural variants. I don't see a previously loaded index with a null gene_id. Although we don't fix the bug by fixing the data, I want to confirm if the data is correctly loaded.

hanars commented 1 year ago

Thanks for investigating, this is really helpful information! @bpblanken looks like a change to the pipeline was introduced in the latest round of loading, I think in the past we maybe fully filtered out transcripts with no matched gene id? Or possibly this was a change to the data we were delivered? Either way, since at this point we have loaded SV data for like 100 projects I think changing this back and reloading is more effort than it is worth and the solution is to have seqr handle this case for now. We might want to think about how we handle this case in the v3 pipeline, but for now lets include these transcripts and see how the users react to them. Also, @ShifaSZ once you fix this we should confirm that the way these transcripts are shown to users is okay, we may need to tweak the ui

bpblanken commented 1 year ago

There was indeed a behavior change here! The new pipeline will allow genes that are not present in the gencode mapping, whereas the old pipeline would hard fail but not filter transcripts that don't match. Looking at my commit history, I think this was actually pseudo-intentional... as in, there was an intentional commit to switch the gencode_mapping[gene] access to gencode_mapping.get(gene, hl.missing(hl.tstr)) in order to get the test SV callset to run, but I don't think the commit was fully thought through.

hanars commented 1 year ago

It happens! We can build a work around in seqr for it so its not a huge deal

bpblanken commented 1 year ago

What should the behavior here be?

1) Should we filter missing genes out? 2) Should we update gencode to see if that resolves? (I think this is a heavier lift than just updating the version number and maybe requires communication with the analysts?) 3) Should revert to hard-failure?

hanars commented 1 year ago

So we are not going to reload SVs now, so the behavior for the time being will be to maintain the change and have seqr decide how to handle these. I do think theres a real chance that this is an issue of a mismatch between the version of gencode that was used in the calling pipeline and the version we used in our loading pipeline, and I think reverting to hard failure would force us to have conversations with the Talkowski group at the point where they deliver the data. I think in the past we used a different version of gencode for SVs and SNPs and had no real issues with that, but also maybe we made them redeliver with a different genocode version

My one concern is that as we start accepting SVs from other groups, how will we handle using the correct gencode version? I think this might be something we actually want to discuss with the Talkowski group and come up with a resilient solution to, athough I think hard failure is a good interim solution (although not an urgent change)

larrybabb commented 1 year ago

@hanars (please check that I'm specifying the plan clearly here)... Verify that there is a consistent approach to recreating the bug (demo/doc that). Demonstrate the variants that are coming back with the new model that used to be in seqr and what it looks like now to get sign off from the users.

hanars commented 1 year ago

No sign off needed from users, sign off needed from me

broadinstitute / seqr

Variant search fails when there are transcripts without geneIds #3461