Clinsig filter not working if variant has multiple values

bjhall commented 5 years ago

If a variant has multiple clinsig values, for instance "Pathogenic/Likely pathogenic", it cannot be found using the CLINSIG-filter.

Multiple clinsig.values are stored in the database as a comma-separated string, like this: "Pathogenic, Likelypathogenic": https://github.com/Clinical-Genomics/scout/blob/e531c3b15769672b0b6ad92e5d75ecd0ac9c5f8a/scout/parse/variant/clnsig.py#L42

When the filter is applied (with Pathogenomic and/or Likely pathogenic checked) it matches the string clinsig.value "Pathogenic, Likely pathogenic" against the list of selected values ["Pathogenic", "Likely pathogenic"], which will not match.

Responsible code: https://github.com/Clinical-Genomics/scout/blob/e531c3b15769672b0b6ad92e5d75ecd0ac9c5f8a/scout/adapter/mongo/query.py#L292-L296

I'd be happy to fix this myself but I would need some input!

My first attempt was to store the values in the database as a list, but that breaks things in other places. Alternatively, I guess one could do some weird regex matching in the mongo query, but that seems suboptimal...

northwestwitch commented 5 years ago

Hi and thanks for noticing! This problem is not caused by variants having multiple clinsigs, but because the numerical clinsig values are treated as strings in the query downstream.

Try to change the code in scout/scout/adapter/mongo/query.py, line 305, into: rank.append(int(item))

If it works you can create a pull request that fixes the bug! :)

bjhall commented 5 years ago

Thanks for the response!

I tested your suggested fix and unfortunately it did not solve my problem. We are not using numerical clinsig values. I'll try to be more specific:

In the VCF, the CLNSIG-annotation for the variant looks like this (toplevel INFO): ...CLNSIG=Pathogenic/Likely_pathogenic;...

After the vcf is loaded into scout the clnsig field for the variant looks like this in the database:

        "clnsig" : [
                {
                        "revstat" : "criteria_provided, multiple_submitters, no_conflicts",
                        "accession" : 265359,
                        "value" : "Pathogenic, Likely pathogenic"
                }
        ],

which is expected from reading the code. Note that the "value" is a comma-separated string, rather than a list.

Finally when I apply a CLNSIG filter a list named "rank" is created and used in this code section: https://github.com/Clinical-Genomics/scout/blob/a69a287eeae847983f4d331042601cd0c9231157/scout/adapter/mongo/query.py#L268-L287

...and if I understand it correctly that list will look like this if I selected Pathogenic and Likely Pathogenic:

rank = [4, "Likely pathogenic", 5, "Pathogenic]

and then in the mongo query it will do:

'value': { '$in': rank }

and since the value for this variant is the string "Pathogenic, Likely pathogenic", it won't be found in the rank-list. If the value was a list, I suppose it would have worked.

Sorry for the long explanation, I'm just trying to be as clear as possible. These filters are quite tricky to debug, so I may still be on the wrong track, but I feel that I've narrowed down the bug.

bjhall commented 5 years ago

I suppose that you are still using the old ClinVar format? I read the parsing code in scout/scout/parse/variant/clnsig.py again and realized that the clinsig-field is constructed very differently for the old ClinVar-version (with numeric values) than for the new one.

I can probably make an attempt at fixing the parsing code for the new Clinvar-format tomorrow, unless the person who wrote that code objects :)

northwestwitch commented 5 years ago

Ok then it was another bug that we've fixed.

regarding the points above I haven't written this part of the code but I think that the issue is in

scout/parse/variant/clnsig:

In your case the code enters line 30 (if isinstance(acc, int)). In our case it goes in the else (line 46).

In your database you shouldn't have strings like "value" : "Pathogenic, Likely pathogenic" but different clinsigs should be different elements of the variant 'clnsig' array. To fix it one should probably loop over the elements of sig_groups and create a new clnsig_accession for each of them.

It doesn't matter if the clnsig.value is a number of a string, they should be both recognized when you filter.

I can help you with this tomorrow!

northwestwitch commented 5 years ago

We might have the same issue with our variants: #695

bjhall commented 5 years ago

Thanks Chiara for being so responsive! Your commit (5643e6f) should fix the problem described in this issue. However the same problem remains for the REVSTAT-filter. Not really a problem for us at the moment, since we're not really using that filter.

As @moonso is implying in #695, the new ClinVar-format does not really fit with how Scout is storing the ClinVar-data, since the CLINSIG and REVSTAT are no longer "connected", they just list all the values for the variant. However, the last release for the old format is from 2017 so I suppose Scout needs to embrace the new format...

Maybe something like this?

"clnsig_v2" : {
    "revstat" : ["criteria_provided", "multiple_submitters", "no_conflicts"],
    "accession" : 265359,
    "value" : ["Pathogenic", "Likely pathogenic"]
}

I.e. just have a single dictionary, and have all the revstat and clnsig values in lists. But I have no idea to implement this in a backwards-compatible way.

Any thoughts?

northwestwitch commented 5 years ago

Hi, I think the easiest would be to loop over the value string and create as many clinsig array elements as the terms separated by comma. I'm trying to address this in #1163. You could maybe test this branch and see it works?

northwestwitch commented 5 years ago

The idea is to fix the query so instead of being like this:

{ 'clnsig.value' : { '$in' :  ['Pathogenic', 'Likely pathogenic']}}

You'll have this:

{'clnsig' : { 
       '$elemMatch' :  { 
               '$or' : [
                      { 'value' : { '$in' : ['Pathogenic', 'Likely pathogenic']}},
                      { 'value' : re.compile('|'.join(['Pathogenic', 'Likely pathogenic'])) # regular expressions
               ]
        }
   }

One could do the same to fix the revstats

bjhall commented 5 years ago

I tested the branch and it seems to work!

However, I am not sure what the expected behaviour is when both selecting for instance "Pathogenic" in the CLNSIG-select and checking the "CLINSIG Confident"-box. I expected to see variants which were Pathogenic and had high confidence REVSTAT. Instead it appears that I get any high confidence REVSTAT regardless och CLNSIG-value.

Maybe that's the expected result? In that case you can probably close this issue now!

northwestwitch commented 5 years ago

Ok then we need to add an $and instead of an $or somewhere because I don't think it's the expected result..

northwestwitch commented 5 years ago

I've pushed an update to the branch so it should combine the revstat + clinsig value for filtering. Could you give it a try when you have time?

bjhall commented 5 years ago

Works perfectly! Thanks for bearing with me :)

northwestwitch commented 5 years ago

You're welcome, a problem of a user is everybody's problem with a software ;) Thank you for for notifying the bug!

dnil commented 5 years ago

Well found, fought and fixed! Expected output indeed Pathogenic and high confidence revstat, although this may be rather confounded by variants with multiple unordered values for CLNSIG and REVSTAT. Fingers crossed for the test on a heavily populated db.

Clinical-Genomics / scout

Clinsig filter not working if variant has multiple values #1160