Closed damiansm closed 5 years ago
Seb did this, but where did we get to with testing it, @julesjacobsen, and including it in the next db build? I have a feeling you started testing, saw some changes and panicked, and we did not take it further?
Yep, that's exactly the case. It 'works' in the sense that data comes out of the matrix, but it's different and I'm not sure if it's good different or bad different. In the couple of cases I looked at, the links returned seemed less relevant than before, so I left it. This still needs to be investigated.
I am testing this on a few cases to investigate the differences and confirm they are genuine and due to a new, improved matrix.
Can we think about a more systematic approach? I think we will have more updates like that in the future.
Any suggestions? Would you write code to directly compare the two matrices and identify gains, losses and/or major changes in scores? The hard part would be assessing whether these make sense in the context of the 2 versions of StringDB, rather than just manually using the website. I guess some sort of code that looks at the two downloads of StringDB and makes sure the evidence for each change exists in there?
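A minimal sketch of that direct comparison, assuming each matrix can be exported as a mapping from a gene pair to a score. The gene names, scores, the `pair` helper and the 0.1 threshold are all made up for illustration; nothing here reads the real rw file format.

```python
def pair(gene_a, gene_b):
    """Order-independent key so (A, B) and (B, A) compare as the same link."""
    return tuple(sorted((gene_a, gene_b)))


def compare_scores(old, new, threshold=0.1):
    """Return (gained, lost, changed) between two {pair: score} mappings."""
    gained = sorted(set(new) - set(old))   # links only in the new version
    lost = sorted(set(old) - set(new))     # links only in the old version
    changed = sorted(
        (key, old[key], new[key])
        for key in set(old) & set(new)
        if abs(new[key] - old[key]) >= threshold  # "major" change, hypothetical cut-off
    )
    return gained, lost, changed


# Hypothetical scores, not taken from any real StringDB release.
old = {pair("GENE_A", "GENE_B"): 0.30,
       pair("GENE_A", "GENE_C"): 0.15,
       pair("GENE_A", "GENE_D"): 0.42}
new = {pair("GENE_B", "GENE_A"): 0.31,
       pair("GENE_A", "GENE_C"): 0.40,
       pair("GENE_B", "GENE_C"): 0.22}

gained, lost, changed = compare_scores(old, new)
print("gained:", gained)
print("lost:", lost)
print("changed:", changed)
```

The gained and lost pairs are the ones worth looking up in the two StringDB downloads to check that the evidence for the change is really there.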
As an example of what I see on an Exomiser run for 46 previous candidates that had some PPI evidence:
- 11 are exactly the same between the 9_05 and 10 files
- 25 score about the same but now involve a different interacting gene
- 8 are new hits with the v10 file
- 2 are lost hits with the v10 file
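The same breakdown could be produced automatically. This is only a sketch: it assumes the best PPI hit per candidate has already been extracted from each run as an (interacting gene, score) tuple, or None when there is no hit. The candidate names, scores and the 0.05 tolerance for "about the same" are hypothetical.

```python
from collections import Counter


def classify(old_hit, new_hit, tol=0.05):
    """Label how a candidate's best PPI hit changed between two runs."""
    if old_hit is None and new_hit is None:
        return "no_hit"
    if old_hit is None:
        return "new"
    if new_hit is None:
        return "lost"
    (old_gene, old_score), (new_gene, new_score) = old_hit, new_hit
    if abs(old_score - new_score) > tol:
        return "score_changed"
    return "same" if old_gene == new_gene else "different_interactor"


# Hypothetical results from a run with the old file and a run with the new file.
old_hits = {"cand1": ("GENE_A", 0.50), "cand2": ("GENE_B", 0.50),
            "cand3": None, "cand4": ("GENE_C", 0.40)}
new_hits = {"cand1": ("GENE_A", 0.50), "cand2": ("GENE_D", 0.52),
            "cand3": ("GENE_E", 0.30), "cand4": None}

summary = Counter(classify(old_hits[c], new_hits.get(c)) for c in old_hits)
for label, count in sorted(summary.items()):
    print(label, count)
```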
Manual investigation using the StringDB site suggests most makes sense but something systematic is indeed needed to investigate this level of change!
Did spot this oddity though:
Using the old rw file I get a match to a variant in C4A based on proximity to CFI. With the new file I get no PPI hit at all, despite there being a new direct link at http://version10.string-db.org scoring 0.9.
I was thinking of just automating exactly what you did manually, i.e. load some variants and see if we still see the variant at the same expected rank (±10).
But directly comparing the matrices is also an idea. I'm not sure it makes much sense, though, as the resulting values are highly dependent on the number of edges in the graph. I'll have to have another think...
Yes, it is a tricky one to think about testing completely. I saw one example involving a triangle of interacting genes where the last version had gone in one direction and this one favoured the other. Almost like it was walking randomly ;-) Can't remember if we weight the walk by the string evidence score; I don't think so. I guess in that triangle situation one run may score genes B and C 0.1501 and 0.1499, and the next run the other way round.
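As a rough sketch of that automated check, assuming the expected and observed ranks per test case are already available as plain dictionaries (the case names and ranks below are invented, and how the ranks get extracted from an Exomiser run is left out):

```python
def rank_regressions(expected, observed, tolerance=10):
    """Return cases whose observed rank is missing or off by more than `tolerance`.

    `expected` and `observed` map a case id to the rank of its causative variant.
    """
    failures = {}
    for case, expected_rank in expected.items():
        observed_rank = observed.get(case)  # None if the variant was not returned
        if observed_rank is None or abs(observed_rank - expected_rank) > tolerance:
            failures[case] = (expected_rank, observed_rank)
    return failures


# Hypothetical ranks for three test cases.
expected = {"case1": 1, "case2": 5, "case3": 3}
observed = {"case1": 2, "case2": 40}

failures = rank_regressions(expected, observed)
for case, (exp, obs) in sorted(failures.items()):
    print(f"{case}: expected rank {exp}, observed {obs}")
```

A fixed tolerance is crude, since a drop from rank 1 to rank 11 matters more than one from 50 to 60, so it may be worth also reporting top-1 and top-5 counts.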
No weight of edges.
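To illustrate why an unweighted walk can flip between two neighbours, here is a toy sketch that assumes a random walk with restart on an undirected, unweighted graph. It is not the code that builds the rw file; the restart probability, iteration count and graphs are made up. In a plain triangle seeded at A, genes B and C tie, so any ordering between them is arbitrary, and adding a single extra edge is enough to break the tie.

```python
def random_walk_with_restart(edges, seed, restart=0.3, iterations=200):
    """Visit probabilities for a walk that jumps back to `seed` with probability `restart`."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    nodes = sorted(neighbours)
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        # Each neighbour m passes on an equal share of its probability (no edge weights).
        p = {
            n: (1 - restart) * sum(p[m] / len(neighbours[m]) for m in sorted(neighbours[n]))
            + (restart if n == seed else 0.0)
            for n in nodes
        }
    return p


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
tied = random_walk_with_restart(triangle, seed="A")
print("triangle:   B=%.4f C=%.4f" % (tied["B"], tied["C"]))

# One extra interaction partner for C (hypothetical gene D) breaks the symmetry.
extended = random_walk_with_restart(triangle + [("C", "D")], seed="A")
print("with C-D:   B=%.4f C=%.4f" % (extended["B"], extended["C"]))
```

Because the scores depend only on graph topology here, a handful of edges added or removed between StringDB versions can reorder near-tied neighbours even when the evidence for the original links is unchanged.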
I think we should talk on phone/skype!?
This fell through the cracks, but I have now run the usual GEL validation and verification set with this version of the rw file and it improves performance slightly. Top 5 performance is the same, but an additional case gets promoted to top hit. I suggest we add this file to the new 1811_phenotype db.
TODO - produce the .mv version of the file
This is now all tested and confirmed and will be added to the 1811_phenotype release. TODO - Jules to change the default in code to this for the next release so application.properties does not need to be edited.
Seb to build a String version 10 one