McGill-CSB / PHYLO

a gaming framework to align genomic data
phylo.cs.mcgill.ca/edge

Server should validate user-submitted solutions #119

Closed movermeyer closed 6 years ago

movermeyer commented 6 years ago

For example, the first puzzle of "Other diseases" has a top score of 360, which I believe is not possible.

[Image: possible_false_top_score]

Even if every base were a match, there are only 166 (8 * 20 + 2 * 3) bases in the puzzle. Unless I'm mistaken in my scoring, the maximum possible score is 166 * 2 = 332.
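The bound above can be written as a quick check. This is only a sketch under the same assumption as the comment, namely that no base can contribute more than 2 points; the function names and the gap character are illustrative and not taken from the PHYLO code.

```python
GAP = "-"  # assumed gap character in a puzzle row


def count_bases(rows):
    """Count the non-gap characters across all rows of a puzzle."""
    return sum(1 for row in rows for ch in row if ch != GAP)


def max_possible_score(rows, points_per_base=2):
    """Upper bound on the score, ASSUMING each base earns at most points_per_base."""
    return count_bases(rows) * points_per_base


def is_plausible(score, rows, points_per_base=2):
    """True if a reported score does not exceed the upper bound."""
    return score <= max_possible_score(rows, points_per_base)
```

For any set of rows totalling 166 bases, `max_possible_score` gives 332, so `is_plausible(360, rows)` is false.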

I suspect that this is a result of the server not verifying solutions that users submit.

The server should verify solutions to ensure that:

waldispuhl commented 6 years ago

It is likely because the puzzle is old and we used a different scoring scheme at that time. We'll retire it soon. Old puzzles were used to see how participants perform over a long period of time.

We are actually about to release a brand new database system with new puzzles in the coming days.

movermeyer commented 6 years ago

I ran a bit of an experiment.

Using Fiddler, I intercepted the response from /api/getPuzzlesByC&D/ and modified the puzzle I was given. This was the response after I modified it for the first puzzle of the "Heart And Muscles" category.

{
    "_id": "597badf1cadefb63b8e53a76",
    "sequence_id": 33,
    "submitter": "Akash",
    "sequence": ["-------------------------","-------------------------","-------------------------","-------------------------","-------------------------","-------------------------","-------------------------","-------------------------","-------------------------"],
    "tree": "((((hg19,rheMac2),(mm9,rn4)),(bosTau4,(equCab2,canFam2))),(loxAfr3,dasNov2))",
    "disease_link": "Heart and Muscles",
    "difficulty": 1,
    "category": "Heart and Muscles",
    "motif_seq": ["GCAGGTGTGA", "TGCAGGTGTG", "TTGCAGGTGT", "TGGGGGTGGGGG", "GGTGGGGG", "GGGTGGGG", "TTGCAGGTGTGA", "AGGTGTGA", "TTGGGGGTGGGG"],
    "par_score": [0,0,0,0,0,0,0,0],
    "annotations": ["E2A(bHLH)/proBcell-E2A-ChIP-Seq(GSE21978)", "E2A(bHLH),near_PU.1/Bcell-PU.1-ChIP-Seq(GSE21512)", "HEB(bHLH)/mES-Heb-ChIP-Seq(GSE53233)", "KLF14(Zf)/HEK293-KLF14.GFP-ChIP-Seq(GSE58341)", "Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)", "Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)", "Slug(Zf)/Mesoderm-Snai2-ChIP-Seq(GSE61475)", "Tbx5(T-box)/HL1-Tbx5.biotin-ChIP-Seq(GSE21529)", "Zfp281(Zf)/ES-Zfp281-ChIP-Seq(GSE81042)"],
    "highest_score": 28,
    "location_offset": 214,
    "end_offset": 235,
    "gene": "15",
    "failure_rate": 1
}

Note that I have modified the tree to add more sequences (this puzzle normally has only 3), set the par_score to 0, and made completely empty sequences.

This gave me a game that looked like:

[Image: empty]

When I hit submit, the server happily accepted my solution to the modified puzzle. You can see that it saved the solution in OpenPhylo:

[Image: empty_solution_in_openphylo]

To me, this proves that the server does not validate the solutions submitted by users. A malicious user could use this to steal the top score for each puzzle by filling a puzzle with many sequences that are all "TTTTTTT...." (or whatever the optimal alignment is). Not only that, but these top-scoring alignments would presumably pollute the alignments that the researchers receive, lowering the data quality and causing headaches when they have to debug the data problem.

I tried to avoid causing damage in this experiment. I submitted a score of 0, which ideally would not be passed on as a valuable alignment.
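The kind of server-side check argued for above could look roughly like the following. It is a minimal sketch, not PHYLO's actual implementation: it assumes the server keeps its own copy of each puzzle's rows, that a valid solution only inserts or moves gaps ("-"), and that bases come from the alphabet ACGT. The function names are hypothetical. A real fix would presumably also recompute the score on the server instead of trusting the client's value; that part is left out because the scoring scheme is not described in this thread.

```python
GAP = "-"
ALPHABET = set("ACGT")  # assumed base alphabet


def strip_gaps(row):
    """Remove gap characters so only the bases remain."""
    return row.replace(GAP, "")


def validate_submission(canonical_rows, submitted_rows):
    """Return a list of problems; an empty list means the rows are consistent."""
    if len(submitted_rows) != len(canonical_rows):
        return [f"expected {len(canonical_rows)} rows, got {len(submitted_rows)}"]
    problems = []
    for i, (original, submitted) in enumerate(zip(canonical_rows, submitted_rows)):
        if set(submitted) - ALPHABET - {GAP}:
            problems.append(f"row {i}: unexpected characters")
        elif strip_gaps(submitted) != strip_gaps(original):
            # Bases were added, removed, or changed, not just realigned.
            problems.append(f"row {i}: bases differ from the stored puzzle")
    return problems
```

With a check like this, the nine empty rows submitted in the experiment would be rejected for a three-row puzzle, as would rows padded with extra "T" bases.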

movermeyer commented 6 years ago

I'm looking forward to seeing the new puzzles. Seems like it's an exciting time for PHYLO, with all these changes happening!

waldispuhl commented 6 years ago

Good catch. I'll forward that to our dev. It should indeed be fixed. Thanks!

akashzcoder commented 6 years ago

We have added solution submission validation on the server side. Thank you for notifying us about this!