Open DanLep97 opened 2 years ago
This is the distribution of duplicate labels. There are 187,485 entries in total. "Same" means that all duplicates of an entry would receive the same binder/non-binder classification with a 500 nM cutoff. From this you could conclude that 2% of the data is noisy. However, you could also read it as an indication that about 12% (2/18, different/same) of the data could be noisy: we cannot assume that the non-duplicated entries are always correct, because we have no duplicates to corroborate them.
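For anyone who wants to reproduce this kind of count, here is a minimal sketch of how it could be done. It is not the script used for the numbers above. The function name `summarize_duplicates`, the tuple layout `(allele, peptide, affinity_nM)`, the toy rows, and the choice to treat an affinity at or below 500 nM as a binder are all assumptions made for illustration.

```python
from collections import defaultdict

CUTOFF_NM = 500.0  # binder/non-binder threshold mentioned in this thread


def summarize_duplicates(entries, cutoff_nm=CUTOFF_NM):
    """Count entries as 'unique', 'same', or 'different'.

    entries: iterable of (allele, peptide, affinity_nM) tuples (assumed layout).
    An entry is 'unique' if its (allele, peptide) pair occurs once, 'same' if
    every duplicate of the pair gets the same binder label, and 'different'
    if the duplicates disagree.
    """
    labels = defaultdict(list)
    for allele, peptide, affinity_nm in entries:
        # Assumption: affinity <= cutoff counts as a binder.
        labels[(allele, peptide)].append(affinity_nm <= cutoff_nm)

    counts = {"unique": 0, "same": 0, "different": 0}
    for group in labels.values():
        if len(group) == 1:
            counts["unique"] += 1
        elif len(set(group)) == 1:
            counts["same"] += len(group)
        else:
            counts["different"] += len(group)
    return counts


# Hypothetical toy rows, not real MHCflurry measurements.
toy_entries = [
    ("ALLELE_X", "PEPTIDE_A", 30.0),
    ("ALLELE_X", "PEPTIDE_A", 120.0),   # duplicates agree: both binders
    ("ALLELE_X", "PEPTIDE_B", 80.0),
    ("ALLELE_X", "PEPTIDE_B", 9000.0),  # duplicates disagree
    ("ALLELE_Y", "PEPTIDE_A", 2500.0),  # no duplicate to compare against
]

counts = summarize_duplicates(toy_entries)
print(counts)
```

The sketch counts entries rather than groups, so the three values sum to the number of rows and can be divided by the total to get percentages like the ones discussed above.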
Some entries in the MHCflurry database have different labels for identical peptide-MHC complexes.
From the MHCflurry database, the EAAGIGILTV peptide has different measurements for the same allele:
There are many cases like this one.