Closed ozgunbabur closed 2 years ago
Using an array indexed to the UTF-8 encoding of the letters is easier and faster.
It may be harder to debug the code that way. But it is your choice. If you implement it as a number array, please also implement the necessary converters between number array and string representations, and test them to make sure they are working properly.
See issue of one amino acid and one position.
Please complete the issues #7 and #8 before working on this one.
The method will take the ranked sequences, calculate all the enrichments and deficiencies, and return the results in a special format.
We will use the following coordinate system on a sequence: The center aa (amino acid) is position 0. This number will increase as we go right and it will decrease as we go left. For instance, if we have window 3, then the coordinates of the positions from left to right will be -3, -2, -1, 0, 1, 2, 3.
Here are the inputs and outputs.
Input: ranked sequences Output: A dictionary of dictionaries that gives enrichment and deficiency p-values for each position and each aa
The method first needs to call the method that calculates the counts of each amino acid at each location (see issue #7). Then, for each location and an aa, it should call the p-value calculation method that returns enrichment and deficiency p-values for one location and one aa (see issue #8). Finally, the results should be arranged as a dictionary of dictionaries.
An example output is below.
{-2: {"A": [0.012, 0.87], "F": [0.54, 0.23]}, -1: {"A": [0.23, 0.07], "P": [0.91, 0.003], "S": [0.6, 0.12]}, ...}
Here, we understand that the enrichment p-value of P on position -1 is 0.91 and its deficiency p-value is 0.003.