AlphaGenes / AlphaPeel

AlphaPeel: calling, phasing, and imputing genotype and sequence data in pedigrees
MIT License
2 stars 11 forks source link

Add rec_prob input and output #144

Open XingerTang opened 6 months ago

XingerTang commented 6 months ago

Let AlphaPeel take an input file of the recombination rate for each locus (possibly with sex information) which is stored as a numpy vector, and generate the corresponding output file.

The test should have the following five subtests:

Subtest Name rec_prob rec_length Description Complete
no_input x x Test the default values of the recombination rate without any input
rec_length x Test the input option rec_length
rec_prob x Test the input option rec_prob
both_same Test the case when both inputs coincide, whether AlphaPeel uses the values given by rec_prob (the more precise one)
both_different Test the case when both inputs do not coincide, whether AlphaPeel raises the warning
gregorgorjanc commented 6 months ago

I believe we already have output for recombination rate / probs per locus - or maybe we wanted to add one, but did not yet implement it?

XingerTang commented 6 months ago

@gregorgorjanc Yes, the output is not implemented. I will do that later.

XingerTang commented 6 months ago

@gregorgorjanc What are you thoughts about the update

gregorgorjanc commented 6 months ago

@XingerTang I think the above update of the tasks makes sense!

I am now blanking on the connection between rec_prob and rec_length since I am not fully on board with the implementation. Let's see, rec_length is given in Morgans for the whole chromosome representing the expected number of recombinations per chromosome. Say we have a 2 Morgan (=200 cM) long chromosome split into 4 segments delineated by 3 markers:

In the above case we need 4 values for rec_prob, but is rec_prob then 0.5 for each segment so that we get the expected number of recombinations equal to 2?

How does the current code go from rec_length input to marker-to-marker rec_prob?

XingerTang commented 6 months ago

@gregorgorjanc The current code is calling the recombination rate transmissionRate, and the real transmission rate is calculated as a part of the calculation of the update of segregation but is not explicitly named. The following is the part that calculates the transmissionRate:

def setupTransmission(length, peelingInfo):
    if peelingInfo.positions is None:
        localMap = np.linspace(0, 1, num=peelingInfo.nLoci, dtype=np.float32)
    else:
        localMap = (
            peelingInfo.positions / peelingInfo.positions[-1]
        )  # This should be sorted. Need to add in code to check.
    for i in range(peelingInfo.nLoci - 1):
        distance = localMap[i + 1] - localMap[i]
        distance = distance * length
        peelingInfo.transmissionRate[i] = distance

Here, length refers to the recombination length input, transmissionRate is the recombination rate. So we can provide the markers positions by using the peelingInfo.positions, even though it is currently not being used and has no input option associated with it. And with this code, the transmissionRate would indeed becomes [0.5, 0.5, 0.5, 0.5] which sums up to 2.

Maybe we need to rename some of the variables and add an input option for PeelingInfo.positions before proceeding to the next step?

gregorgorjanc commented 6 months ago

Transmission rate is indeed an odd name:(

Aren't positions read from the map file (if provided)?

XingerTang commented 6 months ago

@gregorgorjanc Currently, the map file would be read only when we are doing hybrid peeling and a segregation file is provided as an input, then the information of the markers would be extracted while reading the map file, and passed as function arguments instead of being assigned to the PeelingInfo.positions. So we cannot use it elsewhere.

So, maybe we can first make the map file input option always available and then read the information into the PeelingInfo.positions to use the data at different places. Also, we can rename some variables such as transmissionRate to improve the readability of the code.

gregorgorjanc commented 6 months ago

OK, then please open an issue to provide map file for non-hybrid mode too. Thanks!

Do you see any reason why transmissionRate term was used instead of rec...Rate?

XingerTang commented 6 months ago

@gregorgorjanc Sure, I will open the issue.

I read the old code on the recomb branch, and there were both transmissionRate and recomb as the attributes of PeelingInfo. From where the transmissionRate is being used in the calculation of segregation (CollapsePointSeg in Peeling.py) I am sure it is just the recombination rate, whilerecomb is used to store the value of the estimated recombination rate, but I didn't find anywhere in the code that recomb is being used as a part of the calculation of segregation or any value relevant to the peeling. Since the old code is broken so I think maybe it is just an inconsistency in the naming.

gregorgorjanc commented 6 months ago

Great that you are studying the code so much. Will email AW!