WTF??!! LocIdx starts at 1 on every locus, but that seems to break create_integer_genotype_matrix()

eriqande commented 3 months ago

Eric here...

I am running some stuff on the Tobique salmon using a ckmr object that includes the chromosomes (it is 12 microsats, with two of them on the same chromosome). When I do that, everything seems to work fine until I start going about doing the pairwise comparisons (actually...I am getting failures looking for close matching samples).

The issue seems to be here:

https://github.com/eriqande/CKMRsim/blob/b54e32473e60e9fbeb92994e0d168b48379da5f6/R/create_integer_genotype_matrix.R#L49

The problem is that LocIdx is reset for each chromosome, so this ends up not being a matrix with unique Loci in it.

For now, the workaround is to create a ckmr object with everything on the same locus, and then use that when doing the pairwise comps. But that is a huge PITA.

I think this can be fixed like this...

I think that I should be able to just modify reindex_markers() so that it gives each locus a unique serial index throughout the whole genome, so that the numbers don't start up at 1 again, on each new chromosome. The only thing that would change, I think, would be the names of the loci that are used internally (i.e., the chrom.Locus.pos nomenclature. But I don't think that this would break anything. In fact I don't think LocIdx plays into that at all, anyway.

I need to implement this and test it and make sure it is working.

arianacerreta commented 2 months ago

Hi Eric, I found the same issue with some SNP microhap data that I have been working with. It all worked great until I tried to identify close matching samples. I'll let you know what my workaround ends up being if I figure out one.

eriqande commented 2 months ago

Hi ariana,

thanks for pinging me about this. Let me know if no simple workarounds work for you and I can fast-track the fix on this for you next week.

Cheers,

eric

On Wed, Sep 4, 2024 at 1:17 PM arianacerreta @.***> wrote:

Hi Eric, I found the same issue with some SNP microhap data that I have been working with. It all worked great until I tried to identify close matching samples. I'll let you know what my workaround ends up being if I figure out one.

— Reply to this email directly, view it on GitHub https://github.com/eriqande/CKMRsim/issues/9#issuecomment-2329797852, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAPQ4JWL2T7Z2QLVVFTKDEDZU5MDLAVCNFSM6AAAAABLXFXQMWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRZG44TOOBVGI . You are receiving this because you authored the thread.Message ID: @.***>

arianacerreta commented 2 months ago

Hi Eric,

I was following along your tutorial. Since you mentioned that it might have been the LocIdx that was the problem, I modified the reindex_markers function:

reindex_markers<- function(M){
  M %>% dplyr::ungroup() %>% dplyr::arrange(Chrom, Pos, desc(Freq)) %>% 
    #dplyr::group_by(Chrom) %>% 
    dplyr::mutate(locidx = as.integer(factor(Locus, levels = unique(Locus)))) %>% 
    dplyr::group_by(Chrom, Locus) %>% 
    dplyr::mutate(alleidx = as.integer(factor(Allele, levels = unique(Allele))), newfreq = Freq/sum(Freq)) %>% 
    dplyr::select(-AlleIdx, -LocIdx, -Freq) %>% 
    rename(Freq = newfreq, AlleIdx = alleidx, LocIdx = locidx) %>% 
    dplyr::ungroup()
}

That seemed to give unique identifiers for each unique locus, even when I had loci on the same chromosomes.

The find_close_matching_genotypes still threw an error after this, so I double checked the function create_integer_genotype_matrix and it ran with no problem separately. So, I decided instead of calling create_integer_genotype_matrix within find_close_matching_genotypes, I would save the integer matrix separately. Then, I created the matchers object using the source code for the find_close_matching_genotypes function. My work flow was as follows:

mat_GT<-create_integer_genotype_matrix(long_geno_sub,afreqs_ready)
max_mismatch<-5
matchers <- pairwise_geno_id(mat_GT, max_miss = max_mismatch) %>% 
  dplyr::arrange(num_mismatch) %>% dplyr::mutate(indiv_1 = rownames(S)[ind1], 
                                                 indiv_2 = rownames(S)[ind2]) %>% dplyr::select(indiv_1, 
                                                                                                indiv_2, dplyr::everything())

That is as far as I have gotten so far, but I'm going to keep working through the tutorials you have online with this data. I will let you know if I find anything else.

Ariana

eriqande commented 2 months ago

Thanks for the update Ariana. Also, for a better tutorial, that also discusses some of the things that can be done about physical linkage, please check out: https://eriqande.github.io/tws-ckmr-2022/kin-finding-lab.html

Cheers,

eric

eriqande / CKMRsim

WTF??!! LocIdx starts at 1 on every locus, but that seems to break create_integer_genotype_matrix() #9

I think this can be fixed like this...