JaneliaSciComp / msg

Multiplexed Shotgun Genotyping
http://genomics.princeton.edu/AndolfattoLab/MSG.html
11 stars 12 forks source link

Viterbi HMM #35

Open shackett opened 11 years ago

shackett commented 11 years ago

I've made changes to fit-hmm.R and hmmlib.R:

added viterbi paramters passing:

added the ability to run to either or both HMMs changes plotting to make a haplotype plot for each HMM and to label breakpoints and mismatch frequency for each HMM

This changes the output files slightly: 2 plots stacked if both HMMs are selected breakpoints and mismatch frequency (within homozygous regions) are outputted for each HMM and labeled accordingly.

gregpinero commented 11 years ago

Hi Sean,

I'm finally digging into this. So do these changes still give the user the ability to run the old HMM type as before? It looks like fit-hmm is conditional on the hmm type, but I can't tell in hmmlib.

Also what new parameters do I need to pass in? (Sorry if you already mentioned that)

I'm excited to get this in.

Thanks,

Greg

gregpinero commented 11 years ago

Actually I think I figured it out. Is this right for fit-hmm.R:

H - HMMtype (can be ML, or viterbi (or both , separated)
If viterbi
    j - HMM_seqpair
    l - HMM_diffthresh
    e - HMM_decay

Would you mind providing me with short descriptions of what each parameter does (1-3 sentences)? And what some sensible defaults for them would be. That way I can update the documentation too.

Thanks,

Greg

shackett commented 11 years ago

HMM_decay: default should be 0.8-0.9. This is the transition probability in the "memory states" of the viterbi transition matrix (used in the function TransMat nested within HMMeval). Having a low value of this parameter will increase the number of markers that will need be coincident with the new haplotype before a transition is called. A higher value, will mean that drawing a sequence of gentoypes that matches a transition better than the current haplotype is more likely to result in a possibly erroneous transition.

Apparently, I found a workaround that didn't involve HMM_diffthresh and HMM_seqpair when I wrote the haplotype consensus functions last year. These variables should no longer be needed. I think the only place I included them in the pushed version of fit-hmm.R

if("viterbi" %in% HMMtype){
    HMM_seqpair <- as.numeric(opts$j)
    HMM_diffthresh <- as.numeric(opts$l)
    HMM_decay <- as.numeric(opts$e)
    }   

If you have any other questions or sections of my code are unclear and problematic, feel free to email me, Best, Sean

JaneliaSciComp commented 11 years ago

Thanks Sean,  so you think I shouldn't use the other options at all?

Is there anything you think I could add to the documentation about when to use ML vs viterbi? Or is it too complex for a short description?

-Greg

shackett commented 11 years ago

At this time the HMM_diffthresh and HMM_decay can safely be removed. I goofed on including them in the most recent implementation in the first place.

The issues of whether to use the ML or viterbi method are somewhat complex. The viterbi's strength when implemented with the memory states that I used is that, it can call a breakpoint when we first started accumulating evidence that it had occurred. To control inappropriate recombination events, the HMM_decay parameter was added to extend the length of sequence informing a new haplotype before a change is made. This procedure minimizes drift, but shouldn't effect where a real recombination event is called (unless it is very close to the telomere).

One problem with this method is the absence of appropriate confidence intervals in terms of the probability that the called recombination interval captures the true recombination event (previous simulations indicated that these intervals only captured true events about half the time under low-depth).

The ML method has one unappetizing feature: the rfac parameter which is used to shrink the number of expected recombination events. This is similar to the HMM_decay parameter is theory, but seems to be more sensitive and ad hoc, and effects the location of recombination event calls.

In terms of when one method would be superior to the others, the methods should probably be compared under different coverage and depths to see if there are practical differences. I've done some simulations to this effect, which showed that the methods were generally comparable, but things have changed on both models since then.

gregpinero commented 11 years ago

Hi Sean,

When I run only ML I'm getting an error from this line: data <- cbind(data, Pr.z.given.y_ML, Pr.z.given.y_vit)

It says " object 'Pr.z.given.y_vit' not found".

Maybe that line just needs to come out since the previos two lines are doing the same thing (I think).

shackett commented 11 years ago

you're correct, the if statements before this chunk should be all you need.

data <- cbind(data, Pr.z.given.y_ML, Pr.z.given.y_vit)
gregpinero commented 11 years ago

But if I take out

data <- cbind(data, Pr.z.given.y_ML, Pr.z.given.y_vit)

Will it still work when the user specifies both?

gregpinero commented 11 years ago

I tried taking it out and running it. I'm getting this error (I have it just using ML so far, so I'm expecting to get the same results as before?)

Rscript msg/fit-hmm.R -d hmm_data -i indivE2_CAGCCG -s female -o hmm_fit -p 0.01 -q 0.01 -r .000001 -c all -x X -y all -z 0.25,0.5,0.25 -t 1 -H ML -e .8
R.methodsS3 v1.2.0 (2010-03-13) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.7.3 (2010-06-04) successfully loaded. See ?R.oo for help.
Error in data.frame(..., check.names = FALSE) : 
  arguments imply differing number of rows: 0, 1
Calls: breakpoint.width -> cbind -> cbind -> data.frame

I can't find a line in the code that looks like "data.frame(..., check.names = False).

shackett commented 11 years ago

Hi Greg,

I tried to implement your solution so that I could reproduce the problem but I'm having trouble with line 211

 prob = Pr.y.given.z(y=y[,read,drop=F], p=p12, n=N, eps=y[,eps,drop=F], ploidy=ploidy, C=TRUE, dir=dirname(dollar0), chrom=contig, id=indiv)

The error I get is

sh: msg/hmmprobs: No such file or directory
Error in `colnames<-`(`*tmp*`, value = c("Pr(y| par1/par1 )", "Pr(y| par1/par2 )",  : 
  length of 'dimnames' [2] not equal to array extent

I know what this line is trying to do, but I'm not familiar with the coding style involving pipe commands... so its hard for me to fix the problem directly. When I modified this code recently, I wrote a workaround code for this step that created junk data of the appropriate format. Cloning msg again didn't seem to help either.

Putting aside that problem, I may know what the problem is:

Since you've gotten it to run as far as calling the breakpoint width function, the peculiarities of running ML versus viterbi were not a problem. At this point, the viterbi and ML "Pr(", ancestries, "|y)" matrices are equivalent, I may have just messed up calling the appropriate method for determining breakpoints and plotting

I think that the matrix y on line 314 is probably NULL because its names don't match those from dataa, while it should be a matrix of width 3.

There are 2 quick fixes I would try:

line 223:
colnames(Pr.z.given.y_ML) <- paste("ML Pr(", ancestries, "|y)")
replaced by:
colnames(Pr.z.given.y_ML) <- paste("ML Pr(", ancestries, "|y)", sep = "")

line 254:
colnames(Pr.z.given.y_vit) <- paste("viterbi Pr(", ancestries, "|y)")
replaced by:
colnames(Pr.z.given.y_vit) <- paste("viterbi Pr(", ancestries, "|y)", sep = "")

line 314:
y <- contig_data[,paste(hmmRunning, "Pr(", ancestries, "|y)"
replace by:
y <- contig_data[,colnames(contig_data) %in% paste(hmmRunning, " Pr(", ancestries, "|y)", sep = "")]

It that doesn't help, can you save the objects that exist before the error using

save(list = ls(all=TRUE), file = "tmp.Rdata")

and email them to me. If you see a possible reason for my previously explained error, I could fix that and try again as well.

Best, Sean

JaneliaSciComp commented 11 years ago

Hi Sean,

I tried those 3 changes but I'm still getting the same error message.

I've attached tmp.Rdata.

I put the save in this section of hmmlib.R (on the line before the error):

breakpoint.width <- function(x, y1, y2, indiv=NA, contig=NA, conf1=.05 ,conf2=.95, hmmtype) { width1 <- findBreak(indiv,contig,y1,y1>=conf1,y1<=conf2,conf1,conf2,'homozygous_par1') width2 <- findBreak(indiv,contig,y2,y2>=conf1,y2<=conf2,conf1,conf2,'homozygous_par2')         save(list = ls(all=TRUE), file = "tmp.Rdata") list(blocks=cbind(rbind(width1[["blocks"]],width2[["blocks"]]), hmmtype),bps=cbind(rbind(width1[["bps"]],width2[["bps"]]), hmmtype)) }

-Greg

From: Sean Hackett notifications@github.comReply-To: JaneliaSciComp/msg reply@reply.github.comDate: Saturday, October 27, 2012 1:37 AMTo: JaneliaSciComp/msg msg@noreply.github.comCc: Pinero Gregory pinerog@janelia.hhmi.orgSubject: Re: [msg] Viterbi HMM (#35)

Hi Greg, I tried to implement your solution so that I could reproduce the problem but I'm having trouble with line 211

prob = Pr.y.given.z(y=y[,read,drop=F], p=p12, n=N, eps=y[,eps,drop=F], ploidy=ploidy, C=TRUE, dir=dirname(dollar0), chrom=contig, id=indiv)

The error I get is

sh: msg/hmmprobs: No such file or directory Error in colnames<-(*tmp*, value = c("Pr(y| par1/par1 )", "Pr(y| par1/par2 )", : length of 'dimnames' [2] not equal to array extent

I know what this line is trying to do, but I'm not familiar with the coding style involving pipe commands... so its hard for me to fix the problem directly. When I modified this code recently, I wrote a workaround code for this step that created junk data of the appropriate format. Cloning msg again didn't seem to help either. Putting aside that problem, I may know what the problem is: Since you've gotten it to run as far as calling the breakpoint width function, the peculiarities of running ML versus viterbi were not a problem. At this point, the viterbi and ML "Pr(", ancestries, "|y)" matrices are equivalent, I may have just messed up calling the appropriate method for determining breakpoints and plotting I think that the matrix y on line 314 is probably NULL because its names don't match those from dataa, while it should be a matrix of width 3. There are 2 quick fixes I would try:

line 223: colnames(Pr.z.given.y_ML) <- paste("ML Pr(", ancestries, "|y)") replaced by: colnames(Pr.z.given.y_ML) <- paste("ML Pr(", ancestries, "|y)", sep = "")

line 254: colnames(Pr.z.given.y_vit) <- paste("viterbi Pr(", ancestries, "|y)") replaced by: colnames(Pr.z.given.y_vit) <- paste("viterbi Pr(", ancestries, "|y)", sep = "")

line 314: y <- contig_data[,paste(hmmRunning, "Pr(", ancestries, "|y)" replace by: y <- contig_data[,colnames(contig_data) %in% paste(hmmRunning, " Pr(", ancestries, "|y)", sep = "")]

It that doesn't help, can you save the objects that exist before the error using

save(list = ls(all=TRUE), file = "tmp.Rdata")

and email them to me. If you see a possible reason for my previously explained error, I could fix that and try again as well. Best, Sean — Reply to this email directly or view it on GitHub.

shackett commented 11 years ago

Hi Greg,

I solved the problem. The example that we were using didn't have any breakpoints called by the ML method. This resulted in the data.frames within byBlocks being empty. When I cbinded on the type of HMM that called the breakpoint, the error popped. I fixed it with a little loop that checks the dimensions of the data.frame. I checked the code with the no-breakpoint example you gave me as well as when it does call a breakpoint, both worked.

breakpoint.width <- function(x, y1, y2, indiv=NA, contig=NA, conf1=.05 ,conf2=.95, hmmtype) {
    width1 <- findBreak(indiv,contig,y1,y1>=conf1,y1<=conf2,conf1,conf2,'homozygous_par1')
    width2 <- findBreak(indiv,contig,y2,y2>=conf1,y2<=conf2,conf1,conf2,'homozygous_par2')

    tmp <- list()
    if(length(width1[["blocks"]]) + length(width2[["blocks"]]) != 0){
        tmp[["blocks"]] <- cbind(rbind(width1[["blocks"]],width2[["blocks"]]), hmmtype)
        }else{
            tmp[["blocks"]] <- rbind(width1[["blocks"]],width2[["blocks"]])
            }
    if(length(width1[["bps"]]) + length(width2[["bps"]]) != 0){
        tmp[["bps"]] <- cbind(rbind(width1[["bps"]],width2[["bps"]]), hmmtype)
        }else{
            tmp[["bps"]] <- rbind(width1[["bps"]],width2[["bps"]])
            }
    tmp
}   

I also made a slight change to line 318 of fit.hmm.R

if (nrow(byBlocks[["bps"]])>0) { breakpoints <- rbind(breakpoints,byBlocks[["bps"]]); }

from

is.null(byBlocks[["bps"]]) == F

The latter will always evaluate as false because an empty data.frame is not null. This makes no practical difference because rbinding an empty data.frame to the breakpoints doesn't change anything. You can change it if you want.

If you have any other problems running things, capturing the relevant objects in an Rdata file worked well.

-Sean

gregpinero commented 11 years ago

Thanks Sean, that's great. I guess it's easiest if I just copy your changes above and paste them in?

shackett commented 11 years ago

That should work fine, just replace the whole breakpoint.width function and copy that line over

gregpinero commented 11 years ago

Hi Sean,

The good news is that it's now running on the individual that was causing the error before and the generated PDF files appear to be identical! So that's good progress.

However I'm seeing this error:

Rscript msg/fit-hmm.R -d hmm_data -i indivH9_GTCGCG -s male -o hmm_fit -p 0.01 -q 0.01 -r .000001 -c all -x X -y all -z 0.25,0.5,0.25 -t 1 -H ML -e .8 R.methodsS3 v1.2.0 (2010-03-13) successfully loaded. See ?R.methodsS3 for help. R.oo v1.7.3 (2010-06-04) successfully loaded. See ?R.oo for help. Error in FUN(X[[1L]], ...) : subscript out of bounds Calls: get.ancestry.probs ... lapply -> FUN -> sort -> unique -> unlist -> lapply Execution halted

Let me know what information would be useful?

(All of the other individuals are running without errors which is also good)

-Greg

shackett commented 11 years ago

This bug is really confusing.

It looks like this run of fit-hmm.R is failing on get.ancestry.probs which is a function within SummaryPlots.R. It looks to me like this function is run through msgCluster.pl so I'm not sure where in the workflow it is being called. Do you have a line number where the error is coming from.

Have you tried running the "viterbi" and "ML,viterbi" options on the same data to see what the output looks like?

Peter gave me the impression that I should mimic the ML method with the viterbi method in fit-hmm.R in order to generate figures, breakpoints and matchMismatch files that were the same format as the ML ones. Because I had to add on another column specifying the model, it might be failing and subsequent steps and I have no experience with those. If that's the case then we either need to parallelize all subsequent steps or run either the ML or viterbi method and remove the hmmRunning factor.

-Sean

gregpinero commented 11 years ago

I guess this is happening in SummaryPlots.R after fit-hmm.R is run.

A few things I noticed: All of the *-breakpoints.csv (e.g., indivH9_GTCGCG-breakpoints.csv) are missing in the hmm_fit directory. (I'm not sure which program creates those.

Also the *-matchMismatch.csv files (e.g., hmm_fit/indivA12_AATAAG/indivA12_AATAAG-matchMismatch.csv) have an extra column "hmmtype". (which you mentioned)

I'll send you the data I'm using so you're able to do a full run. Maybe that will help? I'm not sure about the best strategy yet ...