Dowell-Lab / Tfit

A tool for modeling and annotating RNA polymerase II behavior

Need to determine behavior of "training file" -tss flag in bidir_old module #9

Closed magruca closed 5 years ago

magruca commented 5 years ago

In GitLab by @magr0763 on Sep 28, 2018, 08:45

After the most recent update and pull on 09/24/18, it appears that you are now able to "train" the model in the bidir_old module, or rather calculate what I assume are the same parameters you would specify in the config file in the model module. However, a few fixes would help a lot here:

1) It would be useful if the training data provided did not have to be limited to one chromosome, which I believe is the current behavior. If we cannot get this working, we need to clearly document that limitation.

2) As we plan to annotate chr1, it would be useful if we could print/output the model parameters that are presumably calculated and stored somewhere (in a temp file...?) when we run with the -mle argument. If we could do so, we could then a) perform a rarefaction to determine how much training data is needed to get "accurate calls", b) better interrogate how the model module and algorithm behave in terms of parameter changes, and c) determine whether the config_file.txt included with the model contains good parameter estimates for calling bidirectionals.

3) If the behavior is as I understand it, we need to clarify the -tss parameter and its behavior, and provide clearer default behavior when these annotations are not included.
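As a sketch of the rarefaction idea in point 2a: fit the model on increasingly large random subsets of the training annotations and watch where the parameter estimates stabilize. This is a hypothetical helper, not part of Tfit; `fit` stands in for whatever produces a parameter estimate from a set of regions.

```python
import random

def rarefaction_curve(regions, fit, sizes, seed=0):
    """Fit on random subsets of increasing size to see how much
    training data is needed for stable estimates (illustrative only).

    regions: list of training annotations
    fit:     callable mapping a list of regions -> parameter estimate
    sizes:   subset sizes to evaluate
    """
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        subset = rng.sample(regions, n)
        curve.append((n, fit(subset)))
    return curve
```

Plotting the estimate against subset size would show where additional annotations stop changing the parameters.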

magruca commented 5 years ago

In GitLab by @magr0763 on Sep 28, 2018, 08:48

changed the description

magruca commented 5 years ago

In GitLab by @migo0123 on Oct 2, 2018, 14:38

Right now, it appears that the functions used to compute a preliminary bidirectional model are designed for use with a single segment. In order to work around that, it should be possible to do one of the following:

  1. Compute across all annotated segments and then compute the average of the model parameters estimated.
  2. Perform the same steps as option 1, but weight the average by the number of annotations in each segment relative to the chromosome size. (This assumes that transcription start site density is roughly evenly distributed across the genome.)
  3. Compute across all annotated segments and then choose the best scoring model of all of those tested. (this would be the easiest option)
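Option 2 above (annotation-weighted averaging) could look roughly like this; `fits` and the parameter names are illustrative stand-ins, not Tfit's actual data structures.

```python
def weighted_average_params(fits):
    """Average per-segment parameter estimates, weighting each
    segment by its annotation count (sketch of option 2 above).

    fits: list of (params_dict, n_annotations) pairs, one per segment
    """
    total = sum(n for _, n in fits)
    keys = fits[0][0].keys()
    return {k: sum(p[k] * n for p, n in fits) / total for k in keys}
```

A segment with three times as many annotations pulls the average three times as hard, so sparse segments contribute less noise to the final estimate.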
magruca commented 5 years ago

In GitLab by @magr0763 on Oct 3, 2018, 08:42

After re-reading my initial comment I think it may have been a little bit confusing -- I'll try to clarify and be more explicit to make sure we're on the same page, although this is good information, too.

If you run bidir_old with the following flags (before the most recent parameter change):

$TFIT bidir_old \
    -ij ${BEDPATH}/${ROOTNAME}.tri.BedGraph \
    -N ${ROOTNAME}.bidir_old \
    -mle 1 \
    -fzr 1 \
    -o ${PROJECT}/tfit \
    -log_out ${PROJECT}/tfit_test/logs/${ROOTNAME}.log \
    -tss $TSSFILE

By specifying -mle 1, you can run the model directly on these preliminary hits to obtain final parameter estimates over each preliminary segment called/annotated in the bidir_old module. Alternatively, you can leave off the -mle flag and run the model module separately as a new step with the following flags:

    -ij ${BEDPATH}${ROOTNAME}.trim.BedGraph \
    -k $TFITOUT/${ROOTNAME}.bidir_old-1_prelim_bidir_hits.bed \
    -N ${ROOTNAME}.tfit_mod_old9 \
    -log_out ${PROJECT}/tfit/logs/${ROOTNAME}.tsv \
    -o ${PROJECT}/tfit/model/ \
    -config $CONFIGFILE

Both of these will produce final model parameter estimates for each preliminary bidirectional hit to generate the final bidirectional predictions. It seems to me that the -tss file in bidir_old would then be used to calculate the same parameters (lambda, sigma, pi, w, mink, maxk, maybe alpha/beta?, etc.) as seen in the -config file in the model module. That said, the documentation only specifies the config file for use with the bidir module; however, changing any of these parameters does affect the model module if the file is used there too, despite the lack of documentation for it.

My thought was that the -tss file would be provided in the bidir_old module to generate estimates for all of the EM parameters listed in the config. After re-reading the documentation, this is even less clear to me. Running bidir_old with the -mle parameter under two different settings for -tss, 1) REFSEQ and 2) TRAINING_FILE (hand annotations of bidirectionals), appears to impact both the prelim and model estimates.

All of this said, the prelim bidir_old behavior in generating preliminary bidirectional hits is more or less fine in my opinion. It calls large regions, which is fine as long as they are fed into the model in either fashion. My real question is: if the -tss file is used to generate the EM parameters used in the model (which it appears to be), can we somehow print these estimates to address the questions listed above and see how the -tss file affects the estimates for mink, maxk, elon, alpha, etc.?

This is probably not a straightforward process, and I imagine it will dig up quite a few potential issues, which may not get addressed in this round of patching things up.

magruca commented 5 years ago

In GitLab by @dowellde on Oct 7, 2018, 15:10

Technically there is no training in Tfit. We should think of Tfit's behavior as two steps: (1) decide on interesting regions; (2) fit the model in those regions.

Of course, it isn't really that simple. The model fits are strongly influenced by the initial conditions of the EM. So Joey experimented with different ways of deciding what were good starting conditions.

For the genome research paper -- the template matching pre-filter (bidir module) essentially uses a "generic" set of parameters to scan the genome for things that "look like" a bidirectional. We define "looks like" by the average/typical parameters obtained by fitting to a set of regions; in the paper this was essentially a set of annotated genes.

So I'd expect (though it may not be true) the TSS file to alter the bidir module's output. Of course, that may have been the original intent, and it either went unimplemented or was hard-coded around. In other words, I don't know what the -tss file is actually doing (I need to dig through the code to figure this out) ...

I can say that the code doesn't seem to actually READ the config file anywhere (though I haven't done a fully exhaustive search for the reader function) ... suggesting the config file is simply a red herring.

magruca commented 5 years ago

In GitLab by @magr0763 on Oct 8, 2018, 08:07

Michael and I looked into this a little more on Thursday and also did not find a straightforward answer.

It seemed that certain values could be changed based on the -tss file provided, but others were hard-coded. Looking at the model estimates when using the TSS file versus the alternative "training/annotation file" (I understand this isn't entirely correct terminology), it appears only sigma and beta (foot_print) change. The pre-filter settings do not appear to change, at least according to the printed output.

I'm not sure if bidir_old reads the config file anywhere, but I know that if you input the config file to the model module, changing those parameters does change your output. This is also bizarre because it appears that lambda, sigma, and (I believe) alpha_0 are the only values that are not hard-coded. I'm trying to remember the specifics; maybe Michael @migo0123 can speak better to that, as I think he continued to look into it. That said, when I was working with the Sasse/Geber data, I ran about 100 different iterations, and changing any/all MLE parameters changed the output -- most significantly lambda, sigma, pi, beta (foot_print), alpha[x] and beta[x] under Map Estimation Parameters, min/max k, and ct.

I'm trying to remember why I was specifying the config file to the model module in the first place -- I believe either someone told me to or I got an error when I didn't. Either way, as I expected, this is not a straightforward fix. If we determine that most of these parameters are hard-coded, tend to fall back to defaults, and/or do not significantly change the output, I propose removing the config file as an option, at least for the time being. Even after all of those runs, I do not fully understand what changing the arguments actually does to the behavior of the model, so I believe removing it would improve general usage.

magruca commented 5 years ago

In GitLab by @dowellde on Oct 9, 2018, 09:31

If changing the config file changed behaviors, then it is being read somewhere. Will need to take a deeper dive into the code to get to the bottom of this one.

magruca commented 5 years ago

In GitLab by @migo0123 on Oct 9, 2018, 13:37

TSS files, as they stand in the code right now, behave as follows:

Common steps:

  1. The TSS file is read and stored as a vector of segments. -> load_intervals_of_interest_pwrapper()
  2. The chromosome for which the TSS file has the most annotations is determined. -> determine_greatest_tss_coverage()
  3. The vector of segments is then transformed into a chromosome-indexed map of vectors of segments. -> convert_segment_vector
  4. Each TSS segment is then filled with corresponding input bedgraph data. -> insert_bedgraph_to_segment_joint
  5. Each TSS segment is internally binned/normalized. NOTE: this binning and normalization is only performed WITHIN that segment, not across segments. -> BIN()
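The within-segment binning/normalization in step 5 might be sketched like this (an illustrative approximation, not the actual BIN() implementation). The key point is that normalization uses only that segment's own peak, so values are not comparable across segments.

```python
def bin_segment(coverage, bin_size):
    """Bin raw per-base coverage into fixed-width bins, then scale
    by this segment's own maximum (sketch only; not Tfit's BIN()).

    coverage: list of per-base read counts within ONE segment
    bin_size: number of bases per bin
    """
    bins = []
    for start in range(0, len(coverage), bin_size):
        window = coverage[start:start + bin_size]
        bins.append(sum(window) / len(window))
    peak = max(bins) or 1.0  # avoid division by zero on empty segments
    return [b / peak for b in bins]
```

Because each segment is scaled independently, a weakly transcribed segment and a strongly transcribed one end up on the same 0-1 scale, which matters when their fits are later combined.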

Now that there exists a clean input dataset:

  1. Compute an "average model" using the TSS intervals. This effectively involves using the model to fit a subset of TSS intervals (this was the chromosome selection that was originally hard-coded to "chrX"). During model fitting, 5 different initial models are generated, and the best scoring of the set is returned.

I think the average model computation has a number of problems as implemented here. A new classifier and segment are created for each of these iterations, and no real fitting is done, since the average model computation function allows only one iteration of the EM.

  2. Use the parameters provided or computed to find bidirectional-like regions in the input data.

  3. Write these regions to a file.

  4. (optional) Call the model module on the regions found.
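The "5 different initial models, keep the best scoring" selection described above can be sketched generically; `fit_once` is a hypothetical stand-in for one EM fit from a random initialization, returning a score where higher is better.

```python
import random

def fit_best_of_n(fit_once, data, n_inits=5, seed=0):
    """Run the fit from n_inits random starting points and keep the
    best-scoring model (sketch of the best-of-5 selection above).

    fit_once: callable (data, rng) -> (params, score)
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_inits):
        params, score = fit_once(data, rng)
        if best is None or score > best[1]:
            best = (params, score)
    return best
```

Restarting from several initializations is a standard hedge against EM's sensitivity to starting conditions, which is exactly the issue raised earlier in this thread.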

magruca commented 5 years ago

In GitLab by @magr0763 on Nov 2, 2018, 14:46

So the behavior is still a little unclear here, but it does affect some of the parameters in the model step. It has no impact on the bidir output unless -mle is specified (in which case it impacts the actual model output), so I have removed this from the bidir documentation, as we will be separating the two functions moving forward. That said, I added support for including it as an optional argument in the model module itself, which re-estimates the model parameters based on the input file. This does seem to help (slightly -- a lot of parameters appear to still be hard-coded), and I think it is a logical addition: rather than having the user try to specify the nineteen-odd parameters in the config file, they can optionally provide "training-type regions" to fine-tune what is already essentially a functioning model.

With this, I am going to close the issue for now, as further refinement will happen at a later date and it at least appears to function in a logical manner. The arguments in model for this are -bd/--bidirs.

magruca commented 5 years ago

In GitLab by @magr0763 on Nov 2, 2018, 14:46

closed