Natural occurrence estimation and correction

cerebis / qc3C

Reference-free quality assessment for Hi-C sequencing data

GNU Affero General Public License v3.0

13 stars 1 forks source link

Natural occurrence estimation and correction #35

Closed cerebis closed 4 years ago

cerebis commented 4 years ago

https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L363-L369

Was this meant to be equally likely, rather than zero?

When you eventually use this dict, the probs are all zero and numpy.random.choice will throw an exception of the list of probabilities does not sum to 1.

https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L373-L379

What I think you're trying to do is create a list of all possible k-mers, then draw a random set from that with uniform probability?

Afterwards, each of these starting k-mers is expanded on using the trans_prob dict?

cerebis commented 4 years ago

Note: I am working on the code, so no need to make commits, just explain here if you can.

cerebis commented 4 years ago

Also, I think the np.random.choice should be done with replacement?

koadman commented 4 years ago

kmer_probs is a poor name for that variable... it never gets used as a probability distribution with np.random.choice or anywhere else, instead it's just meant to create a list of all kmers that have the same length as the junction seq. At the very least it should be renamed to something like kmer_list or alternatively this section could be rewritten in a way that's more pythonic. As you know my python is still n00b level.

and yes, np.random.choice should be done with replacement. Isn't that the default?

cerebis commented 4 years ago

https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L324-L343

In this method, the loop that normalises counts (lines 342-343) isn't used later. Is this just lingering code, or was it intended to be used in the normalisation of transition probabilities?

koadman commented 4 years ago

yes, it's meant to normalize the transition probs. trans_prob gets returned by the function and then used on line 385: https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L385 to select the next nucleotide in the simulated sequence, based on the current k-mer, but certainly possible that there's something subtle that causes it to not work as intended

koadman commented 4 years ago

in fact I see a bug already on that line of code -- the index into sim_seq should instead be a range from i-k to i-k-1

cerebis commented 4 years ago

Yes, I encountered that bug earlier today, as I have been reworking and testing this new code.

There's definitely some performance issues to sort out, as currently the step of estimation is much slower than I would have expected. I haven't yet done any profiling.

In the current version, I have split creating trans_prob from the reads and estimating junction probability. Largely, because I want to test them independently.

RIght now, I find that the estimated probs for an 8-mer junction are lower than I would have thought. There is possibly a problem with the code, as you can see from my questions that there were some issues. I may have intuited the wrong answer.

I'll check it in shortly, getting called away.

cerebis commented 4 years ago

I have finished an implementation for this issue. The code can be checked out as the HEAD of multidigest.

Commit: 3a5a310

At present, users are not given any control of the number of reads going into the markov model, nor the number of samples used to estimate occurrence rate. I considered doing a bootstrap here, with a less than exhaustive sample number.

Also, all reads which are read by the parser are used in the model, regardless of whether a user has requested a sub-sampling rate for the k-mer analysis. This is becauase the k-mer work (through Jellyfish) is the bottleneck, and we might as well have as many k-mer observations as possible being fed to the markov model.

I think it might be prudent to have some testing off the coverage rate for the model, when it is built. This could then be passed on or used as a means of generating a warning, if a very small read-set wsas provided.

cerebis commented 4 years ago

This idea has been explored and now abandoned.