cerebis closed this issue 4 years ago.
Note: I am working on the code, so no need to make commits, just explain here if you can.
Also, I think the `np.random.choice` should be done with replacement?
`kmer_probs` is a poor name for that variable... it never gets used as a probability distribution with `np.random.choice` or anywhere else; instead it's just meant to create a list of all k-mers that have the same length as the junction seq. At the very least it should be renamed to something like `kmer_list`, or alternatively this section could be rewritten in a way that's more pythonic. As you know, my python is still n00b level.
And yes, `np.random.choice` should be done with replacement. Isn't that the default?
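For what it's worth, a minimal sketch of the more pythonic rewrite suggested above (the names `all_kmers`, `kmer_list` and `n_samples`, and the fixed `k`, are illustrative only, not from the qc3C code): enumerate every k-mer over the DNA alphabet with `itertools.product`, then draw the starting k-mers uniformly; replacement is numpy's default (`replace=True`).

```python
import itertools
import numpy as np

def all_kmers(k, alphabet='ACGT'):
    """Enumerate every possible k-mer over the given alphabet."""
    return [''.join(p) for p in itertools.product(alphabet, repeat=k)]

# e.g. for an 8-mer junction sequence
k = 8
kmer_list = all_kmers(k)
assert len(kmer_list) == 4 ** k

# draw starting k-mers uniformly, with replacement (numpy's default)
n_samples = 1000
starting_kmers = np.random.choice(kmer_list, size=n_samples, replace=True)
```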
In this method, the loop that normalises counts (lines 342-343) isn't used later. Is this just lingering code, or was it intended to be used in the normalisation of transition probabilities?
Yes, it's meant to normalize the transition probs. `trans_prob` gets returned by the function and then used on line 385 (https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L385) to select the next nucleotide in the simulated sequence, based on the current k-mer. But it's certainly possible that there's something subtle that causes it to not work as intended.
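To make the intent concrete, a minimal sketch of that normalisation (the structure is assumed: here `trans_count` maps each k-mer to raw successor counts for A/C/G/T, which may not match the actual qc3C layout):

```python
import numpy as np

NUCS = list('ACGT')

def normalise_counts(trans_count):
    """Turn per-k-mer successor counts {kmer: [nA, nC, nG, nT]}
    into probability rows that sum to 1."""
    trans_prob = {}
    for kmer, counts in trans_count.items():
        counts = np.asarray(counts, dtype=float)
        # NB: counts.sum() can be zero for unseen k-mers -- see the
        # zero-probability discussion further down this thread
        trans_prob[kmer] = counts / counts.sum()
    return trans_prob

# downstream use, e.g. extending a simulated sequence:
# next_base = np.random.choice(NUCS, p=trans_prob[current_kmer])
```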
In fact, I see a bug already on that line of code -- the index into `sim_seq` should instead be a range from `i-k` to `i-k-1`.
Yes, I encountered that bug earlier today, as I have been reworking and testing this new code.
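As a cross-check on the intended logic, a sketch of the extension step (hypothetical names; this is not the qc3C implementation): conditioning on the last k simulated bases via a slice from the end sidesteps the off-by-one index arithmetic entirely.

```python
import numpy as np

def simulate_sequence(start_kmer, trans_prob, length, nucs=list('ACGT')):
    """Extend a starting k-mer base by base with the Markov model.

    Assumes trans_prob maps every k-mer to a length-4 probability
    vector over A/C/G/T for the next base.
    """
    k = len(start_kmer)
    sim_seq = list(start_kmer)
    while len(sim_seq) < length:
        # the conditioning context is always the last k bases
        prev_kmer = ''.join(sim_seq[-k:])
        sim_seq.append(np.random.choice(nucs, p=trans_prob[prev_kmer]))
    return ''.join(sim_seq)
```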
There are definitely some performance issues to sort out, as the estimation step is currently much slower than I would have expected. I haven't yet done any profiling.
In the current version, I have split creating `trans_prob` from the reads and estimating the junction probability into separate steps, largely because I want to test them independently.
Right now, I find that the estimated probs for an 8-mer junction are lower than I would have expected. There is possibly a problem with the code; as you can see from my questions, there were some issues. Or I may simply have intuited the wrong answer.
I'll check it in shortly, getting called away.
I have finished an implementation for this issue. The code can be checked out as the HEAD of `multidigest`.
Commit: 3a5a310
At present, users are not given any control over the number of reads going into the Markov model, nor over the number of samples used to estimate the occurrence rate. I considered doing a bootstrap here, with a less-than-exhaustive sample number.
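If the bootstrap idea is picked up again, a hedged sketch of what it might look like (function name, sample sizes, and the 0/1 observation encoding are all illustrative assumptions): resample the per-read observations with replacement and report a percentile interval for the occurrence rate.

```python
import numpy as np

def bootstrap_rate(observations, n_boot=500, ci=95, seed=None):
    """Bootstrap an occurrence rate from per-read 0/1 observations.

    observations: array-like of 1 (junction seen) / 0 (not seen).
    Returns (point_estimate, lower, upper) for the given CI width.
    """
    rng = np.random.default_rng(seed)
    obs = np.asarray(observations)
    rates = np.empty(n_boot)
    for b in range(n_boot):
        # resample with replacement, same size as the original set
        sample = rng.choice(obs, size=obs.size, replace=True)
        rates[b] = sample.mean()
    half = (100 - ci) / 2
    lo, hi = np.percentile(rates, [half, 100 - half])
    return obs.mean(), lo, hi
```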
Also, all reads seen by the parser are used in the model, regardless of whether a user has requested a sub-sampling rate for the k-mer analysis. This is because the k-mer work (through Jellyfish) is the bottleneck, and we might as well feed as many k-mer observations as possible to the Markov model.
I think it might be prudent to test the coverage rate of the model when it is built. This could then be passed on, or used to generate a warning if a very small read-set was provided.
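One simple form such a check could take (a sketch under assumptions: the threshold and the `trans_count` layout are hypothetical): compare how many of the 4^k possible k-mers were actually observed, and warn below a cutoff.

```python
import logging

def check_model_coverage(trans_count, k, min_fraction=0.5):
    """Warn if too few of the 4**k possible k-mers were observed.

    trans_count: dict mapping each k-mer to its successor counts.
    Returns the observed fraction so callers can also pass it on.
    """
    observed = sum(1 for counts in trans_count.values() if sum(counts) > 0)
    fraction = observed / 4 ** k
    if fraction < min_fraction:
        logging.warning(
            'Markov model saw only %.1f%% of possible %d-mers; '
            'estimates may be unreliable for a small read-set',
            100 * fraction, k)
    return fraction
```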
This idea has been explored and now abandoned.
https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L363-L369
Was this meant to be equally likely, rather than zero? When you eventually use this dict, the probs are all zero, and `numpy.random.choice` will throw an exception if the list of probabilities does not sum to 1.

https://github.com/cerebis/qc3C/blob/26cbf7de5e407502b9f97f272e263d5f5c8059b0/qc3C/kmer_based.py#L373-L379
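If "equally likely" was indeed the intent, a hedged sketch of one way to set the dict up (reusing the all-k-mers idea from earlier in the thread; nothing here is the actual qc3C code):

```python
import itertools
import numpy as np

k = 8
kmer_list = [''.join(p) for p in itertools.product('ACGT', repeat=k)]

# initialise every k-mer with a uniform next-base distribution,
# rather than zeros that np.random.choice would reject
trans_prob = {kmer: np.full(4, 0.25) for kmer in kmer_list}

# a zero-count row no longer breaks the draw:
current_kmer = kmer_list[0]  # placeholder for the real context
next_base = np.random.choice(list('ACGT'), p=trans_prob[current_kmer])
```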
What I think you're trying to do is create a list of all possible k-mers, then draw a random set from that with uniform probability? Afterwards, each of these starting k-mers is expanded using the `trans_prob` dict?