cbg-ethz / shorah

Repo for the software suite ShoRAH (Short Reads Assembly into Haplotypes)
GNU General Public License v3.0
41 stars 14 forks source link

posterior fraction > 1? #14

Closed bramvrancken closed 9 years ago

bramvrancken commented 9 years ago

Dear Dr. Zagordi,

I've been trying to apply Shorah to a pretty homogenous population of rabies virus (sequence length ~11kb). The data are from the Illumina Myseq patfom. it seems the analysis starts well. however, I get numerous warnings the frequency of haplotypes exceeds 1, and afterwards lots of error messages. Could you please point me to what error I'm making? Thanks a lot!

Bram

Frequency error message:

/Applications/_evolutionarySoftware/shorah-master/snv.py:127: UserWarning: posterior = 1.364 > 1 warnings.warn('posterior = %4.3f > 1' % post)

....

next error message:

Traceback (most recent call last): File "/Applications/_evolutionarySoftware/shorah-master/shorah.py", line 175, in mm.main('%s_cor.rest' % in_stem, maxhaplo=200) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 713, in main totalPaths = countPaths(graph, source, sink) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 630, in countPaths getPaths(source) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 627, in getPaths numPaths[node] = sum(map(getPaths, graph[node])) ....

and finally the crash message:

RuntimeError: maximum recursion depth exceeded in cmp

ozagordi commented 9 years ago

Hi, sorry for the late reply. The first error is not exactly a bug, rather a problem in probabilistic clustering. It could be ignored for posteriors only slightly higher than one, but in general it signals a poor mixing. The crash message at first sight seems saying that the program did not find a solution in the maximum number of iterations. You would need to tell me more of your problem and maybe provide me with a toy example. First of all, what window size did you choose? Do you really need global reconstruction or could you do with local only?

bramvrancken commented 9 years ago

Hello Osvaldo,

I'm sorry in turn for the late reply. Some details on the data that are hopefully of diagnostic use: it are paired end MySeq reads, ~145nt long. Through this link you can access the dropbox folder I created, and find the sequence data (Vphaser_input_sorted.bam) and the reference sequence (S3_ntCons.fasta). The various folders have the name of the options I tested. link: https://www.dropbox.com/sh/ao3q0iv3neq8vyr/AAD9p2Ii1ZTyT7sN0BTzdeQza?dl=0

Do you really need global reconstruction or could you do with local only This is a sample with little variation in the population. To illustrate I also included the overview per AA position (*.xls). Despite the risk for in silico recombinants, we prefer global reconstruction -after all, the last few variants are close to each other (within 1 read length).

Again thanks for looking into this.

Sincerely, Bram

On 17 Feb 2015, at 23:30, Osvaldo Zagordi wrote:

Hi, sorry for the late reply. The first error is not exactly a bug, rather a problem in probabilistic clustering. It could be ignored for posteriors only slightly higher than one, but in general it signals a poor mixing. The crash message at first sight seems saying that the program did not find a solution in the maximum number of iterations. You would need to tell me more of your problem and maybe provide me with a toy example. First of all, what window size did you choose? Do you really need global reconstruction or could you do with local only?

On 13 February 2015 at 22:37, bramvrancken notifications@github.com<mailto:notifications@github.com> wrote:

Dear Dr. Zagordi,

I've been trying to apply Shorah to a pretty homogenous population of rabies virus (sequence length ~11kb). The data are from the Illumina Myseq patfom. it seems the analysis starts well. however, I get numerous warnings the frequency of haplotypes exceeds 1, and afterwards lots of error messages. Could you please point me to what error I'm making? Thanks a lot! Bram Frequency error message:

/Applications/_evolutionarySoftware/shorah-master/snv.py:127: UserWarning: posterior = 1.364 > 1 warnings.warn('posterior = %4.3f > 1' % post) .... next error message:

Traceback (most recent call last): File "/Applications/_evolutionarySoftware/shorah-master/shorah.py", line 175, in mm.main('%s_cor.rest' % in_stem, maxhaplo=200) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 713, in main totalPaths = countPaths(graph, source, sink) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 630, in countPaths getPaths(source) File "/Applications/_evolutionarySoftware/shorah-master/mm.py", line 627, in getPaths numPaths[node] = sum(map(getPaths, graph[node])) .... and finally the crash message:

RuntimeError: maximum recursion depth exceeded in cmp

— Reply to this email directly or view it on GitHub https://github.com/ozagordi/shorah/issues/14.

Ciao. Osvaldo

— Reply to this email directly or view it on GitHubhttps://github.com/ozagordi/shorah/issues/14#issuecomment-74769131.

ozagordi commented 9 years ago

Bram, I looked at your alignment with samtools tview Vphaser_input_sorted.bam S3_ntCons.fasta There is hardly any variation.

Then I checked the coverage with samtools depth Vphaser_input_sorted.bam | less, and I realised that you have coverage only little above 100. Shorah is designed to work in heterogeneous samples at high coverage, so I'm really not sure it's the right tool for you. I don't think you can get more than a consensus sequence here.

bramvrancken commented 9 years ago

Hello Osvaldo,

I understand that the reconstruction is difficult when little variation is present, but was under the impression this has more to do with correctly linking variations separated by more than 1 read length than correctly inferring there is no variation. Is this view of the haplotype reconstruction problem wrong?

Thanks again for having a look at the data.

Best wishes, Bram

On 23 Feb 2015, at 10:00, Osvaldo Zagordi wrote:

Bram, I looked at your alignment with samtools tview Vphaser_input_sorted.bam S3_ntCons.fasta There is hardly any variation.

Then I checked the coverage with samtools depth Vphaser_input_sorted.bam | less, and I realised that you have coverage only little above 100. Shorah is designed to work in heterogeneous samples at high coverage, so I'm really not sure it's the right tool for you. I don't think you can get more than a consensus sequence here.

— Reply to this email directly or view it on GitHubhttps://github.com/ozagordi/shorah/issues/14#issuecomment-75508167.

ozagordi commented 9 years ago

You refer to having variants far apart and not knowing how to phase them. In this case in theory each local reconstruction should return one haplotype only. I saw this in one of your outputs, but you might check more extensively. Then the global reconstruction should phase them altogether in a single haplotype, but I don't think I ever tested the behaviour in this extreme case since this is more correctly done by standard denovo assemblers and it is not in the scope of the software.

Update: I checked and actually I had already opened an issue to fix the single haplotype/no reads problem issue #6