cbib / MICADo

Looking for mutations in PacBio cancer data: an alignment-free method

Bug in the sampler #18

Closed massyah closed 5 years ago

massyah commented 8 years ago

Understand why the sample C_SYNTHP53_13643_1000_50_3_1-1-1 doesn't have the alteration that C_SYNTHP53_13643_700_50_3_1-1-1 has. We should find the (injected and altered) k-mer TCCCCAGCCAAAGAAGAC in both files, but it is completely missing from the sample with 1000 reads, so micado doesn't find it.

massyah commented 8 years ago

Grep shows that the altered k-mer is missing in several samples with 50% alterations:

grep "TCCCCAGCCAAAGAAGAC" data/synthetic/reads/13643 -l
data/synthetic/reads/C_SYNTHP53_13643_1000_035_1_1-1-1.fastq
data/synthetic/reads/C_SYNTHP53_13643_1000_035_2_1-1-1.fastq
data/synthetic/reads/C_SYNTHP53_13643_1000_035_3_1-1-1.fastq
data/synthetic/reads/C_SYNTHP53_13643_1000_040_1_1-1-1.fastq
[...]
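A plain grep only matches the forward strand. As a rough cross-check, the same search could be scripted so that it also looks for the reverse complement and reports the files where the k-mer is absent. This is a minimal sketch, not part of MICADo: the function names are hypothetical, and it assumes plain four-line FASTQ records.

```python
# Hypothetical helpers (not MICADo code): find FASTQ files lacking a k-mer.
# Assumes four-line FASTQ records, with the sequence on the second line.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_reads_with_kmer(fastq_lines, kmer):
    """Count reads whose sequence contains the k-mer on either strand."""
    targets = {kmer, reverse_complement(kmer)}
    count = 0
    for index, line in enumerate(fastq_lines):
        # Only the second line of each four-line record holds the sequence.
        if index % 4 == 1 and any(target in line for target in targets):
            count += 1
    return count


def samples_missing_kmer(fastq_paths, kmer):
    """Return the sorted paths of FASTQ files with no read containing the k-mer."""
    missing = []
    for path in fastq_paths:
        with open(path) as handle:
            if count_reads_with_kmer(handle, kmer) == 0:
                missing.append(str(path))
    return sorted(missing)
```

Calling `samples_missing_kmer(paths, "TCCCCAGCCAAAGAAGAC")` on the synthetic read files would list the samples in which the injected alteration never appears.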

massyah commented 8 years ago

Analysis of the log of the sampler shows that as we increase the coverage, we increase the probability of picking reads mapping outside the region of interest, e.g. for 13643:

exec_logs/sampler_log_C_SYNTHP53_13643_1000_035_1_1-1-1.txt:2015-11-24 23:03:48,398 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_035_2_1-1-1.txt:2015-11-24 23:12:01,721 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_035_3_1-1-1.txt:2015-11-24 23:08:18,461 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_040_1_1-1-1.txt:2015-11-24 23:06:28,524 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_040_2_1-1-1.txt:2015-11-24 23:08:50,392 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_040_3_1-1-1.txt:2015-11-24 23:15:48,883 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_045_1_1-1-1.txt:2015-11-24 23:05:49,408 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_045_2_1-1-1.txt:2015-11-24 23:02:48,571 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_045_3_1-1-1.txt:2015-11-24 23:09:44,280 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_05_1_1-1-1.txt:2015-11-24 23:11:26,692 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_05_2_1-1-1.txt:2015-11-24 23:05:17,882 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_05_3_1-1-1.txt:2015-11-24 23:03:18,633 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_1000_10_1_1-1-1.txt:2015-11-24 23:15:07,383 - READSAMPLER - INFO - Will sample between 803 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_1000_10_2_1-1-1.txt:2015-11-24 23:02:52,669 - READSAMPLER - INFO - Will sample between 803 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_1000_10_3_1-1-1.txt:2015-11-24 23:11:55,515 - READSAMPLER - INFO - Will sample between 803 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_1000_50_1_1-1-1.txt:2015-11-24 23:07:10,597 - READSAMPLER - INFO - Will sample between 801 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_1000_50_2_1-1-1.txt:2015-11-24 23:13:20,414 - READSAMPLER - INFO - Will sample between 801 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_1000_50_3_1-1-1.txt:2015-11-24 23:11:54,097 - READSAMPLER - INFO - Will sample between 801 and 1838
exec_logs/sampler_log_C_SYNTHP53_13643_150_035_1_1-1-1.txt:2015-11-24 23:09:42,442 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_035_2_1-1-1.txt:2015-11-24 23:05:47,662 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_035_3_1-1-1.txt:2015-11-24 23:13:23,116 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_040_1_1-1-1.txt:2015-11-24 23:12:21,061 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_040_2_1-1-1.txt:2015-11-24 23:06:46,324 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_040_3_1-1-1.txt:2015-11-24 23:10:31,455 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_045_1_1-1-1.txt:2015-11-24 23:03:55,751 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_045_2_1-1-1.txt:2015-11-24 23:04:52,571 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_045_3_1-1-1.txt:2015-11-24 23:06:39,648 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_05_1_1-1-1.txt:2015-11-24 23:07:33,128 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_05_2_1-1-1.txt:2015-11-24 23:06:41,331 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_05_3_1-1-1.txt:2015-11-24 23:03:50,624 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_10_1_1-1-1.txt:2015-11-24 23:12:12,649 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_10_2_1-1-1.txt:2015-11-24 23:11:33,282 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_10_3_1-1-1.txt:2015-11-24 23:08:17,742 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_50_1_1-1-1.txt:2015-11-24 23:10:55,358 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_50_2_1-1-1.txt:2015-11-24 23:15:32,128 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_150_50_3_1-1-1.txt:2015-11-24 23:05:03,827 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_035_1_1-1-1.txt:2015-11-24 23:06:56,373 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_035_2_1-1-1.txt:2015-11-24 23:08:25,322 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_035_3_1-1-1.txt:2015-11-24 23:12:11,207 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_040_1_1-1-1.txt:2015-11-24 23:04:54,085 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_040_2_1-1-1.txt:2015-11-24 23:05:51,849 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_040_3_1-1-1.txt:2015-11-24 23:08:13,583 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_045_1_1-1-1.txt:2015-11-24 23:05:20,378 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_045_2_1-1-1.txt:2015-11-24 23:13:45,116 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_045_3_1-1-1.txt:2015-11-24 23:11:42,003 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_05_1_1-1-1.txt:2015-11-24 23:14:31,646 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_05_2_1-1-1.txt:2015-11-24 23:05:35,399 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_05_3_1-1-1.txt:2015-11-24 23:14:07,620 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_10_1_1-1-1.txt:2015-11-24 23:13:58,909 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_10_2_1-1-1.txt:2015-11-24 23:06:51,926 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_10_3_1-1-1.txt:2015-11-24 23:14:12,171 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_500_50_1_1-1-1.txt:2015-11-24 23:09:01,358 - READSAMPLER - INFO - Will sample between 803 and 1238
exec_logs/sampler_log_C_SYNTHP53_13643_500_50_2_1-1-1.txt:2015-11-24 23:14:18,958 - READSAMPLER - INFO - Will sample between 803 and 1238
exec_logs/sampler_log_C_SYNTHP53_13643_500_50_3_1-1-1.txt:2015-11-24 23:09:05,986 - READSAMPLER - INFO - Will sample between 803 and 1238
exec_logs/sampler_log_C_SYNTHP53_13643_700_035_1_1-1-1.txt:2015-11-24 23:05:46,351 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_035_2_1-1-1.txt:2015-11-24 23:07:40,720 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_035_3_1-1-1.txt:2015-11-24 23:10:00,400 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_040_1_1-1-1.txt:2015-11-24 23:08:40,592 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_040_2_1-1-1.txt:2015-11-24 23:15:20,234 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_040_3_1-1-1.txt:2015-11-24 23:13:47,712 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_045_1_1-1-1.txt:2015-11-24 23:08:24,551 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_045_2_1-1-1.txt:2015-11-24 23:12:41,146 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_045_3_1-1-1.txt:2015-11-24 23:04:50,055 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_05_1_1-1-1.txt:2015-11-24 23:15:49,762 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_05_2_1-1-1.txt:2015-11-24 23:13:06,948 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_05_3_1-1-1.txt:2015-11-24 23:06:53,080 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_10_1_1-1-1.txt:2015-11-24 23:06:30,928 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_10_2_1-1-1.txt:2015-11-24 23:09:08,884 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_10_3_1-1-1.txt:2015-11-24 23:13:29,926 - READSAMPLER - INFO - Will sample between 803 and 1237
exec_logs/sampler_log_C_SYNTHP53_13643_700_50_1_1-1-1.txt:2015-11-24 23:14:31,781 - READSAMPLER - INFO - Will sample between 801 and 1238
exec_logs/sampler_log_C_SYNTHP53_13643_700_50_2_1-1-1.txt:2015-11-24 23:06:08,475 - READSAMPLER - INFO - Will sample between 801 and 1238
exec_logs/sampler_log_C_SYNTHP53_13643_700_50_3_1-1-1.txt:2015-11-24 23:06:22,885 - READSAMPLER - INFO - Will sample between 801 and 1238
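
The outliers are easier to spot when the log lines above are grouped by their sampling range. The following is a minimal sketch that assumes the grep-style `file:message` layout shown in this comment; the regular expression and function name are illustrative and not part of the sampler.

```python
import re
from collections import defaultdict

# Matches grep-style lines such as:
# exec_logs/sampler_log_<sample>.txt:<timestamp> - READSAMPLER - INFO - Will sample between 803 and 1237
LOG_PATTERN = re.compile(
    r"sampler_log_(?P<sample>[^:]+)\.txt:.*"
    r"Will sample between (?P<start>\d+) and (?P<end>\d+)"
)


def group_samples_by_range(log_lines):
    """Map each (start, end) sampling range to the samples that used it."""
    groups = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Skip lines that are not range announcements.
        key = (int(match["start"]), int(match["end"]))
        groups[key].append(match["sample"])
    return dict(groups)
```

Any range whose upper bound is far above the others, such as the ones ending at 1838, points at the samples where reads may have been drawn from outside the region of interest.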

JRudewicz commented 5 years ago

The sampler method has changed. We now sample within each real sample, rather than from one big bulk sample obtained by merging all samples, as was done previously.
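
To illustrate the difference between the two strategies, here is a rough sketch and not the actual sampler code. The function names, the `reads_by_sample` dictionary and the use of `random.Random` are all assumptions made for the example.

```python
import random


def sample_from_pool(reads_by_sample, n_reads, rng):
    """Previous strategy (as described): draw from the merge of all samples."""
    pooled = [read for reads in reads_by_sample.values() for read in reads]
    return rng.sample(pooled, n_reads)


def sample_per_sample(reads_by_sample, n_reads, rng):
    """Current strategy (as described): draw separately within each real sample."""
    return {
        name: rng.sample(reads, min(n_reads, len(reads)))
        for name, reads in reads_by_sample.items()
    }


# Hypothetical toy input: two samples with three reads each.
reads_by_sample = {
    "sample_a": ["a1", "a2", "a3"],
    "sample_b": ["b1", "b2", "b3"],
}
rng = random.Random(0)
pooled_draw = sample_from_pool(reads_by_sample, 4, rng)
per_sample_draw = sample_per_sample(reads_by_sample, 2, rng)
```

With the per-sample variant, every drawn read is guaranteed to come from the sample it is reported under, whereas a pooled draw can mix reads from any of the merged samples.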