Failing at SNP #0 at position 0, questions for understanding output

SamStudio8 / gretel

An algorithm for recovering haplotypes from metagenomes

https://gretel.readthedocs.io/en/latest/

MIT License

31 stars 4 forks source link

Failing at SNP #0 at position 0, questions for understanding output #29

Closed mherold1 closed 4 years ago

mherold1 commented 4 years ago

Hello,

thanks for providing the tool. What I would like to use it for is to determine haplotypes of marker genes (actually specific regions) from a mixture of strains (in this case 2 strains, 0.9 and 0.1 relative abundance). I had some issues when running gretel and was hoping for some advice. I am using the following version installed with conda: gretel 0.0.93

I tested separate mappings for each of my genes of interest. For the first gene there are several SNPs:

##fileformat=VCFv4.2
CAMP0325_717    627 .   G   A,T,C   0   .   INFO
CAMP0325_717    636 .   G   A,T,C   0   .   INFO
[...]
CAMP0325_717    955 .   G   A,T,C   0   .   INFO
CAMP0325_717    963 .   G   A,T,C   0   .   INFO
CAMP0325_717    987 .   G   A,T,C   0   .   INFO
CAMP0325_717    1035    .   G   A,T,C   0   .   INFO
[...]
CAMP0325_717    1102    .   G   A,T,C   0   .   INFO
[...]

When running gretel I get the following:

$ gretel mybam snps.vcf.gz contig_id -s 621 -e 1104 --master ref.fa -o out/
[FAIL] Unable to recover pairwise evidence concerning SNP #18 at position 1035
[...]

When I change to -s 621 -e 1000 only 1 haplotype is predicted even though in the pileup it seems there should be read evidence: pileup_screen_pgm

$ cat gretel.crumbs 
# 17    119 3   10.00
0   1   -3.01   -3.01   52.00

When using a different gene:

$zcat snps.vcf.gz
##fileformat=VCFv4.2
CAMP0075_53 615 .   G   A,T,C   0   .   INFO
CAMP0075_53 627 .   G   A,T,C   0   .   INFO
CAMP0075_53 663 .   G   A,T,C   0   .   INFO
CAMP0075_53 702 .   G   A,T,C   0   .   INFO
CAMP0075_53 792 .   G   A,T,C   0   .   INFO
CAMP0075_53 897 .   G   A,T,C   0   .   INFO
CAMP0075_53 1032    .   G   A,T,C   0   .   INFO
CAMP0075_53 1094    .   G   A,T,C   0   .   INFO
CAMP0075_53 1098    .   G   A,T,C   0   .   INFO
CAMP0075_53 1117    .   G   A,T,C   0   .   INFO
CAMP0075_53 1176    .   G   A,T,C   0   .   INFO
CAMP0075_53 1251    .   G   A,T,C   0   .   INFO
CAMP0075_53 1275    .   G   A,T,C   0   .   INFO

Running gretel fails for a SNP at position 0 which does not exist:

[NOTE] Setting Gretel.L to 2
[FAIL] Unable to recover pairwise evidence concerning SNP #0 at position 0
       Gretel needs every SNP to appear on a read with at least one other SNP, at least once.

general questions:

do SNPs have to co-occur on the same read or on the same read-pair to be counted?

questions concerning output:

Does snp.fasta contain the alleles of all variants for each haplotype?
Does out.fasta contain the reconstructed haplotype sequences for the specified regions (--master .fa replaced by alleles in snp.fasta?

For gretel.crumbs I am a bit confused with what is written in the documentation (https://gretel.readthedocs.io/en/latest/protocol.html#gretel-crumbs) For the example from above...:

$ cat gretel.crumbs 
# 17    119 3   10.00
0   1   -3.01   -3.01   52.00

... my understanding would be the following:

# 17:"17 variants" , 119: ? , 3: ? , 10.0: ?
0: index of iteration, 1: number of haplotypes, -3.01: weighted likelihood, -3.01: unweighted likelihood, 52.00: ?

is there a way to calculate haplotype frequencies from the output?

SamStudio8 commented 4 years ago

Hi @mherold1! Thanks for taking the time to open an issue with your questions:

Yes, SNPs need to co-occur on the same read to be counted. SNPs that appear alone are ignored, because they can't provide any pairwise evidence to be inserted into Hansel. A current shortcoming of Gretel is that it does not handle paired end reads because of the way the BAM parser is written. This is a planned improvement, but I am working on other tools right now.

I can see from your screenshot that there should be evidence of an alternative haplotype in the data that you have. However, we do not use the single position allele frequencies, but the pairwise observations. If the alternative allele at position 1035 is not observed with other SNPs often enough, then this is why the Hansel matrix will be depleted of the evidence. This is probably why you get the error when -e is set to 1104.

Correct, snp.fasta is just the sequence of SNPs and out.fasta are the actual haplotypes (with the non-SNP positions copied from your --master). This documentation needs improving as I only recently added the snp.fasta file.

The first line in the crumbs file gives some information about the observations added to Hansel. From this I can tell there were 17 SNPs, and 119 SNP pairs in your data from 3 reads (which doesn't seem right given your screenshot?) and the Hansel markov chain was set to 10 SNPs. We don't provide a mechanism to calculate absolute frequency, but we have shown that the likelihoods allow you to order haplotypes by their approximate frequency.

mherold1 commented 4 years ago

I can see from your screenshot that there should be evidence of an alternative haplotype in the data that you have. However, we do not use the single position allele frequencies, but the pairwise observations

If pairwise observations of presence and absence are considered the example should still be "resolvable" given the mapping in the screenshot no?

there were 17 SNPs, and 119 SNP pairs in your data from 3 reads (which doesn't seem right given your screenshot?)

yes this seems odd. Could it be that there is a problem using the information from the bam file correctly? I guess this is also related to the inability to connect SNPs lateron.

Also could this affect the failure from above:

[FAIL] Unable to recover pairwise evidence concerning SNP #0 at position 0

I tried running another mapping file. Here the reads have not been generated in-silico as for the previous example and it seems the problem does not occur here. Is there are difference that could affect the parsing of the bam-file?

excerpt from bam files: artificial reads:

NZ_CP010464_107047_0    97  CAMP0075_53 210 24  226M    =   851 858 CCAAGATGCGATTATTAAAGCGTGCGATAAAATTTTAGAAGGCGGATATTATGATCAATTTGTAGTAGATATGATCCAAGGAGGTGCTGGAACAAGTACAAATATGAATGCTAATGAAGTTATAGCAAACATAGGACTTGAACTTATGGGTCATAAAAAAGGTGAATACCAATATCTTCATCCAAATGATCATGTTAATTTAAGTCAAAGTACTAATGATGCTTAT  CACCCGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGG:GGGGGGGGGGGGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGCGGGGGGGDG<GGGFGGGGGGGGGG9FGGGFGGGEGGGGFGGGGEGCFGGGGGGEGGGGGGGGGFGDFGFFGFGGGGFAF>GGGGGFGFFGE>GGGGFFGE9GGCGGFCGGGGEDGG8EEGG?EGGGDGC  AS:i:-89    XN:i:0  XM:i:18 XO:i:0  XG:i:0  NM:i:18 MD:Z:9T11T17G11C14T8A5T2A26C14C2T4C6G8C14G2T5C29C21 YS:i:-5 YT:Z:DP RG:Z:3M-completegenomes-binary-miseq

NZ_CP010464_402816_1    97  CAMP0075_53 101 0   241M    =   1113    1215    TTTCTCATGATAGATTAAAAGATTTTCCTCGCTTTGTTAGAGCTTTGGCTAGAGTAAAAAAAGCTGCAGCTATGGCAAATCATGAATTAGGTCTTTTGGATAAAAATATCCAAGATGCGATTATTAAAGCGTGCGATAAAATTTTAGAAGGCGGATATTATGATCAATTTGTAGTAGATATGATCCAAGGAGGTGCTGGAACAAGTACAAATATGAATGCTAATGAAGTTATAGCAAACAT   CCCCCGGGEGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGEGGGGGFGGGGGGGGGGFGGGGGGGEGGGCFEGGFGGGGGGGGGGGGGGGGGGEGG<GCGGGGGFFGFGGGFFEGGFGGGGGGGGFGGGDGGGGEGGGGGGGGGFGGGGGGGGGGFGGGG;G>GDD@CGCGGGG3GFFGGGGFGFCGGGFFGEG;FGFGGGCG6FGAC=?AG2@G;CGC7CE=F9   AS:i:-141   XN:i:0  XM:i:29 XO:i:0  XG:i:0  NM:i:29 MD:Z:4A2C2G2G0C4G5C2A8A8A2A2G2T8A2T12T16A6G13T11T17G11C14T8A5T2A26C14C2T2   YS:i:-85    YT:Z:DP RG:Z:3M-completegenomes-binary-miseq

"real" data reads:

NB551409:21:HGHM3AFXY:3:21405:2120:11614    163 CAMP0075_53 134 23  143M    =   430 447 TTGTTAGAGCTTTGGCTAGAGTAAAAAAAGCTGCAGCTATGGCAAATCATGAATTAGGTCTTTTGGATAAAAATATCCAAGATGCGATTATTAAAGCGTGCGATAAAATTTTAGAAGGCGGGTATTATGATCAATTTGTAGTA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEAEEEEEEEEEE<EE<EAEAEAEAEEEAE/AAEAEEEEE/EAE/EEEEEAA/A<AEE<AEAEEE/EA/AA<AEAEE<AA AS:i:-77    XN:i:0  XM:i:16 XO:i:0  XG:i:0  NM:i:16 MD:Z:4A8A2A2G2T8A2T12T16A6G13T11T17G5A5C14T0    YS:i:-10    YT:Z:CP RG:Z:Spiked_Mix_P1_2
NB551409:21:HGHM3AFXY:4:11601:6253:13456    163 CAMP0075_53 134 0   150M    =   298 307 TTGTTAGAGCTTTGGCTAGAGTAAAAAAAGCTGCAGCTATGGCAAATCATGAATTAGGTCTTTTGGATAAAAATATCCAAGATGCGATTATTAAAGCGTGCGATAAAATTTTAGAAGGCGGGTATTATGATCAATTTGTAGTAGATATGA  AAAAAEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/<EEEEEEEE<EA<EEAEEA/EEE<EEAAEEEA/<EEEEAE/E/EEE<AEAAAEAE/AEAEA</<A<AAAE<<<AE  AS:i:-77    XN:i:0  XM:i:16 XO:i:0  XG:i:0  NM:i:16 MD:Z:4A8A2A2G2T8A2T12T16A6G13T11T17G5A5C14T7    YS:i:-50    YT:Z:CP RG:Z:Spiked_Mix_P1_2

A current shortcoming of Gretel is that it does not handle paired end reads because of the way the BAM parser is written

Would it make sense to join reads then when working with paired-end data?

Looking at the test I ran on the real data the haplotype prediction seems to work quite well, however here I have several false positive haplotypes (the first 3 sequences from the out.fasta are correct):

# 8     4086    1470    3.00
3       1       -1.11   -2.94   336.24
0       3       -1.20,-2.30,-2.08       -1.20,-1.20,-1.20       2573.86
1       1       -1.20   -2.77   850.52
13      1       -1.24   -3.04   9.89
12      1       -1.40   -1.38   85.28
5       1       -1.46   -3.73   145.67
8       1       -1.76   -1.77   32.41
10      1       -1.78   -2.08   21.83
6       2       -1.85,-2.45     -2.56,-2.56     61.39
11      1       -2.00   -2.28   18.97
9       1       -2.43   -2.17   31.75

What does the last column stand for and how do you usually filter the predictions?

SamStudio8 commented 4 years ago

Hi @mherold1, Sorry for the delay. It's hard to see from the two examples of two reads why they may not be read by Gretel but I'd be happy to look at more data. It's possible that pysam may be doing some filtering of reads but I wouldn't have thought this would be a large problem as I chose quite permissive filtering options in the constructor.

You may run into trouble joining your reads as the read structure used by Gretel at the beginning assumes that each (QNAME, FLAG) pair in the SAM is unique. This is something that I've been thinking about adding, but needs more thought.

I'm happy to see that you get the haplotypes you expect, AND they are ranked as the most likely! From the crumbs file you've provided, I see there are only 8 SNPs which is why all the likelihoods are not so helpful in differentiating your result. From experience, the likelihoods seem to work better when there are more SNPs, which is one of the reasons we use the gretel-snpper subtool to generate VCF files for Gretel.

The last column is the amount of evidence that was removed from the Hansel matrix as a result of that haplotype being generated. It's somewhat a proxy of confidence but is biased towards haplotypes that are generated sooner rather than later.

mherold1 commented 4 years ago

Thanks for the reply and further explanations.

I am having a bit of a hard time locating where in the load from bam function something might go wrong.

Files for the example: https://owncloud.lcsb.uni.lu/s/8KquFDLXptpuaAS password is: 123

Besides trying to run the function step by step (unsuccessfully), I also tried the following:

path = "3M-completegenomes-binary-miseq.sorted.bam"
contig = "CAMP0075_53"
start_pos = 609
end_pos = 1104
threads = 1
vcf_path = "snps.vcf.gz"

bam = pysam.AlignmentFile(path)
aaa=list()
for read in bam.fetch():
    aaa.append(read.query_name)

ff = util.load_from_bam(path, contig, start_pos, end_pos, load_vcf(vcf_path, contig, start_pos, end_pos), n_threads=threads, debug_reads=aaa)

NZ_CP010464_43163_0/1 0 1094 C
NZ_CP010464_43163_0/1 0 1098 G
RANK NZ_CP010464_43163_0/1 0 7
[None]
[None]
[NOTE] Loaded 1 breadcrumbs from 1 bread slices.
[NOTE] Setting Gretel.L to 2

when I use the other bam file...:

path="17051206.sorted.bam"

... I get the following:

[...]
MN00415:41:000H2CMJV:1:12108:1314:7163 1 897 T
RANK MN00415:41:000H2CMJV:1:12108:1314:7163 1 5
MN00415:41:000H2CMJV:1:11108:11147:5151 2 1094 T
MN00415:41:000H2CMJV:1:11108:11147:5151 2 1098 A
RANK MN00415:41:000H2CMJV:1:11108:11147:5151 2 7
MN00415:41:000H2CMJV:1:23103:3873:2674 2 1032 C
MN00415:41:000H2CMJV:1:23103:3873:2674 2 1094 T
MN00415:41:000H2CMJV:1:23103:3873:2674 2 1098 A
RANK MN00415:41:000H2CMJV:1:23103:3873:2674 2 6
MN00415:41:000H2CMJV:1:12103:4236:3357 2 1032 C
MN00415:41:000H2CMJV:1:12103:4236:3357 2 1094 T
MN00415:41:000H2CMJV:1:12103:4236:3357 2 1098 A
RANK MN00415:41:000H2CMJV:1:12103:4236:3357 2 6
MN00415:41:000H2CMJV:1:13102:4168:9526 1 1094 T
MN00415:41:000H2CMJV:1:13102:4168:9526 1 1098 A
RANK MN00415:41:000H2CMJV:1:13102:4168:9526 1 7
MN00415:41:000H2CMJV:1:13108:25550:4083 1 897 T
RANK MN00415:41:000H2CMJV:1:13108:25550:4083 1 5
MN00415:41:000H2CMJV:1:12110:16201:7833 1 1032 C
MN00415:41:000H2CMJV:1:12110:16201:7833 1 1094 T
MN00415:41:000H2CMJV:1:12110:16201:7833 1 1098 A
RANK MN00415:41:000H2CMJV:1:12110:16201:7833 1 6
MN00415:41:000H2CMJV:1:22103:13633:5451 1 1032 C
MN00415:41:000H2CMJV:1:22103:13633:5451 1 1094 T
MN00415:41:000H2CMJV:1:22103:13633:5451 1 1098 A
RANK MN00415:41:000H2CMJV:1:22103:13633:5451 1 6
MN00415:41:000H2CMJV:1:13104:17656:15164 2 1094 T
MN00415:41:000H2CMJV:1:13104:17656:15164 2 1098 A
RANK MN00415:41:000H2CMJV:1:13104:17656:15164 2 7
MN00415:41:000H2CMJV:1:22108:13551:4487 1 1094 T
MN00415:41:000H2CMJV:1:22108:13551:4487 1 1098 A
RANK MN00415:41:000H2CMJV:1:22108:13551:4487 1 7
[...]
[32]
[29]
[24]
[None]
[NOTE] Loaded 966 breadcrumbs from 427 bread slices.
[NOTE] Setting Gretel.L to 3

So it seems like it stops after trying to read in the first read in the first example. Could this be related to the names of the reads that contain underscores? I found this line that might be related but didn't find any line where underscores could prove to be problematic in the following code. I am also not sure where the / comes from in the debug_reads output.

A question regarding the number of SNPs: Would it be beneficial to have more uninformative SNPs, i.e. SNPs that are different from the artificial reference but have the same allele within the sample. So I could just chose a more dissimilar artificial reference?

SamStudio8 commented 4 years ago

I've downloaded your example data (which I note is the hiseq simulation, not the miseq - I don't know if that will turn out to be important here) and edited your example and it works for me!

from gretel import util as util                                                                                                                                                                                                                                                   
import pysam                                                                                                                                                                                                                                                                      

path="3M-completegenomes-binary-hiseq.sorted.bam"                                                                                                                                                                                                                                 
contig = "CAMP0075_53"                                                                                                                                                                                                                                                            
start_pos = 609                                                                                                                                                                                                                                                                   
end_pos = 1104                                                                                                                                                                                                                                                                    
threads = 1                                                                                                                                                                                                                                                                       
vcf_path = "snps.vcf.gz"                                                                                                                                                                                                                                                          

bam = pysam.AlignmentFile(path)                                                                                                                                                                                                                                                   
aaa=list()                                                                                                                                                                                                                                                                        
for read in bam.fetch():                                                                                                                                                                                                                                                          
    aaa.append(read.query_name)                                                                                                                                                                                                                                                   
ff = util.load_from_bam(path, contig, start_pos, end_pos, util.process_vcf(vcf_path, contig, start_pos, end_pos), n_threads=threads, debug_reads=aaa)

[272]
[270]
[249]
[247]
[234]
[218]
[208]
[195]
[192]
[161]
[157]
[107]
[106]
[98]
[83]
[75]
[74]
[67]
[54]
[19]
[11]
[0]
[None]
[NOTE] Loaded 309 breadcrumbs from 148 bread slices.
[NOTE] Setting Gretel.L to 3

I had to change the load_vcf to process_vcf which suggests you might be on a slightly different version to me, which is strange because 0.0.93 is the latest. Are you able to run gretel --version to make sure?

Update Ah, I see you may have just used my load_vcf function from the test suite? In that case, can you send through the miseq example file from your last post so we can check that too?

mherold1 commented 4 years ago

Update Ah, I see you may have just used my load_vcf function from the test suite? In that case, can you send through the miseq example file from your last post so we can check that too?

yes I did use the load_vcf function from the test suite, however this doesn't seem to make a difference. When I change this to util.process_vcf, I get the following:

path="3M-completegenomes-binary-miseq.sorted.bam"
>>> ff = util.load_from_bam(path, contig, start_pos, end_pos, util.process_vcf(vcf_path, contig, start_pos, end_pos), n_threads=threads, debug_reads=aaa)
NZ_CP010464_43163_0/1 0 1094 C
NZ_CP010464_43163_0/1 0 1098 G
RANK NZ_CP010464_43163_0/1 0 7
[None]
[NOTE] Loaded 1 breadcrumbs from 1 bread slices.
[NOTE] Setting Gretel.L to 2

Sorry for uploading the wrong mapping file (I also uploaded the miseq simulation). Actually the with the hiseq simulation the load_from_bam works and I get the same output as you. I noticed that in the debug output there is no / in the hiseq simulation. When I run gretel I get a slightly different number of reads:

gretel -s 609 -e 1104 3M-completegenomes-binary-hiseq.sorted.bam snps.vcf.gz CAMP0075_53 --master CAMP0075_53_aspA.fa

[...]
[NOTE] Loaded 128 breadcrumbs from 90 bread slices.
[NOTE] Setting Gretel.L to 3
[FAIL] Unable to recover pairwise evidence concerning SNP #5 at position 897
[...]

Is this because reads are not considered after the position that cant be resolved?

for the miseq simulation:

gretel -s 609 -e 1104 3M-completegenomes-binary-miseq.sorted.bam snps.vcf.gz CAMP0075_53 --master CAMP0075_53_aspA.fa

[None]
[NOTE] Loaded 0 breadcrumbs from 0 bread slices.
Traceback (most recent call last):
  File "/home/epi_mher/miniconda2/envs/py3/bin/gretel", line 12, in <module>
    sys.exit(main())
  File "/home/epi_mher/miniconda2/envs/py3/lib/python3.5/site-packages/gretel/cmd.py", line 76, in main
    hansel = util.load_from_bam(ARGS.bam, ARGS.contig, ARGS.start, ARGS.end, VCF_h, n_threads=ARGS.threads, debug_reads=debug_reads, debug_pos=debug_pos)
  File "/home/epi_mher/miniconda2/envs/py3/lib/python3.5/site-packages/gretel/util.py", line 321, in load_from_bam
    hansel.L = int(ceil(float(total_covered_snps.value)/n_reads.value))
ZeroDivisionError: float division by zero

$ conda activate py3
$ gretel --version
gretel 0.0.93

SamStudio8 commented 4 years ago

Easy one first - to answer the question of where the / comes from, turns out, it's actually in the BAM:

NZ_CP010464_43163_0/1   0       CAMP0075_53     1068    8       235M    *       0       0       ACAACTTAATGTTTTTGAACCAGTTGCAGCGTATAGCTTATTTAATTCTATTGTAATGCTAGAAAAAGCAATGTATACTTTAGCTGATAAATGTATAGATGGTATTACTGCAAATGAAAAAATTTGTTCAGACTTTGTCTATAATTCAGTAGGTATAGTAACAGCTTTAAATCCTTATATTGGCTATGAAAATTCAGCTTCTATAGCTAAAGAGGCAATGAACACTGGAAAAAGA     CCCCCGGGGGGDGGFGGGGGGGGGGGFGGGGGGGGGGGGGFGGGGGGGGGGGFGGGGGFGGGGFGGGGGGGGGGGEFFGGGGGGFGGGFGGGGGEGGGGGGGFFEGGFGG>GGGGGGGGDGGGDGGGBGGGGGGGGGGGGGGGF<EGGGGGGGGGGDGGFGGGGGGGGGFGGGGDGGFEGGFGEDGFCFGFG;FCFCGGFGGG>FEFG<FGBGC6E?FGGAC:=GC?GDCFEFGG     AS:i:-84        XN:i:0  XM:i:17 XO:i:0  XG:i:0  NM:i:17 MD:Z:26T3A18G47A37C44A2T14C0A0G3T11T4G0T2A2T2G3 YT:Z:UU RG:Z:3M-completegenomes-binary-miseq

I can see in Tablet that the reads are sorted and aligned along the reference. I've enhanced your example code and replaced the pysam fetch with pileup as that is actually what is used by gretel to traverse the region. It was then quick to confirm that the length of the pileup over all sites is just 1. pysam is filtering your reads in this case, leaving only one for us to read.

The default pileup stepper is the samtools stepper, which attempts to emulate the samtools pileup functionality. My guess is this is the cause of why a large number of your reads to be skipped. I added a line to the example code that printed the read name and alignment flag of every read at every pileup column (note that this number therefore reflects the number of bases, not the number of reads).

from gretel import util as util                                                                                                                                                                                                                                                   
import pysam                                                                                                                                                                                                                                                                      

path="gretel_example/3M-completegenomes-binary-miseq.sorted.bam"                                                                                                                                                                                                                  
contig = "CAMP0075_53"                                                                                                                                                                                                                                                            
start_pos = 609                                                                                                                                                                                                                                                                   
end_pos = 1104                                                                                                                                                                                                                                                                    
threads = 1                                                                                                                                                                                                                                                                       
vcf_path = "snps.vcf.gz"                                                                                                                                                                                                                                                          

bam = pysam.AlignmentFile(path)                                                                                                                                                                                                                                                   
aaa=[]                                                                                                                                                                                                                                                                            
for read in bam.fetch(contig=contig, start=start_pos-1, end=end_pos):                                                                                                                                                                                                             
    aaa.append(read.query_name)                                                                                                                                                                                                                                                   

for p_col in bam.pileup(reference=contig, start=start_pos-1, stop=end_pos, ignore_overlaps=False, min_base_quality=0, stepper="all"):                                                                                                                                             
    for pread in p_col.pileups:                                                                                                                                                                                                                                                   
        print(pread.alignment.query_name, pread.alignment.flag)

You can see the default stepper skips all the reads, except the one you observed, NZ_CP010464_43163_0/1 which has a flag of 0 in your BAM:

With default stepper

python test.py | cut -f2 -d' ' | sort | uniq -c
    235 0

...and setting the stepper to "all", does not:

With all stepper

python test.py | cut -f2 -d' ' | sort | uniq -c
    235 0
  12931 145
  37037 153
 119335 73
   6091 97

You can use the picard explain-flags tool to work out what these flags mean:

145 read paired (0x1) read reverse strand (0x10) second in pair (0x80)

153 read paired (0x1) mate unmapped (0x8) read reverse strand (0x10) second in pair (0x80)

73 read paired (0x1) mate unmapped (0x8) first in pair (0x40)

97 read paired (0x1) mate reverse strand (0x20) first in pair (0x40)

I believe these are filtered as the read in proper pair flag is not set. This is expected behaviour within pysam but arguably our users may not care. Are these improper pairs expected in your data?

SamStudio8 commented 4 years ago

@mherold1 In the mean time, if you don't really care about proper pairs, I've added a new option to gretel:

  --pepper              enable a more permissive pileup by setting the pysam
                        pileup stepper to 'all', instead of 'samtools'. Note
                        that this will allow improper pairs.

You can use this mode by passing stepper="all" to the load_from_bam function, or by adding --pepper to your gretel invocation. This is currently only available for gretel 0.0.94 from pip as it normally takes a while to update the conda package.

I've tested that this works more permissively on your data set:

[NOTE] Loaded 1081 breadcrumbs from 527 bread slices.
[NOTE] Setting Gretel.L to 3

mherold1 commented 4 years ago

Wow, thanks! I think with this the issue should be resolved. I will try the new version tomorrow.

Are these improper pairs expected in your data?

There are very few properly paired reads in the simulated datasets, some more in the hiseq simulation and 0 in the miseq. I assume this has to do with the insert size in the tool used for simulating the reads (https://github.com/HadrienG/InSilicoSeq) and the relatively short reference sequences (I used gene sequences around 1400bp, and then a region of 480bp for variant calling). The mean insert size for miseq is around 950 and for hiseq 490. I did not think about this beforehand.

SamStudio8 commented 4 years ago

@mherold1 Great! Let me know how you get on, and please don't hesitate to open more issues if you run into trouble!