Closed caseywdunn closed 10 months ago
Here is tail of output of first pcr:
The count of kmers containing primers have high coverage 22 relative to the coverage threshold of 3. Increasing min coverage to 4.
Creating graph, seeding with nodes that contain primer matches...
There are 10 start nodes
There are 8 end nodes
Extending the assembly graph...
Final extension statistics:
There are 5084 nodes in the graph
There are 5098 edges in the graph
There are 17 components in the graph
The graph does not have cycles
Number of descendants to a depth of 4, max 13 mean 4.04 median 4.0
There are 10 start nodes with edges
There are 1 end nodes with edges
done. Time to extend graph: 154.3497ms
Pruning the assembly graph...
There are 659 nodes in the graph
There are 658 edges in the graph
There are 1 start nodes
There are 1 end nodes
There are 1 components in the graph
done. Time to prune graph: 5.155ms
Traversing the assembly graph to find paths from forward to reverse primers...
There are 1 paths from forward to reverse primers in the graph
done. Time to traverse graph: 284.7µs
Generating sequences from paths...
>YPM-IZ-104464 co1bp0 product 0 length 688 kmer count stats mean 16.91 median 15 min 10 max 29
TCATAAAGATATAGGAACATTATACCTAGTCTTTGGTTTATTTTCAGGTATGGTAGGAACTGCTCTTAGTATGTTAATCAGGTTAGAGTTATCAGGACCCGGTACTATGTTTGGAGATGATCACCTTTATAATGTTATAGTAACTGCTCATGCCTTTGTCATGATCTTTTTCCTTGTAATGCCAGTCCTAATCGGAGGCTTCGGTAACTGGTTCGTACCCCTATTTATAGGTGCTCCGGATATGGCCTTCCCTAGGTTGAACAACCTAAGTTTTTGATTATTACCCCCTGCCTTATTACTACTATTAGGGTCATCCTTGATAGAACAAGGTGCAGGAACTGGTTGAACTGTTTACCCTCCTTTGTCTGGCCCCCAAACTCATTCTGGGGGATCAGTTGATATGGCTATTTTCAGTTTACACTGTGCGGGTGCCTCCTCAATTATGGGTGCTATTAACTTCATAACCACTATATTTAATATGAGGGCCCCTGGTATGACTATGGACAAGTTACCATTATTTGTCTGATCAGTTTTGATAACTGCCTTCCTCTTATTACTATCATTACCCGTGTTGGCCGGAGCTATAACTATGTTACTTACTGATAGAAATTTTAATACTACTTTCTTCGACCCTGCGGGAGGTGGTGATCCAGTTCTCTATCAACATTTGTTCTGGTTTTTTGGTCAC
done.
For gene co1bp0, 1 PCR products were generated and retained.
Here is tail of output for second PCR:
Running PCR on gene co1bp1
Removing kmers with coverage less than 3...
The number of unique kmers went from 50442347 to 3201796
Expanding the forward primer into all variants
Trimming the primer to CATAAAGATATAGGA so that it is within the trim length of 15.
There are 991 variants of the primer
Expanding the reverse primer into all variants
There are 991 variants of the primer
Finding kmers that contain the forward primer
TATAAAGATATCGGATCGCAGCAATCATTAA, count 3, keep false
CATAAAGATATAAAATTTATTGAAGATACTC, count 7, keep true
CATAAATATATAGGTTTATTAAAAGCGATAA, count 22, keep true
CATAAAAATATATGATATATTAAAAAGAAAA, count 8, keep true
CATAAACATTTAGGAATGGTTCTTGATTCCA, count 5, keep true
CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true
CATAAAGAGATCGGAAGAGCACACGTCTGAA, count 9, keep true
CATAAACATTTAGGAATGATTTTAGATAACA, count 4, keep true
CAAAAAGATATATGAATAGTTTTTTTCCAGA, count 5, keep true
CATAAAGATATAGGCCCTTCAAAACGAAGAA, count 8, keep true
CATAAAAATAAAGGAATTACTATTGTTCAAG, count 3, keep false
CATAAAGATATCAGAAAATACTAAGCTCATT, count 4, keep true
CATAAATATGTAGGAATATATTAGCCAAGTA, count 3, keep false
CATAAAGAGATAGCAAAACACATGGAGGGAT, count 4, keep true
Finding kmers that contain the reverse primer
GTCAATGCAAAAAAGCTGGTTTTTTGTGCAC, count 4, keep true
TAGGAGTTGCTCTTAATGGTGTTTTGGTAAC, count 3, keep true
GTCATAAAAACAGACATGGTTTTTTGGATAC, count 9, keep true
ACCATAGGTGCATAATTGTTTTTTTGGCCAC, count 3, keep true
TTTCTTGTATATGTCTTGTTTTTTTGGTCCC, count 12, keep true
TTATACTAAAAGTAAATACTTTTTTGGTCAC, count 3, keep true
CTATCAACATTTGTTCTGGTTTTTTGGTCAC, count 25, keep true
GAATTTGGTGATTGGGTGGAGTTTTGGTCAC, count 3, keep true
The count of kmers containing primers have high coverage 22 relative to the coverage threshold of 3. Increasing min coverage to 4.
Creating graph, seeding with nodes that contain primer matches...
There are 11 start nodes
There are 8 end nodes
Extending the assembly graph...
Evaluating extension:
There are 11280 nodes in the graph
There are 11261 edges in the graph
There are 19 components in the graph
The graph does not have cycles
Number of descendants to a depth of 4, max 340 mean 3.99 median 0.0
Removing 57280 nodes descended from 108 nodes with ballooning graph extension
Final extension statistics:
There are 1389 nodes in the graph
There are 1370 edges in the graph
There are 19 components in the graph
The graph does not have cycles
Number of descendants to a depth of 4, max 49 mean 3.90 median 4.0
There are 11 start nodes with edges
There are 0 end nodes with edges
done. Time to extend graph: 96.7752ms
Pruning the assembly graph...
There are 0 nodes in the graph
There are 0 edges in the graph
There are 0 start nodes
There are 0 end nodes
There are 0 components in the graph
done. Time to prune graph: 595.2µs
Traversing the assembly graph to find paths from forward to reverse primers...
There are 0 paths from forward to reverse primers in the graph
done. Time to traverse graph: 193.7µs
For gene co1bp1, no path was found from a forward primer binding site to a reverse binding
site. Abandoning PCR.
Suggested actions:
- The max_length for the PCR product of 1000 my be too short. Consider increasing it.
- The primers may have non-specific binding and are not close enough to generate a
product. Consider increasing the primer TRIM length from the default to
create a more specific primer.
- The maximum count of a forward kmer is 22 and of a reverse kmer is 25. Large
differences in value can indicate non-specific binding of one of the primers.
Impacts on second run of advancing the primer include:
So it appears that the code that removes a ballooning subgraph is removing a valid path.
This is odd:
Evaluating extension:
There are 11280 nodes in the graph
There are 11261 edges in the graph
There are 19 components in the graph
The graph does not have cycles
Number of descendants to a depth of 4, max 340 mean 3.99 median 0.0
Removing 57280 nodes descended from 108 nodes with ballooning graph extension
The depth is 4, and 4^4 is 256. This should be the max number of descendants. But there are 340 max. And the median is 0. So moving primer forward by 1 hay have led evaluation to fall at a different time and revealed a bug in balloon pruning.
Could be a bug in n_descendants()
, or in the addition of nodes to the graph.
Added some more node and edge summary stats to summarize_extension()
.
Summary with --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1"
:
Final extension statistics:
There are 1372 nodes in the graph
There are 1353 edges in the graph
There are 19 components in the graph
The graph does not have cycles
Max node degree 4, mean 0.99 median 1.0
Max edge count 80750, mean 703.85 median 10.0
Number of descendants to a depth of 4, max 49 mean 3.90 median 4.0
There are 11 start nodes with edges
There are 0 end nodes with edges
done. Time to extend graph: 142.401583ms
And for --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0"
Final extension statistics:
There are 5084 nodes in the graph
There are 5098 edges in the graph
There are 17 components in the graph
The graph does not have cycles
Max node degree 3, mean 1.00 median 1.0
Max edge count 148, mean 15.12 median 13.0
Number of descendants to a depth of 4, max 13 mean 4.04 median 4.0
There are 10 start nodes with edges
There are 1 end nodes with edges
done. Time to extend graph: 216.921542ms
So the failure in co1bp1
is associated with edges with very high count. Consistent with ballooning being the result of incorporation of a high abundance kmer (eg an adapter sequence) that leads to a highly branched graph.
This suggests that I can mitigate ballooning by excluding edges for kmers with very high counts.
But I still need to figure out why shifting the start position leads to ballooning.
Comparing the starting kmers of the two runs. Here they are for for co1bp0, indented are those that don't match co1bp1 from their second bp on:
Finding kmers that contain the forward primer
TCATAAAGATATAGGAACATTATACCTAGTC, count 16, keep true
TCATAAAAATATATGATATATTAAAAAGAAA, count 8, keep true
TCATAAACATTTAGGTTTAATTCTTGACTCT, count 9, keep true
TCATAAAGATATAAACTATATAGAAAACACT, count 4, keep true
TCATAAAGATATAAAATTTATTGAAGATACT, count 7, keep true
CCATAAAGATATAGGCCCTTCAAAACGAAGA, count 8, keep true
TTATAAAGATATACGCCCGATTAAACATATT, count 14, keep true
CCATAAATATATAGGTTTATTAAAAGCGATA, count 22, keep true
TCATAAATATACAGGTCCTTACAATCCATTA, count 5, keep true
TAATAAAGATATTGGTCTATAATTACTTTTT, count 5, keep true
The final product starts with TCATAAAGATATAGGAACATTATACCTAGTC
, which is the first kmer
And here are co1bp1 starting kmers:
Finding kmers that contain the forward primer
CATAAAGATATAAAATTTATTGAAGATACTC, count 7, keep true
CATAAATATATAGGTTTATTAAAAGCGATAA, count 22, keep true
CATAAAAATATATGATATATTAAAAAGAAAA, count 8, keep true
CATAAACATTTAGGAATGGTTCTTGATTCCA, count 5, keep true
CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true
CATAAACATTTAGGAATGATTTTAGATAACA, count 4, keep true
CATAAAGAGATCGGAAGAGCACACGTCTGAA, count 9, keep true
CAAAAAGATATATGAATAGTTTTTTTCCAGA, count 5, keep true
CATAAAGATATAGGCCCTTCAAAACGAAGAA, count 8, keep true
CATAAAGATATCAGAAAATACTAAGCTCATT, count 4, keep true
CATAAAGAGATAGCAAAACACATGGAGGGAT, count 4, keep true
The kmer CATAAAGATATAGGAACATTATACCTAGTCT
starts at the second basepair of the final product with co1bp0, so a viable starting primer is there.
Both graphs have the subkmer ATAAAGATATAGGAACATTATACCTAGTCT
. So something must happen that poisons this path in co1bp1
Used the following to track graph extension from the verbose output:
../target/release/sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0" ~/palmer_scratch/tmp/NA24_164_029_S8_L002_R1_001_1M.fastq > colbp0.log
../target/release/sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp1" ~/palmer_scratch/tmp/NA24_164_029_S8_L002_R1_001_1M.fastq > colbp1.log
python ../misc/path_finder.py colbp0.log ATAAAGATATAGGAACATTATACCTAGTCT
python ../misc/path_finder.py colbp1.log ATAAAGATATAGGAACATTATACCTAGTCT
In both cases, the full path of the gene was found. So it is not an extension problem - it is something about pruning. Could be for example that one of the different start kmers found with co1bp1 leads to ballooning, and the resulting pruning intended for other parts of the graph ends up clipping this desired path.
The following subkmers are found in the start kmers of co1bp1 but not co1bp0. Full kmer shown in second column. Parsed from output above.
CATAAACATTTAGGAATGGTTCTTGATTCC CATAAACATTTAGGAATGGTTCTTGATTCCA
CATAAACATTTAGGAATGATTTTAGATAAC CATAAACATTTAGGAATGATTTTAGATAACA
CATAAAGAGATCGGAAGAGCACACGTCTGA CATAAAGAGATCGGAAGAGCACACGTCTGAA
CAAAAAGATATATGAATAGTTTTTTTCCAG CAAAAAGATATATGAATAGTTTTTTTCCAGA
CATAAAGATATCAGAAAATACTAAGCTCAT CATAAAGATATCAGAAAATACTAAGCTCATT
CATAAAGAGATAGCAAAACACATGGAGGGA CATAAAGAGATAGCAAAACACATGGAGGGAT
None of the sub kmers from the left full kmers are present in the path in the verbose output. Suggests that they are not extended at all.
There is no evidence of ballooning in colbp1.log, would expect to see more subkmers with 3-4 candidates for extension:
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 1" colbp1.log
4965
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 2" colbp1.log
69
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 3" colbp1.log
2
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 4" colbp1.log
0
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 0" colbp1.log
38
Calls for co1bp1
were wrong in last few experiments, should be --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1"
Last few lines of python ../misc/path_finder.py colbp1.log ATAAAGATATAGGAACATTATACCTAGTCT
are:
CTAGTCTTTGGTTTATTTTCAGGTATGGTA sub_kmer being extended for node 1558. There are 1 candidate kmers for extension. Added sub_kmer TAGTCTTTGGTTTATTTTCAGGTATGGTAG for new node 1927 with edge kmer count 14. Path length is 25. There are now 369 unvisited and 1638 non-terminal nodes in the graph.
TAGTCTTTGGTTTATTTTCAGGTATGGTAG sub_kmer being extended for node 1927. There are 1 candidate kmers for extension. Added sub_kmer AGTCTTTGGTTTATTTTCAGGTATGGTAGG for new node 2324 with edge kmer count 14. Path length is 26. There are now 397 unvisited and 1841 non-terminal nodes in the graph.
AGTCTTTGGTTTATTTTCAGGTATGGTAGG sub_kmer being extended for node 2324. There are 1 candidate kmers for extension. Added sub_kmer GTCTTTGGTTTATTTTCAGGTATGGTAGGA for new node 3388 with edge kmer count 14. Path length is 27. There are now 1064 unvisited and 2801 non-terminal nodes in the graph.
GTCTTTGGTTTATTTTCAGGTATGGTAGGA sub_kmer being extended for node 3388. There are 1 candidate kmers for extension. Added sub_kmer TCTTTGGTTTATTTTCAGGTATGGTAGGAA for new node 6651 with edge kmer count 14. Path length is 28. There are now 3263 unvisited and 6007 non-terminal nodes in the graph.
So it adds TCTTTGGTTTATTTTCAGGTATGGTAGGAA
, which is only at a path length of 28. That is the only line that this subkmer is found on in the colbp1.log log - it is never revisited. So for some reason extension stops.
It is extended just fine in the co1bp0 run:
$ python ../misc/path_finder.py colbp0.log TCTTTGGTTTATTTTCAGGTATGGTAGGAA
File: colbp0.log
Seed: TCTTTGGTTTATTTTCAGGTATGGTAGGAA
TCTTTGGTTTATTTTCAGGTATGGTAGGAA
TCTTTGGTTTATTTTCAGGTATGGTAGGAA sub_kmer being extended for node 240. There are 1 candidate kmers for extension. Added sub_kmer CTTTGGTTTATTTTCAGGTATGGTAGGAAC for new node 247 with edge kmer count 14. Path length is 30. There are now 7 unvisited and 236 non-terminal nodes in the graph.
CTTTGGTTTATTTTCAGGTATGGTAGGAAC sub_kmer being extended for node 247. There are 1 candidate kmers for extension. Added sub_kmer TTTGGTTTATTTTCAGGTATGGTAGGAACT for new node 254 with edge kmer count 13. Path length is 31. There are now 7 unvisited and 243 non-terminal nodes in the graph.
...
So for some reason it is never revisited in co1bp1
So there is ballooning in colbp1, with many nodes with >1 degree:
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 0" colbp1.log
1670
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 1" colbp1.log
2125
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 2" colbp1.log
834
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 3" colbp1.log
624
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 4" colbp1.log
1401
Looking at subkmers in start kmers of colbp1 but not colbp0. These have some high degree nodes:
python ../misc/path_finder.py colbp1.log ATAAAGAGATCGGAAGAGCACACGTCTGAA
So working hypothesis is that colbp1 balloons because it has starting kmers that colbp0 does not, and these have high degree nodes. There is a bug that leads some valid paths to not be revisited for extension when another part of the graph is ballooning.
It is notable that some of these high degree nodes do not have particularly high edge counts. Suggests that filtering edges based on edge counts would not necessarily resolve ballooning.
Consider using Stable graph so edges don't change - https://docs.rs/petgraph/latest/petgraph/stable_graph/struct.StableGraph.html
Sam found that when the following is run:
The
co1bp0
primer pair gives a product and theco1bp1
does not, even thoughco1bp1
has the same reverse primer and the forward primer differs only by the addition of one more nucleotide, which should have the effect of advancing the trimmed primer by one. That advanced nucleotide is in the product given by the first primer. So the second primer is a nested primer.This is the kmer containing the primer in that first primer pair:
And this is the kmer containing the primer in the second primer pair: CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true
So they have the same count.