caseywdunn / sharkmer

3 stars 0 forks source link

nested primer gives no product #2

Closed caseywdunn closed 10 months ago

caseywdunn commented 11 months ago

Sam found that when the following is run:

sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0" --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1" .\NA24_164_029_S8_L002_R1_001_1M.fastq

The co1bp0 primer pair gives a product and the co1bp1 does not, even though co1bp1 has the same reverse primer and the forward primer differs only by the addition of one more nucleotide, which should have the effect of advancing the trimmed primer by one. That advanced nucleotide is in the product given by the first primer. So the second primer is a nested primer.

This is the kmer containing the primer in that first primer pair:

TCATAAAGATATAGGAACATTATACCTAGTC, count 16, keep true

And this is the kmer containing the primer in the second primer pair: CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true

So they have the same count.

caseywdunn commented 11 months ago

Here is tail of output of first pcr:

The count of kmers containing primers have high coverage 22 relative to the coverage threshold of 3.  Increasing min coverage to 4.
Creating graph, seeding with nodes that contain primer matches...
There are 10 start nodes
There are 8 end nodes
Extending the assembly graph...
Final extension statistics:
    There are 5084 nodes in the graph
    There are 5098 edges in the graph
    There are 17 components in the graph
    The graph does not have cycles
    Number of descendants to a depth of 4, max 13 mean 4.04 median 4.0
  There are 10 start nodes with edges
  There are 1 end nodes with edges
done.  Time to extend graph: 154.3497ms
Pruning the assembly graph...
  There are 659 nodes in the graph
  There are 658 edges in the graph
  There are 1 start nodes
  There are 1 end nodes
  There are 1 components in the graph
done.  Time to prune graph: 5.155ms
Traversing the assembly graph to find paths from forward to reverse primers...
  There are 1 paths from forward to reverse primers in the graph
done.  Time to traverse graph: 284.7µs
Generating sequences from paths...
>YPM-IZ-104464 co1bp0 product 0 length 688 kmer count stats mean 16.91 median 15 min 10 max 29
TCATAAAGATATAGGAACATTATACCTAGTCTTTGGTTTATTTTCAGGTATGGTAGGAACTGCTCTTAGTATGTTAATCAGGTTAGAGTTATCAGGACCCGGTACTATGTTTGGAGATGATCACCTTTATAATGTTATAGTAACTGCTCATGCCTTTGTCATGATCTTTTTCCTTGTAATGCCAGTCCTAATCGGAGGCTTCGGTAACTGGTTCGTACCCCTATTTATAGGTGCTCCGGATATGGCCTTCCCTAGGTTGAACAACCTAAGTTTTTGATTATTACCCCCTGCCTTATTACTACTATTAGGGTCATCCTTGATAGAACAAGGTGCAGGAACTGGTTGAACTGTTTACCCTCCTTTGTCTGGCCCCCAAACTCATTCTGGGGGATCAGTTGATATGGCTATTTTCAGTTTACACTGTGCGGGTGCCTCCTCAATTATGGGTGCTATTAACTTCATAACCACTATATTTAATATGAGGGCCCCTGGTATGACTATGGACAAGTTACCATTATTTGTCTGATCAGTTTTGATAACTGCCTTCCTCTTATTACTATCATTACCCGTGTTGGCCGGAGCTATAACTATGTTACTTACTGATAGAAATTTTAATACTACTTTCTTCGACCCTGCGGGAGGTGGTGATCCAGTTCTCTATCAACATTTGTTCTGGTTTTTTGGTCAC
done.
For gene co1bp0, 1 PCR products were generated and retained.
caseywdunn commented 11 months ago

Here is tail of output for second PCR:

Running PCR on gene co1bp1
Removing kmers with coverage less than 3...
  The number of unique kmers went from 50442347 to 3201796
Expanding the forward primer into all variants
  Trimming the primer to CATAAAGATATAGGA so that it is within the trim length of 15.
  There are 991 variants of the primer
Expanding the reverse primer into all variants
  There are 991 variants of the primer
Finding kmers that contain the forward primer
  TATAAAGATATCGGATCGCAGCAATCATTAA, count 3, keep false
  CATAAAGATATAAAATTTATTGAAGATACTC, count 7, keep true
  CATAAATATATAGGTTTATTAAAAGCGATAA, count 22, keep true
  CATAAAAATATATGATATATTAAAAAGAAAA, count 8, keep true
  CATAAACATTTAGGAATGGTTCTTGATTCCA, count 5, keep true
  CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true
  CATAAAGAGATCGGAAGAGCACACGTCTGAA, count 9, keep true
  CATAAACATTTAGGAATGATTTTAGATAACA, count 4, keep true
  CAAAAAGATATATGAATAGTTTTTTTCCAGA, count 5, keep true
  CATAAAGATATAGGCCCTTCAAAACGAAGAA, count 8, keep true
  CATAAAAATAAAGGAATTACTATTGTTCAAG, count 3, keep false
  CATAAAGATATCAGAAAATACTAAGCTCATT, count 4, keep true
  CATAAATATGTAGGAATATATTAGCCAAGTA, count 3, keep false
  CATAAAGAGATAGCAAAACACATGGAGGGAT, count 4, keep true
Finding kmers that contain the reverse primer
  GTCAATGCAAAAAAGCTGGTTTTTTGTGCAC, count 4, keep true
  TAGGAGTTGCTCTTAATGGTGTTTTGGTAAC, count 3, keep true
  GTCATAAAAACAGACATGGTTTTTTGGATAC, count 9, keep true
  ACCATAGGTGCATAATTGTTTTTTTGGCCAC, count 3, keep true
  TTTCTTGTATATGTCTTGTTTTTTTGGTCCC, count 12, keep true
  TTATACTAAAAGTAAATACTTTTTTGGTCAC, count 3, keep true
  CTATCAACATTTGTTCTGGTTTTTTGGTCAC, count 25, keep true
  GAATTTGGTGATTGGGTGGAGTTTTGGTCAC, count 3, keep true
The count of kmers containing primers have high coverage 22 relative to the coverage threshold of 3.  Increasing min coverage to 4.
Creating graph, seeding with nodes that contain primer matches...
There are 11 start nodes
There are 8 end nodes
Extending the assembly graph...
  Evaluating extension:
    There are 11280 nodes in the graph
    There are 11261 edges in the graph
    There are 19 components in the graph
    The graph does not have cycles
    Number of descendants to a depth of 4, max 340 mean 3.99 median 0.0
    Removing 57280 nodes descended from 108 nodes with ballooning graph extension
Final extension statistics:
    There are 1389 nodes in the graph
    There are 1370 edges in the graph
    There are 19 components in the graph
    The graph does not have cycles
    Number of descendants to a depth of 4, max 49 mean 3.90 median 4.0
  There are 11 start nodes with edges
  There are 0 end nodes with edges
done.  Time to extend graph: 96.7752ms
Pruning the assembly graph...
  There are 0 nodes in the graph
  There are 0 edges in the graph
  There are 0 start nodes
  There are 0 end nodes
  There are 0 components in the graph
done.  Time to prune graph: 595.2µs
Traversing the assembly graph to find paths from forward to reverse primers...
  There are 0 paths from forward to reverse primers in the graph
done.  Time to traverse graph: 193.7µs
For gene co1bp1, no path was found from a forward primer binding site to a reverse binding
site. Abandoning PCR.
  Suggested actions:
    - The max_length for the PCR product of 1000 my be too short. Consider increasing it.
    - The primers may have non-specific binding and are not close enough to generate a
      product. Consider increasing the primer TRIM length from the default to
      create a more specific primer.
      - The maximum count of a forward kmer is 22 and of a reverse kmer is 25. Large
        differences in value can indicate non-specific binding of one of the primers.
caseywdunn commented 11 months ago

Impacts on second run of advancing the primer include:

So it appears that the code that removes a ballooning subgraph is removing a valid path.

caseywdunn commented 11 months ago

This is odd:

  Evaluating extension:
    There are 11280 nodes in the graph
    There are 11261 edges in the graph
    There are 19 components in the graph
    The graph does not have cycles
    Number of descendants to a depth of 4, max 340 mean 3.99 median 0.0
    Removing 57280 nodes descended from 108 nodes with ballooning graph extension

The depth is 4, and 4^4 is 256. This should be the max number of descendants. But there are 340 max. And the median is 0. So moving primer forward by 1 hay have led evaluation to fall at a different time and revealed a bug in balloon pruning.

caseywdunn commented 11 months ago

Could be a bug in n_descendants(), or in the addition of nodes to the graph.

caseywdunn commented 10 months ago

Added some more node and edge summary stats to summarize_extension().

Summary with --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1":

Final extension statistics:
    There are 1372 nodes in the graph
    There are 1353 edges in the graph
    There are 19 components in the graph
    The graph does not have cycles
    Max node degree 4, mean 0.99 median 1.0
    Max edge count 80750, mean 703.85 median 10.0
    Number of descendants to a depth of 4, max 49 mean 3.90 median 4.0
  There are 11 start nodes with edges
  There are 0 end nodes with edges
done.  Time to extend graph: 142.401583ms

And for --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0"

Final extension statistics:
    There are 5084 nodes in the graph
    There are 5098 edges in the graph
    There are 17 components in the graph
    The graph does not have cycles
    Max node degree 3, mean 1.00 median 1.0
    Max edge count 148, mean 15.12 median 13.0
    Number of descendants to a depth of 4, max 13 mean 4.04 median 4.0
  There are 10 start nodes with edges
  There are 1 end nodes with edges
done.  Time to extend graph: 216.921542ms

So the failure in co1bp1 is associated with edges with very high count. Consistent with ballooning being the result of incorporation of a high abundance kmer (eg an adapter sequence) that leads to a highly branched graph.

This suggests that I can mitigate ballooning by excluding edges for kmers with very high counts.

But I still need to figure out why shifting the start position leads to ballooning.

caseywdunn commented 10 months ago

Comparing the starting kmers of the two runs. Here they are for for co1bp0, indented are those that don't match co1bp1 from their second bp on:

Finding kmers that contain the forward primer
  TCATAAAGATATAGGAACATTATACCTAGTC, count 16, keep true
  TCATAAAAATATATGATATATTAAAAAGAAA, count 8, keep true
  TCATAAACATTTAGGTTTAATTCTTGACTCT, count 9, keep true
    TCATAAAGATATAAACTATATAGAAAACACT, count 4, keep true
  TCATAAAGATATAAAATTTATTGAAGATACT, count 7, keep true
  CCATAAAGATATAGGCCCTTCAAAACGAAGA, count 8, keep true
    TTATAAAGATATACGCCCGATTAAACATATT, count 14, keep true
  CCATAAATATATAGGTTTATTAAAAGCGATA, count 22, keep true
    TCATAAATATACAGGTCCTTACAATCCATTA, count 5, keep true
    TAATAAAGATATTGGTCTATAATTACTTTTT, count 5, keep true

The final product starts with TCATAAAGATATAGGAACATTATACCTAGTC, which is the first kmer

And here are co1bp1 starting kmers:

Finding kmers that contain the forward primer
  CATAAAGATATAAAATTTATTGAAGATACTC, count 7, keep true
  CATAAATATATAGGTTTATTAAAAGCGATAA, count 22, keep true
  CATAAAAATATATGATATATTAAAAAGAAAA, count 8, keep true
  CATAAACATTTAGGAATGGTTCTTGATTCCA, count 5, keep true
  CATAAAGATATAGGAACATTATACCTAGTCT, count 16, keep true
  CATAAACATTTAGGAATGATTTTAGATAACA, count 4, keep true
  CATAAAGAGATCGGAAGAGCACACGTCTGAA, count 9, keep true
  CAAAAAGATATATGAATAGTTTTTTTCCAGA, count 5, keep true
  CATAAAGATATAGGCCCTTCAAAACGAAGAA, count 8, keep true
  CATAAAGATATCAGAAAATACTAAGCTCATT, count 4, keep true
  CATAAAGAGATAGCAAAACACATGGAGGGAT, count 4, keep true

The kmer CATAAAGATATAGGAACATTATACCTAGTCT starts at the second basepair of the final product with co1bp0, so a viable starting primer is there.

caseywdunn commented 10 months ago

Both graphs have the subkmer ATAAAGATATAGGAACATTATACCTAGTCT. So something must happen that poisons this path in co1bp1

caseywdunn commented 10 months ago

Used the following to track graph extension from the verbose output:

../target/release/sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp0" ~/palmer_scratch/tmp/NA24_164_029_S8_L002_R1_001_1M.fastq > colbp0.log
 ../target/release/sharkmer -k 31 --verbosity 10 -m 800000 -n 100 -t 8 -s YPM-IZ-104464 --pcr "TCATAAAGATATAGG_GTGACCAAAAAACCA_1000_co1bp1" ~/palmer_scratch/tmp/NA24_164_029_S8_L002_R1_001_1M.fastq > colbp1.log
python ../misc/path_finder.py colbp0.log ATAAAGATATAGGAACATTATACCTAGTCT
python ../misc/path_finder.py colbp1.log ATAAAGATATAGGAACATTATACCTAGTCT

In both cases, the full path of the gene was found. So it is not an extension problem - it is something about pruning. Could be for example that one of the different start kmers found with co1bp1 leads to ballooning, and the resulting pruning intended for other parts of the graph ends up clipping this desired path.

caseywdunn commented 10 months ago

The following subkmers are found in the start kmers of co1bp1 but not co1bp0. Full kmer shown in second column. Parsed from output above.

CATAAACATTTAGGAATGGTTCTTGATTCC  CATAAACATTTAGGAATGGTTCTTGATTCCA
CATAAACATTTAGGAATGATTTTAGATAAC  CATAAACATTTAGGAATGATTTTAGATAACA
CATAAAGAGATCGGAAGAGCACACGTCTGA  CATAAAGAGATCGGAAGAGCACACGTCTGAA
CAAAAAGATATATGAATAGTTTTTTTCCAG  CAAAAAGATATATGAATAGTTTTTTTCCAGA
CATAAAGATATCAGAAAATACTAAGCTCAT  CATAAAGATATCAGAAAATACTAAGCTCATT
CATAAAGAGATAGCAAAACACATGGAGGGA  CATAAAGAGATAGCAAAACACATGGAGGGAT

None of the sub kmers from the left full kmers are present in the path in the verbose output. Suggests that they are not extended at all.

caseywdunn commented 10 months ago

There is no evidence of ballooning in colbp1.log, would expect to see more subkmers with 3-4 candidates for extension:

(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 1" colbp1.log
4965
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 2" colbp1.log
69
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 3" colbp1.log
2
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 4" colbp1.log
0
(isoseq)[cwd7@r206u22n01.mccleary tmp]$ grep -c "There are 0" colbp1.log
38
caseywdunn commented 10 months ago

Calls for co1bp1 were wrong in last few experiments, should be --pcr "TCATAAAGATATAGGA_GTGACCAAAAAACCA_1000_co1bp1"

caseywdunn commented 10 months ago

Last few lines of python ../misc/path_finder.py colbp1.log ATAAAGATATAGGAACATTATACCTAGTCT are:

  CTAGTCTTTGGTTTATTTTCAGGTATGGTA sub_kmer being extended for node 1558. There are 1 candidate kmers for extension. Added sub_kmer TAGTCTTTGGTTTATTTTCAGGTATGGTAG for new node 1927 with edge kmer count 14. Path length is 25. There are now 369 unvisited and 1638 non-terminal nodes in the graph. 
  TAGTCTTTGGTTTATTTTCAGGTATGGTAG sub_kmer being extended for node 1927. There are 1 candidate kmers for extension. Added sub_kmer AGTCTTTGGTTTATTTTCAGGTATGGTAGG for new node 2324 with edge kmer count 14. Path length is 26. There are now 397 unvisited and 1841 non-terminal nodes in the graph. 
  AGTCTTTGGTTTATTTTCAGGTATGGTAGG sub_kmer being extended for node 2324. There are 1 candidate kmers for extension. Added sub_kmer GTCTTTGGTTTATTTTCAGGTATGGTAGGA for new node 3388 with edge kmer count 14. Path length is 27. There are now 1064 unvisited and 2801 non-terminal nodes in the graph. 
  GTCTTTGGTTTATTTTCAGGTATGGTAGGA sub_kmer being extended for node 3388. There are 1 candidate kmers for extension. Added sub_kmer TCTTTGGTTTATTTTCAGGTATGGTAGGAA for new node 6651 with edge kmer count 14. Path length is 28. There are now 3263 unvisited and 6007 non-terminal nodes in the graph. 

So it adds TCTTTGGTTTATTTTCAGGTATGGTAGGAA, which is only at a path length of 28. That is the only line that this subkmer is found on in the colbp1.log log - it is never revisited. So for some reason extension stops.

It is extended just fine in the co1bp0 run:

$ python ../misc/path_finder.py colbp0.log TCTTTGGTTTATTTTCAGGTATGGTAGGAA
File: colbp0.log
Seed: TCTTTGGTTTATTTTCAGGTATGGTAGGAA
TCTTTGGTTTATTTTCAGGTATGGTAGGAA
  TCTTTGGTTTATTTTCAGGTATGGTAGGAA sub_kmer being extended for node 240. There are 1 candidate kmers for extension. Added sub_kmer CTTTGGTTTATTTTCAGGTATGGTAGGAAC for new node 247 with edge kmer count 14. Path length is 30. There are now 7 unvisited and 236 non-terminal nodes in the graph. 
  CTTTGGTTTATTTTCAGGTATGGTAGGAAC sub_kmer being extended for node 247. There are 1 candidate kmers for extension. Added sub_kmer TTTGGTTTATTTTCAGGTATGGTAGGAACT for new node 254 with edge kmer count 13. Path length is 31. There are now 7 unvisited and 243 non-terminal nodes in the graph. 
...

So for some reason it is never revisited in co1bp1

caseywdunn commented 10 months ago

So there is ballooning in colbp1, with many nodes with >1 degree:

(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 0" colbp1.log 
1670
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 1" colbp1.log 
2125
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 2" colbp1.log 
834
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 3" colbp1.log 
624
(isoseq)[cwd7@r206u22n02.mccleary tmp]$ grep -c "There are 4" colbp1.log 
1401
caseywdunn commented 10 months ago

Looking at subkmers in start kmers of colbp1 but not colbp0. These have some high degree nodes:

python ../misc/path_finder.py colbp1.log ATAAAGAGATCGGAAGAGCACACGTCTGAA

So working hypothesis is that colbp1 balloons because it has starting kmers that colbp0 does not, and these have high degree nodes. There is a bug that leads some valid paths to not be revisited for extension when another part of the graph is ballooning.

It is notable that some of these high degree nodes do not have particularly high edge counts. Suggests that filtering edges based on edge counts would not necessarily resolve ballooning.

caseywdunn commented 10 months ago

Consider using Stable graph so edges don't change - https://docs.rs/petgraph/latest/petgraph/stable_graph/struct.StableGraph.html