NicolaDM / phastSim

fast sequence evolution simulation for SARS-CoV-2 and similar
GNU General Public License v3.0
16 stars 5 forks source link

Verbose output causes error and issues with mutation type output #13

Closed jmcbroome closed 2 years ago

jmcbroome commented 2 years ago

Hi folks, I've been testing an analysis recently that needs simulation of natural neutral nucleotide mutations across the genome with MAT output. I discovered to my dismay that the genome seems to be initialized as a string of all "A"s at the root, with every single mutation event outputting as A to C, G, or T. When I tried to use the "--verbose" output to elucidate the issue, it threw an error.

The command I used that produces all "A>" mutations is

python3 ~/Desktop/phastSim/bin/phastSim --output sim --reference NC_045512v2.fa --scale 0.01 --createMAT --treeFile newick_output.nwk --eteFormat 0 --mutationRates UNREST 0.008 0.09 0.013 0.012 0.003 0.54 0.055 0.003 0.069 0.015 0.181 0.01 

Simply adding "--verbose" to the end of this causes the following error:

Traceback (most recent call last):
  File "/Users/jmcbr/Desktop/phastSim/bin/phastSim", line 175, in <module>
    genome_tree.mutateBranchETEhierarchy(t, genome_tree.genomeRoot, 1, sim_run.args.createNewick, preMutationsBranches)
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1527, in mutateBranchETEhierarchy
    self.mutateBranchETEhierarchy(c, newGenomeNode, level + 1, createNewick, preMutationsBranches)
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1496, in mutateBranchETEhierarchy
    mutEvent = self.findPos(rand, newGenomeNode, level)
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1374, in findPos
    mutEvent = self.findPos(rand, newChild, level)
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1374, in findPos
    mutEvent = self.findPos(rand, newChild, level)
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1374, in findPos
    mutEvent = self.findPos(rand, newChild, level)
  [Previous line repeated 12 more times]
  File "/Users/jmcbr/opt/anaconda3/envs/sim/lib/python3.9/site-packages/phastSim/phastSim.py", line 1333, in findPos
    f" hyperCat {node.hyperCategories}"
AttributeError: 'genomeNode' object has no attribute 'hyperCategories'

While the issue with the reference seemingly being initialized as a string of 'A's instead of matching the reference fasta passed may be with how I've laid out my parameters (I would very much appreciate guidance on this front), I'm confident that the verbose output error is a true bug that should be addressed.

Thanks for all your hard work!

willboulton commented 2 years ago

Dear Jakob,

Thanks very much for submitting these issues! The error due to the --verbose argument is indeed a bug, I'll push some updates to the main branch including a fix for this issue.

Regarding the MAT output format it is great that this option is coming in useful, though of course annoying that it appears not to work properly. I think this is also a bug too due to the ref_nuc and par_nuc attributes in the protobuf not being set correctly and so defaulting to 0 as A. I'll include a fix for this too along with the other fix for the --verbose flag.

Best wishes, Will (wboulton12@gmail.com)

willboulton commented 2 years ago

Dear Jakob,

I've just pushed fixes for both of these two issues you submitted to the main repository.

Regarding the MAT format, the issue was to do with the fact that the protocol buffer field par_nuc of a mut object was set as repeated in phastSim, but is not in MatUtils. I had originally made this field repeated because I wanted to extend the MAT format to indels (at least in principle), and from the Google documentation it seemed as though this change should have been backwards compatible. However, it doesn't seem to be compatible, so I have used a protocol buffer almost identical to the MatUtils PB instead.

It would be great to hear your advice on how indels should be represented in a MAT; in phastSim, an indel is represented in a MAT by using -1 to represent a gap character. However, nested/overlapping mutations are quite tricky. This format is probably completely unreadable for MatUtils but at least in principle the data is there - and for simulations not involving indels, the MAT produced by phastSim should be compatible with MatUtils.

Best wishes, I imagine this will not be the last bug!

Will

jmcbroome commented 2 years ago

Ah, so it was an issue with the MAT output specifically! I'm so glad to hear that it's fixed- phastSim and VGSim together with MAT output have been excellent tools for generating null models when experimenting with statistics and approaches for associating epidemiological characteristics with phylogenetics. This will be especially helpful for grounding some idea I have about identifying directional selection.

In regards to indels, that's a difficult question that's currently under debate in our group. We need some way to track updates to coordinate changes relative to the reference, and things get specifically thorny when indels are nested. The most basic solution we've discussed is one where indels are represented as a character string and a starting coordinate, and when a new indel occurs overlapping an old one, we infer the new sequence and represent it again as a starting coordinate and a character string (e.g. an insertion of ACCG at coordinate 10, then TGG at coordinate 11, would lead to a downstream insertion of the sequence of ATGGCCG at coordinate 10 relative to the reference, and the MAT would have this event on the latter branch). This is simple, and likely would work okay for SARS-CoV-2 which has a relatively low rate of insertion/deletion events, but it has serious issues when large sections of the genome are inserted or deleted, prevents the proper tracking of single nucleotide mutations within indel sequences (which are just represented as updates to the character string), and quickly moves away from the efficiency we prize in our kit. A better solution would require a significant refactor of how states are constructed across the tree such that the coordinate system can be updated as the tree is traversed. We are potentially looking at putting a grad student on specifically this task over the coming year.

If you'd like, we can arrange a zoom meeting, or at least a set of emails, with Russ and a rotation student we've had working on the indel issue, sometime in the next couple weeks. They'll be able to explain where we're at more clearly. Reach out to me at jmcbroom@ucsc.edu if you're interested and we can try to coordinate something.

jmcbroome commented 2 years ago

On cursory inspection, looks like the genome is initialized and stored in the MAT correctly! Thanks