Closed trvrb closed 9 years ago
I'll test this now.
great. I was a little overambitious and deleted a few things that ended up being needed. we can discuss overall organization of this and adjust as appropriate. similar modular structure could be done for the seq_util and tree_refine. (having H1N1 and other viruses in mind)
Eventually, it would be ideal to have .json control file for virus specific parameters in both augur and auspice.
But you would still need virus specific code, right? There are now vaccine strains for ebola or defined epitopes, gisaid parsing would always be flu specific etc. that can of course all be controlled with a json, but I think we need to provide the right structure.
Oh. Yeah. I think you're right. More elegant to write a few lines of virus-specific Python than to cram everything into some domain-specific json encoding of what needs to happen.
I envison smth like if control_json['virus']=='H3N2': import H3N2_filter as virus_filter else if .... import noro_filter as virus_filter ...
Cool. I like it.
Is there a reason you went with ../source-data/geo_synonyms.tsv
rather than source-data/geo_synonyms.tsv
? To run process.py
I now need to use augur/src/
as my working directory rather than augur/
.
just revert back. I meant to ask you from where you run it. I was mainly playing around in augur/src
Okay. Thanks.
I had just tried to run this from commit 66f41afd46ea2ae3f93ef6f8a4175d3e4c06bf20. However, bernoulli_frequency.py
is pulling up pdb
at line 197.
Sorry, it's actually farther down at line 226:
--- adding frequencies to tree global 18:16:27 ---
> /Users/trvrb/Dropbox/current-projects/nextflu/augur/src/bernoulli_frequency.py(226)estimate_sub_frequencies()
-> frequency_left *= (1.0-logit_inv(fe.pivot_freq))
(Pdb) c
Traceback (most recent call last):
File "src/bernoulli_frequency.py", line 550, in <module>
main()
File "src/bernoulli_frequency.py", line 544, in main
estimate_tree_frequencies(tree, threshold = 10, regions=regions, region_name=region_label)
File "src/bernoulli_frequency.py", line 291, in estimate_tree_frequencies
estimate_sub_frequencies(tree.seed_node, all_dates, reverse_order, threshold = threshold, region_name = region_name)
File "src/bernoulli_frequency.py", line 244, in estimate_sub_frequencies
estimate_sub_frequencies(child, all_dates, tip_to_date_index, threshold, region_name)
File "src/bernoulli_frequency.py", line 226, in estimate_sub_frequencies
frequency_left *= (1.0-logit_inv(fe.pivot_freq))
ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (39,)
This is from running the default:
There's strains India/6352/2012
and Xinjiang-Tianshan/1411/2012
that shouldn't be there.
shouldn't be there by which criterion?
I had missed the
- node.ep = epitope_distance(translate(node.seq), translate(root.seq))
- node.ne = nonepitope_distance(translate(node.seq), translate(root.seq))
- node.rb = receptor_binding_distance(translate(node.seq), translate(root.seq))
+ node.ep = epitope_distance(node.aa_seq, root.aa_seq)
+ node.ne = nonepitope_distance(node.aa_seq, root.aa_seq)
+ node.rb = receptor_binding_distance(node.aa_seq, root.aa_seq)
fix. I'll try running again.
shouldn't be there by which criterion?
The previous algorithm was deterministic. I hadn't noticed these outliers because of it.
I see. but the previous also had a shuffle somewhere. (before the length sort, I believe.
if you run it with --test, it will only load the tree, you can the do
my_nextflu.refine_tree()
and see whether the ep, ne, rb is there correctly (frequencies will require a rerun since the mutations are different)
I see. but the previous also had a shuffle somewhere. (before the length sort, I believe.
I think it's important to preserve the length criterion. This is just like preferring strains with HI data (which actually will be just HA1 because that's how the CCs roll). I'll revise the subsampling to pick viruses in order within each region but select regions randomly. I'll sort priority_viruses
to the top of the list in each region.
if you run it with --test, it will only load the tree, you can the do
my_nextflu.refine_tree()
and see whether the ep, ne, rb is there correctly (frequencies will require a rerun since the mutations are different)
Noob question, but what do you mean "and then you can do"? Is this running in the debugger? I don't know this side of python.
I do pretty much everything in interactive ipython. this then looks somewhat like this:
In [1]: run src/nextflu_process.py --test In [2]: len(my_nextflu.viruses) Out[2]: 401
In [3]: my_nextflu.refine_tree() --- Tree refine at 08:32:42 --- Remove outgroup outgroup A/Beijing/32/1992 not found Remove outlier branches Collapse internal nodes Ladderize tree Translate nucleotide sequences
In [4]: my_nextflu.tree.seed_node.aa_seq[:10] Out[4]: 'QKLPGNDNST'
that is why having return values and objects to inspect is very useful.
Wow. Yes. I need to get on this. Thanks for pointing me in the right direction.
I agree, we wan't the long ones. but we still do it before unique filtering. lines 25-35 in virus_filter.py
but of what remains, we sample randomly...
I guess we need some useful compromise between geographic representation, high quality sequences, and other things like HI. the latter might only be important for some aspects and by commenting out the line 'force_include' in the config, the pipeline is not going to care.
calling it a day.
Okay. I've been staring at this for a while and trying to the think about the logic and organization here (trying to do a code review). My main issue is that:
nextflu_config.py
, virus_filter.py
, virus_clean.py
, H3N2_filter.py
all of which have some H3N2-specific methods.I propose the following idea:
virus_filter.py
is class that has a whole bunch of filtering methods attached to it. Then process_H3N2.py
contains all the method calls to properly filter H3N2 data. What's currently in nextflu_config.py
and H3N2_filter.py
ends up in process_H3.py
. It's just one file that can easily be walked through to see the whole H3N2 pipeline. There could eventually be a process_N2.py
, process_H5.py
or process_ebola.py
scripts.I guess in this case, virus_filter.py
would still have self.viruses
that would be filtered down by calling things like self.filter_passage()
, and at the end of which would be passed on to instantiate virus_align.py
, etc...
I know this basically 764eae82d77c9d6b9222915d57a2f3c1ddfb0c56 as opposed to what you have now. And I'm sorry about this. This is just seeming much cleaner.
we should discuss this (probably easier via phone/skype), but I am against a loose collection of functions form which you pick. the result is going to be that viruses specific stuff is all over the place. I am advocating a hierarchical structure with generic and specific things implemented via classes and subclasses.
I think we totally agree about not wanting virus-specific methods spread all over the place. I was trying to think of a solution that consolidates H3-specific methods into a single class and H1-specific methods into a different class. I agree that it could make sense to have a flu-specific methods class to avoid code duplication.
I just didn't like how far I was having to go track down what's actually happening.
My analogy here is aiming for something like a BEAST XML file, where the specific analysis pipeline can be completely read by walking through the XML. Many of the these steps fairly large. But not distributed across a host of config
and H3N2_filter
files.
I'll try to think about nextflu_process
as a starting point. What about a process
class and a flu_process
subclass and a H3_process
subclass of this? Is this the level of hierarchy you were imagining? Or just process
and H3_process
?
I agree about talking this over.
So, subclassing at the process
level rather than the filter
, clean
, etc... level we're doing at the moment.
we could also retain the current virus_filter and tree_refine as generic tools, then have the h3n2 subclasses of this all in one file
in addition, we add to the H3N2.py then a process class (which will be largly problem specific anyway and just a few lines which inherits the filter and refine
class H3N2_process(H3N2_filter, H3N2_refine):
def __init__(self...)
def run(self):
self.filter() # inherited from H3N2_fitler.
self.align()
...
when do you want to talk?
I like these suggestions a lot. I can talk in ~1 hour if it works for you.
yes, this would work.
some other suggestions:
This was incorporated as part of #50.
Code cleanup