Clean up - Githubissues

trvrb commented 9 years ago

Code cleanup

trvrb commented 9 years ago

I'll test this now.

rneher commented 9 years ago

great. I was a little overambitious and deleted a few things that ended up being needed. we can discuss overall organization of this and adjust as appropriate. similar modular structure could be done for the seq_util and tree_refine. (having H1N1 and other viruses in mind)

trvrb commented 9 years ago

Eventually, it would be ideal to have .json control file for virus specific parameters in both augur and auspice.

rneher commented 9 years ago

But you would still need virus specific code, right? There are now vaccine strains for ebola or defined epitopes, gisaid parsing would always be flu specific etc. that can of course all be controlled with a json, but I think we need to provide the right structure.

trvrb commented 9 years ago

Oh. Yeah. I think you're right. More elegant to write a few lines of virus-specific Python than to cram everything into some domain-specific json encoding of what needs to happen.

rneher commented 9 years ago

I envison smth like if control_json['virus']=='H3N2': import H3N2_filter as virus_filter else if .... import noro_filter as virus_filter ...

trvrb commented 9 years ago

Cool. I like it.

trvrb commented 9 years ago

Is there a reason you went with ../source-data/geo_synonyms.tsv rather than source-data/geo_synonyms.tsv? To run process.py I now need to use augur/src/ as my working directory rather than augur/.

rneher commented 9 years ago

just revert back. I meant to ask you from where you run it. I was mainly playing around in augur/src

trvrb commented 9 years ago

Okay. Thanks.

trvrb commented 9 years ago

I had just tried to run this from commit 66f41afd46ea2ae3f93ef6f8a4175d3e4c06bf20. However, bernoulli_frequency.py is pulling up pdb at line 197.

trvrb commented 9 years ago

Sorry, it's actually farther down at line 226:

--- adding frequencies to tree global 18:16:27 ---
> /Users/trvrb/Dropbox/current-projects/nextflu/augur/src/bernoulli_frequency.py(226)estimate_sub_frequencies()
-> frequency_left *= (1.0-logit_inv(fe.pivot_freq))
(Pdb) c
Traceback (most recent call last):
  File "src/bernoulli_frequency.py", line 550, in <module>
    main()
  File "src/bernoulli_frequency.py", line 544, in main
    estimate_tree_frequencies(tree, threshold = 10, regions=regions, region_name=region_label)
  File "src/bernoulli_frequency.py", line 291, in estimate_tree_frequencies
    estimate_sub_frequencies(tree.seed_node, all_dates, reverse_order, threshold = threshold, region_name = region_name)
  File "src/bernoulli_frequency.py", line 244, in estimate_sub_frequencies
    estimate_sub_frequencies(child, all_dates, tip_to_date_index, threshold, region_name)
  File "src/bernoulli_frequency.py", line 226, in estimate_sub_frequencies
    frequency_left *= (1.0-logit_inv(fe.pivot_freq))
ValueError: non-broadcastable output operand with shape () doesn't match the broadcast shape (39,)

trvrb commented 9 years ago

This is from running the default:

tree

There's strains India/6352/2012 and Xinjiang-Tianshan/1411/2012 that shouldn't be there.

rneher commented 9 years ago

shouldn't be there by which criterion?

trvrb commented 9 years ago

I had missed the

-       node.ep = epitope_distance(translate(node.seq), translate(root.seq))
-       node.ne = nonepitope_distance(translate(node.seq), translate(root.seq))
-       node.rb = receptor_binding_distance(translate(node.seq), translate(root.seq))
+       node.ep = epitope_distance(node.aa_seq, root.aa_seq)
+       node.ne = nonepitope_distance(node.aa_seq, root.aa_seq)
+       node.rb = receptor_binding_distance(node.aa_seq, root.aa_seq)

fix. I'll try running again.

trvrb commented 9 years ago

shouldn't be there by which criterion?

The previous algorithm was deterministic. I hadn't noticed these outliers because of it.

rneher commented 9 years ago

I see. but the previous also had a shuffle somewhere. (before the length sort, I believe.

rneher commented 9 years ago

if you run it with --test, it will only load the tree, you can the do

my_nextflu.refine_tree()

and see whether the ep, ne, rb is there correctly (frequencies will require a rerun since the mutations are different)

trvrb commented 9 years ago

I see. but the previous also had a shuffle somewhere. (before the length sort, I believe.

I think it's important to preserve the length criterion. This is just like preferring strains with HI data (which actually will be just HA1 because that's how the CCs roll). I'll revise the subsampling to pick viruses in order within each region but select regions randomly. I'll sort priority_viruses to the top of the list in each region.

trvrb commented 9 years ago

if you run it with --test, it will only load the tree, you can the do my_nextflu.refine_tree() and see whether the ep, ne, rb is there correctly (frequencies will require a rerun since the mutations are different)

Noob question, but what do you mean "and then you can do"? Is this running in the debugger? I don't know this side of python.

rneher commented 9 years ago

I do pretty much everything in interactive ipython. this then looks somewhat like this:

In [1]: run src/nextflu_process.py --test In [2]: len(my_nextflu.viruses) Out[2]: 401

In [3]: my_nextflu.refine_tree() --- Tree refine at 08:32:42 --- Remove outgroup outgroup A/Beijing/32/1992 not found Remove outlier branches Collapse internal nodes Ladderize tree Translate nucleotide sequences

In [4]: my_nextflu.tree.seed_node.aa_seq[:10] Out[4]: 'QKLPGNDNST'

that is why having return values and objects to inspect is very useful.

trvrb commented 9 years ago

Wow. Yes. I need to get on this. Thanks for pointing me in the right direction.

rneher commented 9 years ago

I agree, we wan't the long ones. but we still do it before unique filtering. lines 25-35 in virus_filter.py

but of what remains, we sample randomly...

rneher commented 9 years ago

I guess we need some useful compromise between geographic representation, high quality sequences, and other things like HI. the latter might only be important for some aspects and by commenting out the line 'force_include' in the config, the pipeline is not going to care.

rneher commented 9 years ago

calling it a day.

trvrb commented 9 years ago

Okay. I've been staring at this for a while and trying to the think about the logic and organization here (trying to do a code review). My main issue is that:

I don't like how scattered virus-specific configuration is going to be. We're looking at nextflu_config.py, virus_filter.py, virus_clean.py, H3N2_filter.py all of which have some H3N2-specific methods.

I propose the following idea:

Something like virus_filter.py is class that has a whole bunch of filtering methods attached to it. Then process_H3N2.py contains all the method calls to properly filter H3N2 data. What's currently in nextflu_config.py and H3N2_filter.py ends up in process_H3.py. It's just one file that can easily be walked through to see the whole H3N2 pipeline. There could eventually be a process_N2.py, process_H5.py or process_ebola.py scripts.

I guess in this case, virus_filter.py would still have self.viruses that would be filtered down by calling things like self.filter_passage(), and at the end of which would be passed on to instantiate virus_align.py, etc...

I know this basically 764eae82d77c9d6b9222915d57a2f3c1ddfb0c56 as opposed to what you have now. And I'm sorry about this. This is just seeming much cleaner.

rneher commented 9 years ago

we should discuss this (probably easier via phone/skype), but I am against a loose collection of functions form which you pick. the result is going to be that viruses specific stuff is all over the place. I am advocating a hierarchical structure with generic and specific things implemented via classes and subclasses.

trvrb commented 9 years ago

I think we totally agree about not wanting virus-specific methods spread all over the place. I was trying to think of a solution that consolidates H3-specific methods into a single class and H1-specific methods into a different class. I agree that it could make sense to have a flu-specific methods class to avoid code duplication.

I just didn't like how far I was having to go track down what's actually happening.

My analogy here is aiming for something like a BEAST XML file, where the specific analysis pipeline can be completely read by walking through the XML. Many of the these steps fairly large. But not distributed across a host of config and H3N2_filter files.

I'll try to think about nextflu_process as a starting point. What about a process class and a flu_process subclass and a H3_process subclass of this? Is this the level of hierarchy you were imagining? Or just process and H3_process?

I agree about talking this over.

trvrb commented 9 years ago

So, subclassing at the process level rather than the filter, clean, etc... level we're doing at the moment.

rneher commented 9 years ago

we could also retain the current virus_filter and tree_refine as generic tools, then have the h3n2 subclasses of this all in one file

in addition, we add to the H3N2.py then a process class (which will be largly problem specific anyway and just a few lines which inherits the filter and refine

 class H3N2_process(H3N2_filter, H3N2_refine):
          def __init__(self...)

       def run(self):
            self.filter()   # inherited from H3N2_fitler. 
            self.align()

...

when do you want to talk?

trvrb commented 9 years ago

I like these suggestions a lot. I can talk in ~1 hour if it works for you.

rneher commented 9 years ago

yes, this would work.

rneher commented 9 years ago

some other suggestions:

while in augur, use pickle instead of json
viruses could be an alignment object. this would have all the io methods we need
add some statistic/diagnostic functions to the pipeline, i.e. summarize how many viruses from whcih region isn whcih month are included and write to txt

trvrb commented 9 years ago

This was incorporated as part of #50.

blab / nextflu

Clean up #43