arborworkflows / aRbor

aRbor, an R package with useful functions for Arbor workflows
create general na functions for treedata #25

Open lukejharmon opened 9 years ago

lukejharmon commented 9 years ago

I think we need three cases: single column (checks for NAs, removes from data and tree as needed); pairwise (removes any taxa not present in BOTH, for things like PGLS); and multivariate (removes any incomplete taxa, for things like phyloPCA).
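To make the three cases concrete, here is a minimal base R sketch. The helper names (`complete_single`, `complete_pairwise`, `complete_multivariate`) are hypothetical and not part of aRbor, and the sketch assumes a data frame whose row names are the taxon labels. Each helper only returns the taxa to keep; pruning the tree to match is left as a comment.

```r
# Hypothetical helpers for illustration only; these are not aRbor functions.
# Assumption: `dat` is a data frame whose row names are taxon labels.

# Case 1: single column. Keep taxa that have a value in `col`.
complete_single <- function(dat, col) {
  rownames(dat)[!is.na(dat[[col]])]
}

# Case 2: pairwise. Keep taxa that have values in BOTH columns (e.g. PGLS).
complete_pairwise <- function(dat, col1, col2) {
  rownames(dat)[!is.na(dat[[col1]]) & !is.na(dat[[col2]])]
}

# Case 3: multivariate. Keep taxa with no NA in any of `cols` (e.g. phyloPCA).
complete_multivariate <- function(dat, cols = names(dat)) {
  rownames(dat)[stats::complete.cases(dat[, cols, drop = FALSE])]
}

# Made-up example data.
dat <- data.frame(
  SVL  = c(5.1, NA, 4.8, 6.0),
  mass = c(2.0, 3.1, NA, 2.7),
  tail = c(9.9, 8.7, 9.1, NA),
  row.names = c("sp_a", "sp_b", "sp_c", "sp_d")
)

keep_single   <- complete_single(dat, "SVL")            # sp_a, sp_c, sp_d
keep_pairwise <- complete_pairwise(dat, "SVL", "mass")  # sp_a, sp_d
keep_multi    <- complete_multivariate(dat)             # sp_a

# With an ape "phylo" object `tree`, the matching prune could then be:
#   ape::drop.tip(tree, setdiff(tree$tip.label, keep_multi))
```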

lukejharmon commented 9 years ago

@uyedaj thoughts?

lukejharmon commented 9 years ago

Actually this approach won't work with the lapply approach that is used by aceArbor and others. Ergh.

uyedaj commented 9 years ago

If we want to feed in an entire data frame and run the analysis for each column, then the functions need to take care of each column individually (as done in aceArbor), but the other possibilities can be filtered using treeplyFilter right now. We could automate this process with a friendlier wrapper that lets you filter selected columns for NAs and choose a Boolean operator ("or" or "and") to combine them. That way we could cover the possibilities you list.
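A rough sketch of what such a wrapper might look like, in base R. The name `filter_na` and its arguments are hypothetical, and the meaning given to "or" and "and" here is only one possible reading: "or" drops a taxon if any selected column is NA, "and" drops it only if all selected columns are NA. The last part shows the per-column handling needed for an lapply-style workflow.

```r
# Hypothetical wrapper for illustration; not an aRbor or treeplyr function.
# Assumption: `dat` is a data frame whose row names are taxon labels.
filter_na <- function(dat, cols = names(dat), op = c("or", "and")) {
  op <- match.arg(op)
  na_mat <- is.na(dat[, cols, drop = FALSE])  # logical matrix, one row per taxon
  n_na <- rowSums(na_mat)
  # "or": drop if ANY selected column is NA; "and": drop only if ALL are NA.
  drop_row <- if (op == "or") n_na > 0 else n_na == length(cols)
  dat[!drop_row, , drop = FALSE]
}

# Made-up example data.
dat <- data.frame(
  SVL  = c(5.1, NA, 4.8, NA),
  mass = c(2.0, 3.1, NA, NA),
  row.names = c("sp_a", "sp_b", "sp_c", "sp_d")
)

strict  <- filter_na(dat, c("SVL", "mass"), op = "or")   # keeps sp_a
lenient <- filter_na(dat, c("SVL", "mass"), op = "and")  # keeps sp_a, sp_b, sp_c

# Per-column handling: each column gets its own NA filtering, so a taxon
# missing in one column is still used for the others.
cols <- names(dat)
per_column <- lapply(stats::setNames(cols, cols), function(col) {
  filter_na(dat, col)[, col, drop = FALSE]
})
```

Here `per_column$SVL` retains sp_a and sp_c, while `per_column$mass` retains sp_a and sp_b, so each analysis would need its own pruned tree.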

lukejharmon commented 9 years ago

Yeah, that's great. The filtering is not really being used heavily in aceArbor right now, which may be behind persistent issues like #15.

curtislisle commented 9 years ago

Would it be best if we coordinated this character & column management between the Romanesco and aRbor layers? I understand if you guys want to standardize it all at the R level to allow aRbor to be functional outside of Arbor proper, just wondering…

(In reply to Josef Uyeda's comment of Sep 18, 2014, 9:34 AM.)

uyedaj commented 9 years ago

Yes, so here is my thought:

I think we need both. I think it's great if we have common operations on data frames and trees available at the aRbor level (like eliminating NAs, filtering by category, selecting rows by condition, etc.).

These are duplicated right now in my treeplyr functions, and I don't think treeplyr should replace them most of the time. Where the treeplyr functions are really useful is that they can take any R expression, or combination of R expressions, to filter, select, mutate, or apply a function to a data frame, a tree, or a tree plus data frame. This lets an aRbor user quickly apply a function to their data that we wouldn't want to implement as a standalone function because it would be too idiosyncratic to their particular purpose (e.g. 'if(island=='Cuba') {SVL * 10}' because your collaborator who measured Cuban anoles measured in centimeters rather than millimeters). Having a specific function for every imaginable operation isn't feasible.
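To illustrate the idea of passing arbitrary R expressions, here is a small base R sketch. It does not use the treeplyr API; `mutate_expr` and `filter_expr` are hypothetical stand-ins, and the data are made up. The Cuba example is written with a vectorized `ifelse()`, since a bare `if` only tests a single value.

```r
# Hypothetical stand-ins for expression-based helpers; not the treeplyr API.
# Both evaluate an unquoted R expression using the data frame's columns.

mutate_expr <- function(dat, name, expr) {
  # Evaluate `expr` with the columns of `dat` in scope; store it as `name`.
  dat[[name]] <- eval(substitute(expr), dat, parent.frame())
  dat
}

filter_expr <- function(dat, cond) {
  keep <- eval(substitute(cond), dat, parent.frame())
  dat[!is.na(keep) & keep, , drop = FALSE]  # NA conditions count as FALSE
}

# Made-up data: the Cuban measurements are in centimeters, the rest in mm.
dat <- data.frame(
  island = c("Cuba", "Hispaniola", "Cuba"),
  SVL    = c(4.5, 52, 6.5),
  row.names = c("anolis_1", "anolis_2", "anolis_3"),
  stringsAsFactors = FALSE
)

# Convert only the Cuban rows to millimeters.
fixed <- mutate_expr(dat, "SVL_mm", ifelse(island == "Cuba", SVL * 10, SVL))

# Any condition can be used for filtering in the same way.
cuban <- filter_expr(fixed, island == "Cuba" & SVL_mm > 40)
```

A tree-aware version would also drop the filtered-out taxa from the tree, for example with `ape::drop.tip()`.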

curtislisle commented 9 years ago

Agreed. I like the flexibility of having the power at both levels. To me, it seems like Arbor will gradually evolve into having different “collections” of operations. Some will be simple wrappers above the treeplyr/aRbor/rotl layer and others might be more involved at the work step algorithm level. This way there would be simple block collections and “power user” block collections available.

A takeaway for your standup talks today could be a discussion of how to create these separate “collections” of operations.

(In reply to Josef Uyeda's comment of Sep 18, 2014, 10:23 AM.)