caporaso-lab / genome-sampler

https://caporasolab.us/genome-sampler/
BSD 3-Clause "New" or "Revised" License
5 stars 10 forks source link

ENH/API: add a method to compute sliding windows over features #81

Open wasade opened 4 years ago

wasade commented 4 years ago

This pull request adds the ability to compute temporal sliding windows within geographic regions, and is compatible with GISAID metadata. The output of this method can be interpreted as the features observed in a location during a time period, and is one approach to relate features by space and time.

@ebolyen, this is a revision of the code we discussed a month or so ago. Sorry for the delay on submission.

Note: am finalizing testing right now via the command line. There is a hiccup on transformers.

ebolyen commented 4 years ago

Thanks @wasade! If it's alright, I am going to review this after our release for the F1000 revision (we're a little behind). But I will try to get this merged afterwards!

wasade commented 4 years ago

Sounds good, thanks!

ebolyen commented 3 years ago

@wasade would you be open to us adding a feature for a non-sliding window to this PR? It looks like we could just swap out the windowing function for an alternative one and keep most of the implementation the same.

wasade commented 3 years ago

Certainly! The original POC code had a non-sliding window as well, but it didn't seem as exciting (at the time) when we were running it. There was indirection on the group generation code. The prior block is below, which most likely is not a drop in replacement here... Happy to help, or expand this PR but may need a few days

def _group_by_interval(df, days, primary, secondary):
    # for each time interval
    df = df.sort_values('date')
    offset = pd.offsets.Day(days)
    col = 'date'
    for starting_timepoint, timepoint_grp in df.resample(offset, on=col):
        # for each primary
        for prim, prim_grp in timepoint_grp.groupby(primary):
            if secondary is None:
                has_sec = False
            else:
                # count the number of non-null values and check if > 0
                has_sec = (~(prim_grp[secondary].isnull())).sum() > 0

            # if we have secondary location information, and should use, then
            # for each secondary locaiton
            if secondary is not None and has_sec:
                for location, location_grp in prim_grp.groupby(secondary):
                    yield {'primary': prim,
                           'starting_timepoint': starting_timepoint,
                           'secondary': location,
                           'strains': list(location_grp['strain'])}
            else:
                yield {'primary': prim,
                       'starting_timepoint': starting_timepoint,
                       'secondary': None,
                       'strains': list(prim_grp['strain'])}
gregcaporaso commented 3 years ago

Thanks @wasade! We have an application that we're interested in trying this out for, so I may look at integrating this functionality into this PR in the next few days. I'll follow up here when I plan to start - that way if you get to it too we don't end up with conflicts. Does that sound ok to you?

wasade commented 3 years ago

Perfect, thanks!!

On Wed, Nov 18, 2020, 07:49 Greg Caporaso notifications@github.com wrote:

Thanks @wasade https://urldefense.com/v3/__https://github.com/wasade__;!!Mih3wA!WQ-PCmgSmJtZyvfZLRPTMOHZWTYsqJDxLtmL5qZaDQw53ZKgPhKMZMz8mao-mluyRVQ-$! We have an application that we're interested in trying this out for, so I may look at integrating this functionality into this PR in the next few days. I'll follow up here when I plan to start - that way if you get to it too we don't end up with conflicts. Does that sound ok to you?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/caporaso-lab/genome-sampler/pull/81*issuecomment-729768790__;Iw!!Mih3wA!WQ-PCmgSmJtZyvfZLRPTMOHZWTYsqJDxLtmL5qZaDQw53ZKgPhKMZMz8mao-moin2hm7$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/AADTZMSP7ZVTHFDJBIOEYDDSQPUHXANCNFSM4R7KX5EA__;!!Mih3wA!WQ-PCmgSmJtZyvfZLRPTMOHZWTYsqJDxLtmL5qZaDQw53ZKgPhKMZMz8mao-moC6skHT$ .

wasade commented 3 years ago

@gregcaporaso, I looked back and this may just be pretty quick -- I think I can get a commit in over the next 30 minutes

gregcaporaso commented 3 years ago

That'd be amazing! I'll stay tuned.

wasade commented 3 years ago

@gregcaporaso, the current approach will yield a window if anything in that window is present. I recommend looking closely at test_group_by_nonoverlapping_windows_k2, and how, for example, feature "g" groups. At the moment, there isn't a constraint that a window start with a feature, just that it contains a feature.

gregcaporaso commented 3 years ago

Thank you @wasade! I'll have a chance to look at this later today. Will follow up then.

gregcaporaso commented 3 years ago

@wasade, thanks again for getting these changes in! This was what I needed (currently experimenting with the resulting table now).

I had a couple of ideas about the interface for this. What do you think of naming this action something like create-feature-table? I think of this action as defining what the samples and features should be from the metadata, so that seems intuitive to me. Also, what do you think of replacing outer_grouping and inner_grouping with sample_grouping (to align with the feature_grouping parameter), and making that a parameter of type List[Str]. Then could then take 1 or more grouping layers, which would be pretty flexible. If we go that direction, it might make sense to do that with feature_grouping too (that could then be used to group at multiple taxonomic levels). And it might also then make sense for both sample_grouping and feature_grouping to be optional, but require that at least one is specified.

wasade commented 3 years ago

All makes total sense! The naming conventions carried over from the POC which was a definite hack...