caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

Adds one-line call function and examples #32

Closed wdwvt1 closed 8 years ago

wdwvt1 commented 8 years ago

@lkursell @johnchase

Here is a one-call function (_api_gibbs) for source/sink proportion prediction. Examples are in the function documentation; let me know what you think.

Writing this revealed that a large amount of the code could be streamlined with pandas and could take better advantage of ipyparallel.

lkursell commented 8 years ago

@johnchase can you also review this - I want to make sure this format will allow us to streamline ST2 integration into the optimizations we want to look at

johnchase commented 8 years ago

@wdwvt1 the _gibbs function fails if a non-whole number is present in the sink dataframe, but it does not fail if one is present in the source dataframe. The function should probably catch this and raise a ValueError in both situations. Likewise, it should probably catch null values in both situations.
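A minimal sketch of the kind of check suggested here, assuming the sources and sinks are pandas dataframes of counts. The helper name `check_count_table` and the example tables are hypothetical and not part of the ST2 code:

```python
import numpy as np
import pandas as pd


def check_count_table(table, name):
    """Raise ValueError if `table` holds nulls or non-whole-number values.

    Hypothetical helper for illustration; `name` is only used in the message.
    """
    if table.isnull().values.any():
        raise ValueError(f"{name} table contains null values.")
    values = table.to_numpy(dtype=float)
    # A whole number is unchanged by flooring, whatever its dtype.
    if not np.all(values == np.floor(values)):
        raise ValueError(f"{name} table contains non-whole-number counts.")


# Made-up example data: one clean source table, one sink with a fraction.
sources = pd.DataFrame({"otu1": [10, 0], "otu2": [3, 7]}, index=["soil", "gut"])
sinks = pd.DataFrame({"otu1": [4.5], "otu2": [2.0]}, index=["sink1"])

check_count_table(sources, "source")  # passes silently
try:
    check_count_table(sinks, "sink")
except ValueError as err:
    print(err)
```

Applying the same helper to both tables gives the same behavior for sources and sinks, which is the inconsistency pointed out above.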

wdwvt1 commented 8 years ago

@johnchase - good catch. that is caught in the normal function at the script level. basically, when the function gibbs_sampler is called with fractional counts for a sink, the script tries to make an array of fractional length, which it is not so keen on doing. fractional counts in the sources are okay.
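To illustrate the failure mode described here, the following sketch expands per-feature counts into a sequence of feature indices with `np.repeat`. This is an assumption about how such an expansion could look, not the actual gibbs_sampler code; the variable names are made up:

```python
import numpy as np

feature_ids = np.arange(3)

# Whole-number counts expand cleanly: feature 0 twice, feature 2 once.
whole_counts = np.array([2, 0, 1])
sequence = np.repeat(feature_ids, whole_counts)

# Fractional counts cannot be used as repeat counts (or array lengths),
# so NumPy refuses to cast them to integers and raises TypeError.
fractional_counts = np.array([2.5, 0.0, 1.0])
try:
    np.repeat(feature_ids, fractional_counts)
    failed = False
except TypeError:
    failed = True
```

Nothing comparable happens to the sources in this sketch, which matches the observation that only fractional sink counts trigger the error.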

i feel like i should just call np.ceil on the sink dataframe. i like this better than rounding down because any abundance means it was seen and rounding down to zero might remove a singleton feature. but that is a vague reason, and i don't really have any others. happy to round a different way if y'all like
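A small sketch of the difference between the two rounding directions, using a made-up sink table (the feature names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sink with a fractional count, a sub-1 abundance, and a zero.
sinks = pd.DataFrame(
    {"otu1": [4.5], "otu2": [0.4], "otu3": [0.0]}, index=["sink1"]
)

# Rounding up keeps otu2 as a singleton; true zeros stay zero.
rounded_up = np.ceil(sinks).astype(int)

# Rounding down drops otu2 to zero, removing a feature that was observed.
rounded_down = np.floor(sinks).astype(int)
```

Here `rounded_up` holds 5, 1, 0 and `rounded_down` holds 4, 0, 0, so only the ceiling preserves the low-abundance feature.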

lkursell commented 8 years ago

Sure, ceiling is fine. If it’s a singleton it won’t have a massive effect on the probability anyway.

(Sent by email in reply to Will Van Treuren's comment of Apr 20, 2016, 11:29 AM.)

johnchase commented 8 years ago

I may be misunderstanding this, but I can't see how a fractional value would ever be valid. If it is, though, you can call np.ceil directly on the dataframe.

lkursell commented 8 years ago

If a user had run a groupby function, it might return float values for OTU counts. The user should have to deal with this before passing the table to SourceTracker's _gibbs. So if floats are present in the table, raise a ValueError stating so. That way the data isn’t altered internally in the code in a way the user didn’t explicitly request.
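As a sketch of how floats can appear, the example below collapses made-up samples by environment with a pandas groupby. The table, the `env` labels, and the choice of `mean` versus `sum` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical per-sample OTU counts and an environment label per sample.
samples = pd.DataFrame(
    {"otu1": [3, 4, 10], "otu2": [0, 1, 6]},
    index=["s1", "s2", "s3"],
)
env = pd.Series(["soil", "soil", "gut"], index=samples.index)

# Averaging within each environment turns integer counts into floats
# (soil becomes otu1=3.5, otu2=0.5), which would then be rejected.
collapsed = samples.groupby(env).mean()

# Summing instead keeps whole-number integer counts, so the user can
# resolve the issue explicitly before handing the table over.
summed = samples.groupby(env).sum()
```

With this approach the decision about how to get back to whole counts stays with the user, and nothing is rounded silently inside the function.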

(Sent by email in reply to John Chase's comment of Apr 20, 2016, 11:34 AM.)