Closed wdwvt1 closed 8 years ago
@johnchase can you also review this - I want to make sure this format will allow us to streamline ST2 integration into the optimizations we want to look at
@wdwvt1 the _gibbs function fails if a non-whole number is present in the sink dataframe, but it does not fail if one is present in the source dataframe. This function should probably catch this and raise a ValueError in both situations. Likewise, it should probably catch null values in both situations.
@johnchase - good catch. that is caught in the normal function at the script level. basically, when the function gibbs_sampler is called with fractional counts for a sink, the script tries to make an array of fractional length, which it is not so keen on doing. fractional counts in the sources are okay.
i feel like i should just call np.ceil on the sink dataframe. i like this better than rounding down because any abundance means it was seen, and rounding down to zero might remove a singleton feature. but that is a vague reason, and i don't really have any others. happy to round a different way if y'all like
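A rough sketch of the two rounding options, using a made-up sink table (the sample and feature names are placeholders, not from the real data). It only shows how np.ceil and np.floor treat a fractional singleton; it is not the code in this PR.

```python
import numpy as np
import pandas as pd

# Hypothetical sink table: rows are sinks, columns are features (OTUs).
sinks = pd.DataFrame(
    {"otu1": [0.0, 2.5], "otu2": [0.4, 3.0]},
    index=["sink1", "sink2"],
)

# Rounding up keeps any feature that was observed at all.
ceiled = np.ceil(sinks).astype(int)

# Rounding down drops the low-abundance feature (0.4 becomes 0).
floored = np.floor(sinks).astype(int)

print(ceiled)
print(floored)
```

With ceiling, otu2 in sink1 is kept as a count of 1, whereas flooring removes it entirely.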
Sure, ceiling is fine. If it's a singleton it won’t have a massive effect on the probability anyway.
I may be misunderstanding this, but I can't see how a fractional value would ever be valid. But if it is, you can call np.ceil directly on the dataframe.
If a user had run a groupby function, it might return float values for OTU counts. The user should have to deal with this before passing the table into SourceTracker's _gibbs. So if floats are present in the table, raise a ValueError stating so. That way the data isn’t altered internally by the code in a way the user didn’t explicitly ask for.
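Something along these lines could do the check for both tables. This is only a sketch: check_counts is a hypothetical helper name, not an existing function in sourcetracker2, and the tables are made up to show how a groupby mean ends up producing fractional counts.

```python
import numpy as np
import pandas as pd


def check_counts(table, name):
    """Raise ValueError if `table` holds nulls or non-whole-number values.

    Hypothetical helper for illustration; not part of sourcetracker2.
    """
    values = table.to_numpy(dtype=float)
    if np.isnan(values).any():
        raise ValueError(f"{name} table contains null values.")
    if (values != np.floor(values)).any():
        raise ValueError(f"{name} table contains non-whole-number counts.")


# Made-up integer counts: rows are samples, columns are OTUs.
counts = pd.DataFrame({"otu1": [1, 2], "otu2": [0, 3]}, index=["a", "b"])

# Averaging samples within a group yields floats (1.5 in both columns here).
groups = pd.Series(["g", "g"], index=counts.index)
averaged = counts.groupby(groups).mean()

check_counts(counts, "source")  # whole numbers pass silently

try:
    check_counts(averaged, "sink")
except ValueError as err:
    print(err)
```

The same call works for sources and sinks, so the user gets an explicit error in both cases and decides how to round.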
@lkursell @johnchase
Here is a single function call (_api_gibbs) for source/sink proportion prediction. Examples are in the function documentation; let me know what you think. Writing this revealed that a large amount of the code could be streamlined with Pandas and could take better advantage of ipyparallel.