eltonlaw / impyute

Data imputations library to preprocess datasets with missing data
http://impyute.readthedocs.io/
MIT License
354 stars 49 forks source link

Multivariate Imputation by Chained Equations is going to return only mean value of the Column. #43

Closed PavanTejaDokku closed 5 years ago

PavanTejaDokku commented 5 years ago

Issue: In the module "impyute.mice" we were expecting to return the imputed value for each column based on the linear equation converged for the column. In contrast we are getting only the mean values of the column.

Reason: Here we have a logic of entering into loop only if we satisfy below condition, however we have a glitch in the condition which is making us to skip it and return the mean values imputed data set straight away.

Condition Failing: _converged = [False] * len(nullxyv) while all(converged): ..................... ....................

Resolution:

_converged = [False] * len(nullxyv) while not(all(converged)): ..................... ....................

eltonlaw commented 5 years ago

woah thanks for catching that, do you want to send in a PR?

PavanTejaDokku commented 5 years ago

sure, will let u know when I am ready

PavanTejaDokku commented 5 years ago

https://github.com/PavanTejaDokku/impyute/pull/1 Raised pull request, Let me know if u need more details.

eltonlaw commented 5 years ago

Fantastic, thanks for doing that. It seems you uncovered some other issues with the implementation as well. Nice find on the zero division and the absolute value of the delta. The PDF explaining the changes which was very helpful. Before I merge this though, there are a few changes I would like you to make:

  1. Remove the first 6 lines of the commit with the @author stuff, I'd prefer not to have that in the source code. Please add yourself to the AUTHORS.rst instead. This, as is everything is, is up for discussion, I don't know... what are your thoughts. We can also have credits in the README or something like that.
  2. With the docstrings please follow the numpydoc specification. The documentation for readthedocs is generated from the docstrings and it looks for numpydoc format.
  3. In general the use of globals() is frowned upon, it pollutes the global namespace and can cause unexpected issues down the line. For more info, check out this [stack overflow discussion] (https://stackoverflow.com/questions/19158339/why-are-global-variables-evil). But why are you
  4. At line 76, what was the rationale behind swapping out the for i, x_i, y_i, value in enumerate(null_xyv): with for i, z in enumerate(null_xyv): and manually unwrapping it. They should be equivalent right? And commenting out the #temp.append([x_i, y_i, new_value]) would just leave the temp variable as an empty list making null_xyv an empty list at the end of the first inner loop null_xyv = temp
  5. Lines 92-93: The else: converged = False lines are unneccesary I think.
  6. Could you update the unit tests for the function with the values you have in the PDF? To run the unit tests you will need to run make test in the console and have Docker installed. I can help guide you through the process if you can't figure it out.

By the way, the pull request is only on your fork at the moment. I can't merge it unless you send it back to this repo. Check out Part 2 of the accepted answer of this stack overflow discussion on how to send your pull request back to the original repo. Also there's a CONTRIBUTING guide here

Thanks again for all your work, I really appreciate it!

PavanTejaDokku commented 5 years ago

Submitted pull request, please let me know the feedback.