Closed by jmmcd 2 years ago
I'm torn on this one.
What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?
Would we have to use loc etc. instead of simply x[0]? If it's a small change, we can document it in the README.
I suppose the bigger question is: what would this proposed change gain us over what we currently have?
Just some thoughts to get the discussion rolling.
Dave
On Wed, 2 Jun 2021 at 17:34, James McDermott @.***> wrote:
In #129 https://github.com/PonyGE/PonyGE2/issues/129 we are discussing the X dataset being transposed by PonyGE (relative to the Scikit-Learn convention).
I see that we do indeed transpose the data here https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/src/utilities/fitness/get_data.py#L60 .
I think the motivation here is that we can write a grammar which will work correctly whether processing a single row or a dataset. Eg in Vladislavleva4 we have x[0]|x[1]|x[2]|x[3]|x[4] https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/grammars/supervised_learning/Vladislavleva4.bnf#L10. With transposed data, this works.
But it is different from the convention used by Scikit-Learn, Tensorflow, etc. Should we consider a change here?
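To make the motivation above concrete, here is a minimal NumPy sketch of why x[0]-style indexing works on both a single row and a transposed dataset. The array X and the function phenotype_transposed are hypothetical stand-ins for a loaded dataset and an evolved expression; they are not taken from the PonyGE2 code.

```python
import numpy as np

# Hypothetical data: 5 samples (rows) x 3 features (columns), Scikit-Learn layout.
X = np.arange(15, dtype=float).reshape(5, 3)


def phenotype_transposed(x):
    # Stand-in for an evolved expression built from x[0], x[1], ...
    return x[0] + 2.0 * x[1]


# Transposed dataset, shape (n_features, n_samples):
# x[0] is the whole first feature, so the result has one value per sample.
y_dataset = phenotype_transposed(X.T)

# Single row, shape (n_features,): x[0] is a scalar, so the result is a scalar.
y_row = phenotype_transposed(X[0])
```

The same expression handles both cases because indexing the first axis selects a feature either way; the cost is that X no longer has the (n_samples, n_features) shape that Scikit-Learn code expects.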
What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?
We are just using Numpy, not Pandas, so no loc. I think we would be removing the transpose and changing the grammars to say x[:, 0] etc. And if someone wanted to run the function on a single row of data x, they'd have to reshape it with x.reshape((1, len(x))) or similar.
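As a rough illustration of the change described above, the sketch below uses the untransposed (n_samples, n_features) layout with x[:, 0]-style indexing, and reshapes a single row before evaluating it. X and phenotype_sklearn are hypothetical names used only for this example, not part of PonyGE2.

```python
import numpy as np

# Hypothetical data: 5 samples (rows) x 3 features (columns), no transpose.
X = np.arange(15, dtype=float).reshape(5, 3)


def phenotype_sklearn(x):
    # Stand-in for an evolved expression using column indexing x[:, 0], x[:, 1], ...
    return x[:, 0] + 2.0 * x[:, 1]


# Whole dataset: one output per sample.
y_dataset = phenotype_sklearn(X)

# Single row: 1-D input must become shape (1, n_features) first,
# otherwise x[:, 0] raises an IndexError.
row = X[0]
y_single = phenotype_sklearn(row.reshape((1, len(row))))
```

np.atleast_2d(row) gives the same (1, n_features) shape and may read more clearly in user code.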
would this proposed change gain us over what we currently have
Nothing! Well, just it would stick to the convention, so possibly easier for users writing custom code as in #129.
Seems a small enough change, to be fair, and a simple note in the documentation saying we moved from x[0] to x[:, 0] style indexing should cover it.
I’m easy either way
Dave
On Fri 4 Jun 2021 at 11:01, James McDermott @.***> wrote:
What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?
We are just using Numpy, not Pandas, so no loc. I think we would be removing the transpose and changing the grammars to say x[:, 0] etc. And if someone wanted to run the function on a single row of data x, they'd have to reshape it with x.reshape((1, len(x))) or similar.
would this proposed change gain us over what we currently have
Nothing! Well, just it would stick to the convention, so possibly easier for users writing custom code as in #129 https://github.com/PonyGE/PonyGE2/issues/129.
I think we should go ahead with this. I think the sklearn standard would be good to align with, more generally (also eventually inheriting from RegressorMixin etc). I'm planning to use PonyGE for some symbolic regression problems in the next few weeks so I have some time to make the changes and mop up any problems.
In #129 we are discussing the X dataset being transposed by PonyGE (relative to the Scikit-Learn convention).
I see that we do indeed transpose the data here https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/src/utilities/fitness/get_data.py#L60.
I think the motivation here is that we can write a grammar which will work correctly whether processing a single row or a dataset. Eg in Vladislavleva4 we have x[0]|x[1]|x[2]|x[3]|x[4] https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/grammars/supervised_learning/Vladislavleva4.bnf#L10. With transposed data, this works.
But it is different from the convention used by Scikit-Learn, Tensorflow, etc. Should we consider a change here?