Transposed data for supervised learning

jmmcd commented 3 years ago

In #129 we are discussing the X dataset being transposed by PonyGE (relative to the Scikit-Learn convention).

I see that we do indeed transpose the data here https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/src/utilities/fitness/get_data.py#L60.

I think the motivation here is that we can write a grammar which will work correctly whether processing a single row or a dataset. E.g. in Vladislavleva4 we have x[0]|x[1]|x[2]|x[3]|x[4] https://github.com/PonyGE/PonyGE2/blob/2e0806f5ad42540c34b83eaf65d8301eec31cf29/grammars/supervised_learning/Vladislavleva4.bnf#L10. With transposed data, this works: x[0] selects the first feature in both cases.
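To make the point concrete, here is a minimal NumPy sketch. The `phenotype` function and the toy data are hypothetical stand-ins for an evolved expression and a dataset; they are not taken from PonyGE2. It shows the same `x[0]`-style expression evaluating on a transposed dataset and on a single row.

```python
import numpy as np


def phenotype(x):
    # Hypothetical evolved expression built from the grammar's x[0], x[1] terminals.
    return x[0] + 2.0 * x[1]


# Scikit-Learn layout: one row per sample, one column per feature.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Transposed layout: X_t[i] holds feature i for every sample.
X_t = X.T

batch = phenotype(X_t)    # one prediction per sample, shape (3,)
single = phenotype(X[0])  # scalar prediction for the first sample
```

With the untransposed `X`, `x[0]` would instead select the first sample, so the expression would silently compute something different.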

But it is different from the convention used by Scikit-Learn, TensorFlow, etc. Should we consider a change here?

dvpfagan commented 3 years ago

I'm torn on this one.

What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?

Would we have to use loc etc. instead of simply x[0]? If it's a small change, we can document it in the README.

I suppose the bigger question is what would this proposed change gain us over what we currently have.

Just some thoughts to get the discussion rolling

Dave

jmmcd commented 3 years ago

> What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?

We are just using NumPy, not Pandas, so no loc. I think we would be removing the transpose and changing the grammars to say x[:, 0] etc. And if someone wanted to run the function on a single row of data x, they'd have to reshape it with x.reshape((1, len(x))) or similar.
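A rough sketch of what that would look like, assuming plain NumPy arrays in the Scikit-Learn layout. The `phenotype` function and the toy data are hypothetical, not PonyGE2 code.

```python
import numpy as np


def phenotype(x):
    # Hypothetical evolved expression using the proposed x[:, i] terminals.
    return x[:, 0] + 2.0 * x[:, 1]


# Scikit-Learn layout, no transpose: shape (n_samples, n_features).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

batch = phenotype(X)  # one prediction per sample, shape (3,)

# A single row is 1-D, so x[:, 0] would raise an IndexError on it.
# Reshape it to a one-row 2-D array first.
row = X[0]
single = phenotype(row.reshape((1, len(row))))  # shape (1,)
```

`np.atleast_2d(row)` or `row.reshape(1, -1)` would do the same job as the explicit reshape.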

> would this proposed change gain us over what we currently have

Nothing! Well, it would just stick to the convention, so it would possibly be easier for users writing custom code as in #129.

dvpfagan commented 3 years ago

Seems a small enough change, to be fair, and a simple note in the documentation saying we moved from x[0] to x[:, 0] style indexing should cover it.

I’m easy either way

Dave

jmmcd commented 2 years ago

I think we should go ahead with this. More generally, I think the sklearn standard would be good to align with (eventually also inheriting from RegressorMixin etc.). I'm planning to use PonyGE for some symbolic regression problems in the next few weeks, so I have some time to make the changes and mop up any problems.

Transposed data for supervised learning #130