SkBlaz / SGE

Symbolic node embedding

Try with a compression-based algorithm? #1

Open · remiadon opened this issue 3 years ago

remiadon commented 3 years ago

I can see you are using pyfim to perform frequent pattern mining here https://github.com/SkBlaz/SGE/blob/5127d851b630262da72b015ea3a4914f2ff169fa/SGE.py#L44

and one-hot encode the result to get a transformer (GarVectorizer)

Have you tried replacing frequent pattern mining with its compression-based counterpart? Using scikit-mine's SLIM miner (https://scikit-mine.github.io/scikit-mine/reference/itemsets.html#slim) would certainly reduce the dimension of the output (the max_features arg would be required), and speed up both the discovery and transform runtimes.

EDIT: I am working on making SLIM deal with high dimensions; this is a WIP for now.
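
For concreteness, a minimal sketch of the proposed swap (not the actual SGE code: `transactions` is a placeholder for the per-node symbol lists SGE mines over, and the miner parameters are illustrative; check the pyfim and skmine docs for the exact signatures in your versions):

```python
# Hedged sketch only: swapping pyfim's fpgrowth for skmine's SLIM.
# `transactions` is assumed to be a list of lists of hashable items,
# the same input format both miners consume.
from fim import fpgrowth          # current miner used in SGE.py
from skmine.itemsets import SLIM  # proposed compression-based miner

# Current approach: frequent itemsets (their count explodes as support drops).
frequent = fpgrowth(transactions, supp=2, zmin=2)

# Proposed: SLIM keeps only itemsets that help compress the database,
# so the mined "vocabulary" is typically much smaller.
slim = SLIM().fit(transactions)
codetable = slim.discover()  # itemset -> usage; exact return type may vary by skmine version

print(f"{len(frequent)} frequent itemsets vs {len(codetable)} SLIM itemsets")
```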

SkBlaz commented 3 years ago

Thanks for this suggestion; I was not aware of scikit-mine at the time (note that this paper was published last year, I believe; we merely scratched the surface of what is possible, I'd say). If you try this, I'd be interested in seeing the performance/results. Are you working on this?

remiadon commented 3 years ago

@SkBlaz I am working on making the SLIM miner from skmine more robust when working with an alphabet of huge size: the number of items I found when re-running your experiment is 1,041,597.

Once I'm sure SLIM is able to fit on alphabets of this size, I'll try it as a drop-in replacement for pyfim.fpgrowth.

How could I know whether my version provides a "better" embedding, then?

SkBlaz commented 3 years ago

Via downstream evaluation: you construct the embedding, then predict the label for each node from its embedding with, e.g., logistic regression. We've refurbished SGE a bit; all the code you need to evaluate a given representation is available and documented here: https://github.com/smeznar/SNoRe. I think that if you can achieve performance similar to SNoRe, this would be very interesting!
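
To make that concrete, here is a minimal sketch of such a downstream evaluation; `X` (the embedding matrix) and `y` (the node labels) are assumed given, and the split and scorer choices are illustrative rather than the actual SNoRe protocol:

```python
# Minimal downstream-evaluation sketch (illustrative, not the SNoRe pipeline).
# X: (n_nodes, n_features) embedding matrix; y: node labels. Both assumed given.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("micro-F1:", f1_score(y_test, clf.predict(X_test), average="micro"))
```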
