chuanconggao / PrefixSpan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.
https://git.io/prefixspan
MIT License
414 stars 92 forks

Inclusion in scikit-mine #33

Open remiadon opened 4 years ago

remiadon commented 4 years ago

Hi there, this is a very nice and rich implementation of these 3 algorithms.

The INRIA center at Rennes is creating a new Python library, namely scikit-mine, to centralise pattern mining methods and improve interoperability and consistency with other fields, such as Machine Learning.

Your API already has similarities with what scikit-mine provides, being:

In the context of scikit-mine, only BIDE and FEAT would be nice to have, as PrefixSpan mines too many patterns, and we encourage concise representations.
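To make the conciseness argument concrete, here is a minimal brute-force sketch. It is not PrefixSpan or BIDE, and it is not this repository's code: it simply enumerates every frequent subsequence of a toy database and then keeps only the closed ones, i.e. those with no longer super-pattern of equal support. The function names, the toy data and the `max_len` cap are assumptions made for illustration.

```python
from itertools import combinations


def subsequences(seq, max_len):
    """Distinct (possibly non-contiguous) subsequences of seq, up to max_len items."""
    found = set()
    for n in range(1, min(max_len, len(seq)) + 1):
        for idx in combinations(range(len(seq)), n):
            found.add(tuple(seq[i] for i in idx))
    return found


def frequent_patterns(db, minsup, max_len=5):
    """Map each pattern to its support (number of sequences containing it)."""
    support = {}
    for seq in db:
        # Each sequence contributes at most once per pattern.
        for pat in subsequences(seq, max_len):
            support[pat] = support.get(pat, 0) + 1
    return {p: s for p, s in support.items() if s >= minsup}


def is_subsequence(small, big):
    """True if small can be obtained from big by deleting items."""
    it = iter(big)
    return all(item in it for item in small)


def closed_patterns(frequent):
    """Keep patterns that have no strictly longer super-pattern with equal support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(
            len(q) > len(p) and t == s and is_subsequence(p, q)
            for q, t in frequent.items()
        )
    }


# Hypothetical toy database of integer-encoded sequences.
db = [[0, 1, 2], [0, 1, 2, 3], [0, 1]]
all_patterns = frequent_patterns(db, minsup=2)
closed = closed_patterns(all_patterns)
```

On this toy database, the 7 frequent patterns reduce to 2 closed ones, `(0, 1)` with support 3 and `(0, 1, 2)` with support 2, without losing any support information.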

I also plan to try FEAT as a candidate generator for SQS-candidates, an algorithm based on MDL. For this purpose, handling gaps would be required, as SQS natively accounts for them when running its optimization process.

Would anyone be able to provide support for integration into scikit-mine?
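One common way to formalise gaps is a maximum-gap constraint between consecutive matched items. The sketch below checks such a constraint for a single pattern and sequence. It is only an assumption about what "handling gaps" could mean here: it is not how SQS scores gaps, and the function name `occurs_with_max_gap` is hypothetical.

```python
def occurs_with_max_gap(pattern, seq, max_gap):
    """True if pattern occurs in seq as a subsequence with at most
    max_gap skipped items between two consecutive matched items."""
    if not pattern:
        return True
    # Positions in seq where a match of the pattern prefix seen so far can end.
    ends = {i for i, item in enumerate(seq) if item == pattern[0]}
    for target in pattern[1:]:
        ends = {
            j
            for i in ends
            # j - i - 1 skipped items must not exceed max_gap.
            for j in range(i + 1, min(i + 2 + max_gap, len(seq)))
            if seq[j] == target
        }
        if not ends:
            return False
    return bool(ends)
```

Tracking every possible end position, rather than greedily taking the first match, matters: with `max_gap=0`, the pattern `a b c` does occur in `a b x a b c`, but only through the second `a`.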

chuanconggao commented 3 years ago

Hi, it is an interesting idea.

As the maintainer, I may provide some help, but it requires some effort estimation. Is there any planning or scope so far?

remiadon commented 3 years ago

@chuanconggao thanks for responding,

I think we can start with something "simple"

  1. integrating BIDE, w.r.t. the functional definition in skmine
  2. unit tests for BIDE
  3. adding relevant example / doc for it
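The three steps above could be sketched roughly as follows: a thin estimator-style wrapper plus a unit test. The class name `SequenceMinerWrapper`, the `fit` / `discover` method pair, and the `(support, pattern)` result shape are assumptions for illustration, not the actual skmine interface or the code of the eventual PR. The mining function is injected so the example stays self-contained.

```python
class SequenceMinerWrapper:
    """Hypothetical adapter around any function that mines sequential patterns.

    mine_fn(db, min_supp) is assumed to return (support, pattern) pairs.
    With the prefixspan package, something like
    `lambda db, s: PrefixSpan(db).frequent(s, closed=True)` might be plugged in,
    assuming that package's documented interface.
    """

    def __init__(self, mine_fn, min_supp=2):
        self.mine_fn = mine_fn
        self.min_supp = min_supp
        self.patterns_ = None

    def fit(self, D):
        """Mine patterns from an iterable of sequences and store them."""
        db = [list(seq) for seq in D]
        self.patterns_ = [
            (support, tuple(pattern))
            for support, pattern in self.mine_fn(db, self.min_supp)
        ]
        return self

    def discover(self):
        """Return mined patterns, most frequent first, then by pattern order."""
        if self.patterns_ is None:
            raise RuntimeError("call fit() before discover()")
        return sorted(self.patterns_, key=lambda sp: (-sp[0], sp[1]))


def fake_closed_miner(db, min_supp):
    """Stub standing in for a real closed-pattern miner in the unit test."""
    return [(2, [0, 1, 2]), (3, [0, 1])]


def test_wrapper_orders_by_support():
    miner = SequenceMinerWrapper(fake_closed_miner, min_supp=2)
    result = miner.fit([[0, 1, 2], [0, 1, 2, 3], [0, 1]]).discover()
    assert result == [(3, (0, 1)), (2, (0, 1, 2))]


test_wrapper_orders_by_support()
```

Injecting the miner keeps the wrapper testable without the real algorithm, so the unit tests for the adapter and for BIDE itself can evolve separately.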

A rough estimate for this would be 2 weeks (on my side), with daily feedback from you.

Once this is done, I would add another task.

chuanconggao commented 3 years ago

Sounds good.

On my side, I can start some refactoring to make it more robust.

remiadon commented 3 years ago

@chuanconggao I wrapped your code and added a few unit tests in the PR mentioned above.

What sort of refactoring were you thinking of?

My implementation is still missing the top-k part, though, which would be interesting to have.
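For reference, the simplest form of top-k is a post-hoc selection over already mined patterns, sketched below with the standard library. This is an assumption about the intended behaviour, not the wrapped library's approach: a dedicated top-k miner would typically raise the support threshold during the search instead of filtering afterwards. The `top_k` helper and the tie-breaking rule (longer patterns first) are hypothetical.

```python
import heapq


def top_k(patterns, k):
    """Select the k best (support, pattern) pairs.

    Ranking: higher support first; among equal supports, longer patterns first.
    """
    return heapq.nlargest(k, patterns, key=lambda sp: (sp[0], len(sp[1])))


# Hypothetical mined output as (support, pattern) pairs.
mined = [(2, (0, 1, 2)), (3, (0, 1)), (2, (2,)), (3, (0,))]
best = top_k(mined, 2)
```

Here `best` holds the two patterns with support 3, `(0, 1)` ahead of `(0,)` because of the length tie-break.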

chuanconggao commented 3 years ago

The refactoring part is mostly docstrings, type annotations, and bug fixes.