TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License
97 stars 23 forks source link

Added SignatureVectorizer (iisignature) #108

Closed jh83775 closed 1 year ago

jh83775 commented 2 years ago

I've implemented SignatureVectorizer, which returns the path signatures for a collection of paths.

This vectorizer essentially wraps the iisignature package such that it fits into the standard sklearn style fit_transform pipeline. While it does require iisignature, the imports are written such that the rest of the library can still be used if the user does not have iisignature installed.

For more details on the path signature technique, I've found this paper quite instructive: A Primer on the Signature Method in Machine Learning (Chevyrev, I.)

codecov-commenter commented 2 years ago

Codecov Report

Merging #108 (d85ceb1) into master (80f5319) will increase coverage by 0.28%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #108      +/-   ##
==========================================
+ Coverage   90.85%   91.13%   +0.28%     
==========================================
  Files          32       34       +2     
  Lines        4649     4774     +125     
==========================================
+ Hits         4224     4351     +127     
+ Misses        425      423       -2     
Impacted Files Coverage Δ
vectorizers/__init__.py 100.00% <100.00%> (ø)
vectorizers/signature_vectorizer.py 100.00% <100.00%> (ø)
vectorizers/tests/test_signature_vectorizer.py 100.00% <100.00%> (ø)
vectorizers/preprocessing.py 88.51% <0.00%> (+0.95%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

lmcinnes commented 2 years ago

This looks like a good addition. Would you be willing to add some basic tests, and possibly even a tutorial guide for the documentation?

Also is it worth adding iisignature to the requirements?

jh83775 commented 2 years ago

No problem, I should have some time over the next few weeks to add tests and tutorial guides.

Happy to add iisignature to the requirements too - the current implementation was in case we wanted to avoid having too many dependencies, but if that's not an issue I'll add it in directly.

lmcinnes commented 2 years ago

Thanks -- I know the tests and documentation are a lot of extra work, but they are very valuable for maintenance, and for ensuring users can find and use the feature, so it will definitely be worth the effort in the long run.

As far as dependencies go ... I haven't checked what iisignature requires. If it is reasonably self contained then one extra dependency is probably fine for now.

jh83775 commented 1 year ago

I have added an example notebook and some tests for the SignatureVectorizer. I'll add more functionality (e.g. common path transforms) and more examples in the future too, but this should suffice as a good starting point.

I had some trouble getting the automated tests to work. For some reason including iisignature in the requirements file broke something (error messages suggested it was being installed before numpy and thus breaking since numpy is a dependancy?). I've commented out iisignature in the requirements.txt for now, and added it explicitly to azure-pipelines.yml, which seems to work. Do you have any suggestions? I suspect there might be something trivial which I've missed. Edit: Seems like iisignature imports numpy in its setup.py file, and this was causing problems in the automated test pipelines. I've re-added iisignature to requirements.txt and put an explicit pip install numpy in azure-pipelines.yml just before the pip install -r requirements.txt, just so its definitely present on the system before that call. I think this resolves the issue I was having with minimal changes to the existing code.

lmcinnes commented 1 year ago

Thanks for this -- as I say it should be very valuable for a great many people and is a great complement to the techniques we had so far. I'm looking forward to putting this to use myself.