VowpalWabbit / coba

Contextual bandit benchmarking
https://coba-docs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

skip processing when reading openml datasets #18

Closed lalo closed 2 years ago

lalo commented 2 years ago

Is this functionality specified elsewhere? I hope I'm not duplicating some other config/knob.

codecov[bot] commented 2 years ago

Codecov Report

Merging #18 (9bd9ba8) into master (19af221) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master      #18   +/-   ##
=======================================
  Coverage   99.85%   99.85%           
=======================================
  Files          49       49           
  Lines        4833     4837    +4     
=======================================
+ Hits         4826     4830    +4     
  Misses          7        7           
Flag       Coverage Δ
unittest   99.85% <100.00%> (+<0.01%) ↑


Impacted Files                          Coverage Δ
coba/environments/simulated/openml.py   100.00% <100.00%> (ø)


mrucker commented 2 years ago

Nice. Nope, this doesn't exist anywhere already. One small thing I might suggest is moving the "raw" parameter to the constructor. That would conform to the pattern everything else in coba follows (i.e., strict adherence to interfaces, with optional parameterization going into constructors). You can already see that pattern in OpenmlSource's constructor, which takes several other existing configuration options.
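The convention described above can be sketched in miniature. Note this is a hypothetical illustration of the pattern, not coba's actual OpenmlSource code; the class name, fields, and data are invented for the example.

```python
# Sketch of the coba convention: optional configuration goes into the
# constructor, while the interface method (here, read()) stays
# parameter-free. All names and data below are illustrative only.

class OpenmlSourceSketch:
    def __init__(self, data_id: int, raw: bool = False):
        # optional behavior ("raw") is fixed at construction time,
        # so every source exposes the same zero-argument read()
        self._data_id = data_id
        self._raw = raw

    def read(self):
        # strict interface: no per-call knobs
        rows = [{"a": "1", "b": "2"}]  # stand-in for downloaded data
        if self._raw:
            return rows  # skip post-processing entirely
        # normal path: apply type conversion / processing
        return [{k: float(v) for k, v in r.items()} for r in rows]
```

Keeping knobs out of read() means any pipeline component can consume any source interchangeably, which is the "strict adherence to interfaces" being referred to.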

Also, for what it's worth, anytime I've wanted raw output I've usually just used the ArffReader directly, though that way you don't get caching, which this change would provide. If you wanted to use ArffReader directly you could do something like Pipes.join(UrlSource(<file or http url>), ArffReader()).read() or ArffReader().filter(UrlSource(<file or http url>).read()).
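The equivalence of the two call forms above can be shown with a minimal stand-in implementation. The classes below (ListSource, UpperFilter, and the Pipes.join here) are simplified sketches, not coba's real UrlSource, ArffReader, or Pipes; they only illustrate the source/filter composition.

```python
# Minimal sketch of the source/filter pipeline pattern: a source has
# read(), a filter has filter(items), and joining them yields a new
# source whose read() feeds the source through the filter.

class ListSource:
    def __init__(self, lines):
        self._lines = lines

    def read(self):
        return iter(self._lines)

class UpperFilter:
    def filter(self, items):
        return (i.upper() for i in items)

class Pipes:
    @staticmethod
    def join(source, flt):
        # wrap source + filter into an object with the source interface
        class _Joined:
            def read(self):
                return flt.filter(source.read())
        return _Joined()

# form 1: join the pieces, then read
a = list(Pipes.join(ListSource(["@relation test", "@data"]), UpperFilter()).read())
# form 2: read the source, then filter the items
b = list(UpperFilter().filter(ListSource(["@relation test", "@data"]).read()))
```

Both forms produce the same items; join() just packages the composition behind the source interface so it can be passed around (and, in coba, cached) as a single unit.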

lalo commented 2 years ago

It's not only the cache, but also some bits of logic around the dataset type and the dropping of columns. I've shuffled things around based on your comment.

mrucker commented 2 years ago

Nice Nice, love it.