Closed JaredSchwartz closed 3 weeks ago
Related to #9: sampling before loading the transactions could also be useful. I don't currently have a good view of how the package works internally, but if we could sample from the file itself, I think it would be very useful.
If you sample during the file read, you may not see all of the items. While the results from the algorithms would be the same as filtering the transactions after reading, I'd rather people get the same number of columns in their Transactions object no matter what seed they're using.
If I'm following you, that could be solved by storing all unique items in a set while doing the scan for sampling.
But I wonder: why does it matter to have the same number of columns in the Transactions object?
It's important to have all items in the dataset regardless of the sampling method, for reproducibility/replicability. Sampling artificially alters the items' distribution, but the items are technically still present in the data. If you sampled the rows of a matrix or dataframe, you wouldn't also delete the columns that happened to have no values before doing exploratory analysis.
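The two ideas above can coexist: a single pass over the file can reservoir-sample rows while still recording every unique item seen, so the sampled result keeps the full item universe. A minimal sketch in Python (illustrative only; the function name and the comma-separated line format are assumptions, not this package's API):

```python
import random

def sample_transactions(lines, k, seed=0):
    """Reservoir-sample k transactions from an iterable of lines while
    recording every unique item seen during the full scan.
    Hypothetical helper; not part of the package."""
    rng = random.Random(seed)
    all_items = set()   # full item universe, independent of the sample/seed
    sample = []
    for n, line in enumerate(lines):
        items = line.strip().split(",")   # assumed one transaction per line
        all_items.update(items)           # every item survives the scan
        if len(sample) < k:
            sample.append(items)
        else:
            j = rng.randint(0, n)         # classic reservoir replacement
            if j < k:
                sample[j] = items
    return sample, all_items

rows, items = sample_transactions(["a,b", "b,c", "a,d", "c,d"], k=2, seed=42)
```

Whatever the seed, `items` always contains every item in the file, so the resulting object would have a stable set of columns even though only `k` rows are kept.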
I'm not going to write sampling into the reader at the moment. It's too much extra work, and an implementation that samples the Transactions object is a better focus for my effort at this time.
If you or anyone else wants to write a pull request, I'd welcome it!
Implement chunking on file I/O
- An `nlines` integer kwarg that, combined with the existing `skiplines` kwarg, allows for pulling specific chunks from the file.
- Investigate adding another method, or the addition of a `chunks` kwarg, where an integer value for `chunks` can be passed that returns an iterator over each of the chunks.
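Both proposed interfaces can be sketched in a few lines. The following Python is a rough illustration of the semantics (the function names are hypothetical, and the real implementation would presumably stream from disk rather than read everything up front):

```python
from itertools import islice

def read_chunk(lines, nlines, skiplines=0):
    """Sketch of the nlines/skiplines combination: skip `skiplines`
    entries, then return the next `nlines`."""
    return list(islice(lines, skiplines, skiplines + nlines))

def iter_chunks(lines, chunks):
    """Sketch of the proposed `chunks` kwarg: split the input into
    `chunks` near-equal pieces and yield them one at a time."""
    lines = list(lines)
    size, extra = divmod(len(lines), chunks)
    start = 0
    for i in range(chunks):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        yield lines[start:end]
        start = end

demo = [f"t{i}" for i in range(10)]
one_chunk = read_chunk(iter(demo), nlines=3, skiplines=2)  # t2, t3, t4
sizes = [len(c) for c in iter_chunks(demo, chunks=3)]      # 4, 3, 3
```

The `nlines`/`skiplines` form lets a caller pull an arbitrary window of the file, while the iterator form is convenient for processing a large file chunk by chunk without choosing offsets manually.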