Closed JaredSchwartz closed 3 weeks ago
Related to #9: sampling before loading the transactions could also be useful. I don't currently have a good view of how the package works internally, but if we could sample from the file itself, I think it would be very useful.
If you sample during the file read, you may not see all of the items. While the results from the algorithms would be the same as filtering the transactions after reading, I'd rather people get the same number of columns in their Transactions object no matter what seed they're using.
If I'm following you, that could be solved by storing all unique items in a set while doing the scan for sampling.
But I wonder: why does it matter to have the same number of columns in the Transactions object?
It's important to have all items in the dataset regardless of the sampling method, for reproducibility/replicability. Sampling artificially alters the items' distribution, but the items are technically still present in the data. If you sampled the rows of a matrix or dataframe, you wouldn't also delete the columns that happened to have no values before doing exploratory analysis.
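The two ideas above can coexist: a single pass over the file can reservoir-sample rows while still recording every unique item seen, so the sampled result keeps the full item universe. A minimal sketch in Python (illustrative only; the function name and the comma-separated line format are assumptions, not this package's API):

```python
import random

def sample_transactions(lines, k, seed=0):
    """Reservoir-sample k transactions from an iterable of lines while
    recording every unique item seen during the full scan.
    Hypothetical helper; not part of the package."""
    rng = random.Random(seed)
    all_items = set()   # full item universe, independent of the sample/seed
    sample = []
    for n, line in enumerate(lines):
        items = line.strip().split(",")   # assumed one transaction per line
        all_items.update(items)           # every item survives the scan
        if len(sample) < k:
            sample.append(items)
        else:
            j = rng.randint(0, n)         # classic reservoir replacement
            if j < k:
                sample[j] = items
    return sample, all_items

rows, items = sample_transactions(["a,b", "b,c", "a,d", "c,d"], k=2, seed=42)
```

Whatever the seed, `items` always contains every item in the file, so the resulting object would have a stable set of columns even though only `k` rows are kept.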
I'm not going to write sampling into the reader at the moment. It's too much extra work, and an implementation that samples the Transactions object is a better focus for my effort at this time.
If you or anyone else wants to write a pull request, I'd welcome it!
Implement chunking on file I/O
- An `nlines` integer kwarg that, combined with the existing `skiplines` kwarg, allows for pulling specific chunks from the file.
- Investigate adding another method, or the addition of a `chunks` kwarg, where an integer value for `chunks` can be passed that returns an iterator over each of the chunks.
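Both proposed interfaces can be sketched in a few lines. The following Python is a rough illustration of the semantics (the function names are hypothetical, and the real implementation would presumably stream from disk rather than read everything up front):

```python
from itertools import islice

def read_chunk(lines, nlines, skiplines=0):
    """Sketch of the nlines/skiplines combination: skip `skiplines`
    entries, then return the next `nlines`."""
    return list(islice(lines, skiplines, skiplines + nlines))

def iter_chunks(lines, chunks):
    """Sketch of the proposed `chunks` kwarg: split the input into
    `chunks` near-equal pieces and yield them one at a time."""
    lines = list(lines)
    size, extra = divmod(len(lines), chunks)
    start = 0
    for i in range(chunks):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        yield lines[start:end]
        start = end

demo = [f"t{i}" for i in range(10)]
one_chunk = read_chunk(iter(demo), nlines=3, skiplines=2)  # t2, t3, t4
sizes = [len(c) for c in iter_chunks(demo, chunks=3)]      # 4, 3, 3
```

The `nlines`/`skiplines` form lets a caller pull an arbitrary window of the file, while the iterator form is convenient for processing a large file chunk by chunk without choosing offsets manually.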