data-forge / data-forge-ts

The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
http://www.data-forge-js.com/
MIT License

Support for reading from buffer/streams? #79

Open empz opened 4 years ago

empz commented 4 years ago

I see data-forge uses papaparse under the hood to parse CSV files.

Papaparse allows reading from a stream when used in a Node environment (https://github.com/mholt/PapaParse/blob/master/README.md#papa-parse-for-node).

Could the library support this option?

One idea would be to make dataForge.fromCSV() accept either a string or a stream.
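
To make the "string or stream" idea concrete, here is a minimal sketch of the input handling. The helper name `csvTextFrom` is hypothetical and not part of data-forge. It only normalises the input and still buffers the whole stream in memory, so it illustrates the shape of the interface rather than a way to reduce memory use.

```typescript
import { Readable } from "stream";

// Hypothetical helper (not a data-forge API): accepts CSV as a string or as a
// Node.js Readable and resolves to the complete CSV text.
function csvTextFrom(input: string | Readable): Promise<string> {
    if (typeof input === "string") {
        return Promise.resolve(input);
    }
    return new Promise<string>((resolve, reject) => {
        const chunks: Buffer[] = [];
        // Streams may emit strings or Buffers depending on how they were created.
        input.on("data", (chunk: Buffer | string) => {
            chunks.push(typeof chunk === "string" ? Buffer.from(chunk, "utf8") : chunk);
        });
        input.on("end", () => resolve(Buffer.concat(chunks).toString("utf8")));
        input.on("error", reject);
    });
}
```

The resolved text could then be passed to the existing `dataForge.fromCSV(text)` call, which takes a string.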

ashleydavis commented 4 years ago

It's always been the plan to support this and I even tried to implement it once. The problem is that it might require a very different interface and so I might have to save it for data-forge version 2.

I will come back to this again at some point and rethink it.

In the meantime, if you have any proposal on how this should work I'd love to discuss it with you!

olawalejuwonm commented 2 years ago

@ashleydavis I have an idea about it, and I can work on it, because I really need this at the moment.

ashleydavis commented 2 years ago

Hey @olawalejuwonm, I'd love to see if you could implement this. If it fits well I'd definitely like to include it in the library.

rhesus commented 2 years ago

@olawalejuwonm did you have any success enabling streaming in papaparse, or looking into some other CSV library? I want to use data-forge but am having problems with memory consumption, even for smaller files.

@ashleydavis I saw you split out the file system access, do you have any thoughts about trying to utilize temp files to help "batch data" and reduce memory usage?

ashleydavis commented 2 years ago

@rhesus I've decided to not attempt to implement streaming in Data-Forge. It's something I always wanted, but actually not something I ever turned out to need.

I'm more than happy for anyone to present a plan for adding streaming data to reduce memory usage.

A first step would be to create a project in GitHub that runs out of memory while processing a data file. That would give us something to centre our discussions on.

rhesus commented 2 years ago

That's fair. I've been wanting to use it inside of lambdas and I've experienced several OOM issues. It's probably just a case of trying to use the wrong tool for the job.

ashleydavis commented 2 years ago

Have you tried just breaking your data into smaller bundles that can be processed separately?

That's probably easier than trying to figure out how to upgrade Data-Forge.
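
As a rough sketch of the "smaller bundles" approach, CSV text could be split into batches that each repeat the header row, so every batch is a valid CSV document on its own. The helper `splitCsvIntoBatches` is hypothetical (not part of data-forge), and it assumes no quoted field contains a line break.

```typescript
// Hypothetical helper (not a data-forge API): splits CSV text into smaller CSV
// strings, each carrying the original header row plus up to rowsPerBatch rows.
// Assumption: no quoted field contains an embedded newline.
function splitCsvIntoBatches(csv: string, rowsPerBatch: number): string[] {
    if (rowsPerBatch < 1) {
        throw new Error("rowsPerBatch must be at least 1");
    }
    // Drop empty lines, such as the one left by a trailing newline.
    const lines = csv.split(/\r?\n/).filter(line => line.length > 0);
    if (lines.length === 0) {
        return [];
    }
    const header = lines[0];
    const rows = lines.slice(1);
    const batches: string[] = [];
    for (let i = 0; i < rows.length; i += rowsPerBatch) {
        batches.push([header].concat(rows.slice(i, i + rowsPerBatch)).join("\n"));
    }
    return batches;
}
```

Each batch could then be loaded on its own with `dataForge.fromCSV(batch)` and the partial results (counts, sums and so on) combined afterwards, so only one batch is held as a DataFrame at a time.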

olawalejuwonm commented 2 years ago

> Hey @olawalejuwonm, I'd love to see if you could implement this. If it fits well I'd definitely like to include it in the library.

Yes, can I open a PR for it?

ashleydavis commented 2 years ago

@olawalejuwonm of course!

A good way to start would be to log an issue describing how you would integrate the feature. Then we can discuss it there.

olawalejuwonm commented 2 years ago

Sorry, I'm very familiar with JavaScript but quite new to TypeScript. Can you guide me on how to go about my first contribution on this, @ashleydavis?

ashleydavis commented 2 years ago

If you are new to TypeScript, I'd suggest you learn some before trying to contribute.

Then you can proceed in one of two ways:

  • Log an issue and describe what you want to achieve, how you think you might achieve it and we can discuss from there.
  • Or feel free to fork and hack something in, then we can discuss how to get to a pull request.

olawalejuwonm commented 2 years ago

Alright. Thank you