etienne87 / pytorch-stream-dataloader

MIT License
48 stars · 3 forks

Using torch multiprocessing over python multiprocessing #1 (label: question)

Closed: Azharo closed this 3 years ago

Azharo commented 4 years ago

Hi,

I have been looking to build something similar to what you have in pytorch_iterable.py. I have an extremely large text dataset (millions of documents, each yielding anywhere from 10-50 batches), so I'm treating it as a multi-stream dataloading problem. My initial plan is to use Python multiprocessing to set up a queue: a producer walks the large list of documents and enqueues each one for the consumers (standard text dataloaders) to take in, roughly like the sketch below. Any reason you went with a torch queue here instead? I'm new to PyTorch, so I'm still figuring out the ins and outs of the data loader.
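Roughly the shape I have in mind (just a sketch; the file names and `read_document_batches` are placeholders for my actual tokenization/batching):

```python
import multiprocessing as mp

def read_document_batches(path):
    # Placeholder tokenizer/batcher: split one document into token batches.
    with open(path) as f:
        tokens = f.read().split()
    for i in range(0, len(tokens), 128):
        yield tokens[i:i + 128]

def producer(doc_paths, task_q, num_consumers):
    # Enqueue every document, then one sentinel per consumer.
    for path in doc_paths:
        task_q.put(path)
    for _ in range(num_consumers):
        task_q.put(None)

def consumer(task_q, batch_q):
    # Pull documents until the sentinel, forward batches downstream.
    while True:
        path = task_q.get()
        if path is None:
            break
        for batch in read_document_batches(path):
            batch_q.put(batch)

if __name__ == "__main__":
    doc_paths = ["doc_%07d.txt" % i for i in range(1_000_000)]
    task_q, batch_q = mp.Queue(maxsize=1024), mp.Queue(maxsize=64)
    consumers = [mp.Process(target=consumer, args=(task_q, batch_q))
                 for _ in range(4)]
    for c in consumers:
        c.start()
    producer(doc_paths, task_q, num_consumers=4)
```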

etienne87 commented 4 years ago

Hello Azharo, thanks for having a look!

So I used the PyTorch Queue simply from reading the original implementation of the PyTorch DataLoader; I thought it might matter for the "memory pinning" to CUDA, so I left it there. :see_no_evil: (I will try a classical queue to see whether data transfer to the GPU slows down.)
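To illustrate what I mean, the only code difference is the import; the torch queue shares tensor storage between processes instead of pickling it (a simplified sketch, not the actual loader code):

```python
import torch
import torch.multiprocessing as mp  # drop-in replacement for multiprocessing

def worker(q):
    # With torch.multiprocessing, the tensor's storage is moved to shared
    # memory and only a handle goes through the queue, instead of the whole
    # byte buffer being pickled as with a plain multiprocessing.Queue.
    q.put(torch.randn(4, 3, 224, 224))

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    batch = q.get()
    batch = batch.pin_memory()  # pinning happens consumer-side, before .cuda()
    p.join()
```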

By the way, you can adapt this code to stream text instead of video by replacing the "process_data" function (https://github.com/etienne87/pytorch-streamloader/blob/master/pytorch_iterable.py#L32) with your own streamer.
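For text it could be as simple as a generator over token windows (the signature here is only illustrative; adapt it to match the actual one in the repo):

```python
def process_data(filename, num_tbins=10):
    # Illustrative text streamer: yield consecutive windows of tokens from
    # one document, the way the video version yields groups of frames.
    with open(filename) as f:
        tokens = f.read().split()
    for start in range(0, len(tokens), num_tbins):
        yield tokens[start:start + num_tbins]
```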

I should probably make this class generic so that it takes any streamer as input.

Azharo commented 4 years ago

Thanks for your response! Switching from video to text seems pretty straightforward. Your still-TBD section "Scrapping Articles from internet and streaming them" is probably the closest to what I'm trying to do: the multistreamer just grabs documents, and the individual dataloaders process them and send back batched text as a stream. I still have to read up a bit on "memory pinning" to CUDA, as I'm still getting a sense of how all this should work.

etienne87 commented 4 years ago

Could you help with this in the repository if you know how to stream articles? I was looking into the Google News API to stream articles directly from the web, but it was pretty slow...

Azharo commented 4 years ago

Sorry, I should correct that: I am streaming from an Azure blob that holds hundreds of thousands of text files (downloaded from the arXiv database). I spent a few hours today going over your code, and it actually works really well!

For the internet streamer, are you looking into anything specific? If it's news, your best bet is to grab a few free-tier APIs and rotate through them until you hit each one's daily limit. The other option is to build a web scraper, but that is a pretty big undertaking. A good API is newsapi.org, but their free tier is only 500 requests a day. If all you want is to test streaming news as the input, try https://iexcloud.io/docs/api/#streaming-news; their free tier is 50k messages per month.

I guess you now need to use a Python multiprocessing producer-consumer protocol as part of your IterableDataset for the cases where you feed it a source rather than a data_list.

etienne87 commented 4 years ago

Great that the code works well! I guess you tried the "pytorch_iterable" one, right? In the end this solution, built on PyTorch's DataLoader (original idea from https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd), seems robust and less risky than the custom one, right?

Thanks for the information about the streaming-news API, I will try to use it for the example! What I think would be very cool is to demonstrate a dataloader where the data is not stored on disk but streamed over the internet, via YouTube downloads or news streaming.

About the producer-consumer: yes, it would be more convenient to iterate over the streaming sources and put them into a general fetch queue for the Python processes to read from, instead of using the "split dataset", especially since the streaming sources might not be equally partitioned. Something like the sketch below, perhaps.
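A rough sketch of that idea (`open_stream` is a placeholder for a per-source reader; note a multiprocessing queue has to be inherited by the workers, so this assumes fork-started processes):

```python
import torch.multiprocessing as mp
from torch.utils.data import IterableDataset

class FetchQueueDataset(IterableDataset):
    """Each worker pulls whole sources from a shared queue instead of
    owning a fixed split, so unevenly sized streams balance themselves."""

    def __init__(self, source_queue):
        self.source_queue = source_queue

    def __iter__(self):
        while True:
            source = self.source_queue.get()
            if source is None:  # sentinel: no more sources
                return
            for item in open_stream(source):
                yield item

def open_stream(source):
    # Placeholder: open a video, a text file, a web stream... and yield items.
    yield from source

if __name__ == "__main__":
    source_queue = mp.Queue()
    for src in (["a", "b"], ["c"], ["d", "e", "f"]):  # unevenly sized streams
        source_queue.put(src)
    source_queue.put(None)       # one sentinel per worker/iterator

    dataset = FetchQueueDataset(source_queue)
    for item in dataset:         # single-process demo; with a DataLoader you'd
        print(item)              # use num_workers > 0 and one sentinel each
```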