
Asynchronous dataset crawling #7

Open · EricDiao opened 5 years ago

EricDiao commented 5 years ago

The learning algorithm part needs data from a fixed time window, so a "cache" of recent data is needed.

Plan for solving this is:

  1. Label each batch of data with a timestamp;
  2. Use multiprocessing to create two processes: one for the actual training, one for fetching data from FR24;
  3. Use a multiprocessing.Queue for inter-process data transfer (see the sketch after this list).
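
A minimal sketch of this plan, using illustrative names only (provider and the placeholder payload are not the real project functions):

import time
from multiprocessing import Process, Queue

def provider(q):
    # Step 1: label the batch with a UNIX timestamp before sending it.
    q.put({time.time(): ["flight records would go here"]})

if __name__ == "__main__":
    q = Queue()
    p = Process(target=provider, args=(q,))  # Step 2: a second process
    p.start()
    print(q.get())  # Step 3: the queue blocks until the provider puts a batch
    p.join()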

Add this issue to #4.

EricDiao commented 5 years ago

The asynchronous dataset crawler is implemented in /atc.py and /data_sources/flightradar24Crawler.py.

This feature consists of two parts. The first is called data_provider and is implemented in /data_sources/flightradar24Crawler.py (the function crawlFR24MultiprocessingWrapper). It does only one thing: fetch data from FlightRadar24 and put it into a multiprocessing.Queue object that is also accessible to the data_consumer discussed below.
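
A sketch of the shape this takes, assuming crawlFR24 returns the current list of flights; the real signatures in /data_sources/flightradar24Crawler.py may differ:

import time
from multiprocessing import Queue

def crawlFR24():
    # Stub standing in for the repo's real crawler, which queries FlightRadar24.
    return []

def crawlFR24MultiprocessingWrapper(data_queue, interval=10.0):
    # Runs forever in its own process: fetch, timestamp, enqueue, sleep.
    while True:
        flight_data = crawlFR24()
        data_queue.put({time.time(): flight_data})
        time.sleep(interval)  # the polling interval here is an assumption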

The second part is data_consumer. data_consumer shall be implemented by the learning algorithm part; it takes at least one parameter, data_queue. An example is implemented as data_consumer in atc.py.
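
A sketch of the consumer contract: any data_consumer accepts data_queue as its first parameter and blocks on it; the training step is a placeholder:

def data_consumer(data_queue):
    while True:
        batch = data_queue.get()  # blocks until data_provider puts a batch
        (timestamp, flight_data), = batch.items()  # one timestamp key per batch
        # ... feed (timestamp, flight_data) to the learning algorithm here ...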

The main entry point of our program is atc.py. It is also where data_provider and data_consumer are invoked.
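
A sketch of how atc.py plausibly wires the two halves together; the exact arguments passed to crawlFR24MultiprocessingWrapper are an assumption:

from multiprocessing import Process, Queue
from data_sources.flightradar24Crawler import crawlFR24MultiprocessingWrapper

def data_consumer(data_queue):
    # Placeholder; the real consumer runs the training loop (see above).
    while True:
        print(data_queue.get())

if __name__ == "__main__":
    data_queue = Queue()
    provider = Process(target=crawlFR24MultiprocessingWrapper,
                       args=(data_queue,),  # extra args, if any, are an assumption
                       daemon=True)
    provider.start()
    data_consumer(data_queue)  # the main process acts as the consumer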

@Hang14 See if this implementation is feasible.

EricDiao commented 5 years ago

The schema of the data that data_provider provides is described below:

{
  timestamp: flight_data,
}

This is basically a Python dict, where timestamp is a UNIX timestamp as a float and flight_data is a list of all flights fetched at that time point, in the format described in crawlFR24 in /data_sources/flightradar24Crawler.py.
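
For example, one message can be unpacked like this (the flight fields shown are illustrative; the real ones are defined by crawlFR24):

import time

message = {time.time(): [{"callsign": "EXAMPLE1"}]}  # one timestamped batch
(timestamp, flight_data), = message.items()  # exactly one key per message
print(len(flight_data), "flights at UNIX time", timestamp)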