BCCN-Prog / weather_2016

For the BCCN 2016 advanced programming project

Scraping: provide the correct dict formats! #35

Closed denisalevi closed 8 years ago

denisalevi commented 8 years ago

Here is the minimal dictionary that @janfb set up. Everybody, please check that your output has the right form (using the test from @janfb?). @janfb, can you give an example of how to use your test?

@clauslang @ClaudiaWinklmayr @inesw @akresnia @gkBCCN

From @janfb :

OK guys here is the example minimal dictionary required by the tests I wrote:

{'city': 'berlin',
 'daily': {'1': {'high': 7.0,
                 'low': 3.0,
                 'rain_amt': 1.325,
                 'rain_chance': 78.75,
                 'wind_speed': 13.5},
           ...},
 'date': 26042016,
 'hourly': {'00': {'humidity': 65.0,
                   'rain_amt': 0.0,
                   'rain_chance': 15.0,
                   'temp': 5.0,
                   'wind_speed': 3.33},
            ...},
 'site': 'wetter.com'}

Note that date is an int, only the city and site values are strings, and all other values are floats. Make sure to use the same keys in your dicts:

top-level keys: ['site', 'city', 'date', 'daily', 'hourly']
daily keys: ['high', 'low', 'rain_chance', 'rain_amt', 'wind_speed']
hourly keys: ['temp', 'wind_speed', 'rain_chance', 'rain_amt', 'humidity']
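As a rough illustration of the rules in this comment, a format check could look like the sketch below. The names check_format, TOP_LEVEL_KEYS, DAILY_KEYS and HOURLY_KEYS are hypothetical and this is not the actual test script from @janfb; it only assumes the keys and types listed here (date as int, city and site as strings, everything else as floats).

```python
# Hypothetical sketch of a format check; not the real test_scraper_output module.
TOP_LEVEL_KEYS = {'site', 'city', 'date', 'daily', 'hourly'}
DAILY_KEYS = {'high', 'low', 'rain_chance', 'rain_amt', 'wind_speed'}
HOURLY_KEYS = {'temp', 'wind_speed', 'rain_chance', 'rain_amt', 'humidity'}


def check_format(data):
    """Return a list of format problems; an empty list means the dict looks fine."""
    problems = []
    if set(data) != TOP_LEVEL_KEYS:
        # without the right top-level keys the remaining checks make no sense
        problems.append('top-level keys: %s' % sorted(data))
        return problems
    if not isinstance(data['date'], int):
        problems.append('date must be an int')
    for key in ('site', 'city'):
        if not isinstance(data[key], str):
            problems.append('%s must be a string' % key)
    for section, expected in (('daily', DAILY_KEYS), ('hourly', HOURLY_KEYS)):
        for label, values in data[section].items():
            if set(values) != expected:
                problems.append('%s[%r] keys: %s' % (section, label, sorted(values)))
            elif not all(isinstance(v, float) for v in values.values()):
                problems.append('%s[%r] has non-float values' % (section, label))
    return problems


# The example from this comment, reduced to one daily and one hourly entry.
example = {'city': 'berlin',
           'daily': {'1': {'high': 7.0, 'low': 3.0, 'rain_amt': 1.325,
                           'rain_chance': 78.75, 'wind_speed': 13.5}},
           'date': 26042016,
           'hourly': {'00': {'humidity': 65.0, 'rain_amt': 0.0, 'rain_chance': 15.0,
                             'temp': 5.0, 'wind_speed': 3.33}},
           'site': 'wetter.com'}
problems = check_format(example)
```

Returning a list of problems instead of a single boolean is a design choice of this sketch; it makes it easier to see which key or value is off.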

janfb commented 8 years ago

@clauslang @ClaudiaWinklmayr @inesw @akresnia @gkBCCN Here is the updated version of the minimal dict:

{'city': 'cassel',
 'daily': {'1': {'high': 7.0,
                 'low': 3.0,
                 'rain_amt': 1.325,
                 'rain_chance': 78.75,
                 'wind_speed': 13.5},
           ...},
 'date': 26042016,
 'hourly': {'00': {'humidity': 65.0,
                   'rain_amt': 0.0,
                   'rain_chance': 15.0,
                   'temp': 5.0,
                   'wind_speed': 3.33},
            ...},
 'site': 1}

Accepted city names are now in English: ["berlin", "hamburg", "munich", "cologne", "frankfurt", "stuttgart", "bremen", "leipzig", "hanover", "nuremberg", "dortmund", "dresden", "cassel", "kiel", "bielefeld", "saarbruecken", "rostock", "freiburg", "magdeburg", "erfurt"]

Accepted site IDs: 0, 1, 2, 3, 4
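A minimal sketch of how the city and site checks from this comment could be expressed, assuming the updated format where site is an integer ID. The function name has_valid_city_and_site and the constant names are made up for illustration and are not taken from the actual test script.

```python
# Hypothetical sketch; the lists are copied from this comment.
ACCEPTED_CITIES = ["berlin", "hamburg", "munich", "cologne", "frankfurt",
                   "stuttgart", "bremen", "leipzig", "hanover", "nuremberg",
                   "dortmund", "dresden", "cassel", "kiel", "bielefeld",
                   "saarbruecken", "rostock", "freiburg", "magdeburg", "erfurt"]
ACCEPTED_SITE_IDS = (0, 1, 2, 3, 4)


def has_valid_city_and_site(data):
    """True if data['city'] is an accepted name and data['site'] an accepted int ID."""
    site = data['site']
    return (data['city'] in ACCEPTED_CITIES
            and isinstance(site, int)
            and site in ACCEPTED_SITE_IDS)


ok = has_valid_city_and_site({'city': 'cassel', 'site': 1})
old_style = has_valid_city_and_site({'city': 'berlin', 'site': 'wetter.com'})
```

With these lists, a dict that still uses the site name as a string is rejected, which is the main thing to watch for when switching from the first version of the format.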

janfb commented 8 years ago

@clauslang @ClaudiaWinklmayr @inesw @akresnia @gkBCCN This is how I use the test:

...
import test_scraper_output as tester

def scrape(date, city):
    """Scrape data for the given date and city.

    :param date: date string in the format 30-05-2016
    :param city: English city name, e.g. cologne, cassel, munich
    """
    # build the integer date id, e.g. '30-05-2016' -> 30052016
    date_int = int(date.replace('-', ''))
    # assemble the full data dictionary
    data_dic = {'site': 1,  # 'wetter.com' has id 1
                'city': city,
                'date': date_int,
                'hourly': scrape_hourly(date, city),
                'daily': scrape_daily(date, city)}
    # run the format tests; run_tests returns True if all of them pass
    assert tester.run_tests(data_dic)
    # TODO: add data to the database
    # returns nothing
...

I import the test script as 'tester'. In the scrape function I build the full data dictionary and then call run_tests(data_dic), passing it the dictionary in the above format. The function returns True if all tests pass.
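One detail of the integer date ID is worth spelling out: for days below 10 the leading zero disappears when the string is turned into an int. The sketch below shows the conversion and a possible way back; date_to_int and int_to_date are hypothetical helper names and not part of the project code.

```python
from datetime import datetime


def date_to_int(date):
    """Turn 'DD-MM-YYYY' into the integer date ID, e.g. '30-05-2016' -> 30052016."""
    return int(date.replace('-', ''))


def int_to_date(date_int):
    """Turn the integer date ID back into 'DD-MM-YYYY'."""
    # zero-pad to 8 digits to restore a leading zero dropped by int()
    text = '%08d' % date_int
    # strptime also checks that the digits form a real calendar date
    return datetime.strptime(text, '%d%m%Y').strftime('%d-%m-%Y')


late_may = date_to_int('30-05-2016')
early_june = date_to_int('05-06-2016')   # only 7 digits after conversion
round_trip = int_to_date(early_june)
```

Anything that reads the ID back, for example a database query by date, therefore has to pad it to eight digits first.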

janfb commented 8 years ago

Let me know if something does not work out. We probably have to adapt the tests for every provider, see #31

janfb commented 8 years ago

The tests will be a bit less strict about the city names in the dictionary. You can use any of ["berlin", "hamburg", "munich", "cologne", "frankfurt", "stuttgart", "bremen", "leipzig", "hanover", "nuremberg", "dortmund", "dresden", "kassel", "kiel", "bielefeld", "saarbruecken", "rostock", "freiburg", "magdeburg", "erfurt", "saarbrücken", "münchen", "koeln", "nuernberg", "köln"]. I can also add more if yours are different.

janfb commented 8 years ago

If you need a function that finds the full file name of the HTML file given only the date and city, you could use:

import os


def get_filename(dirpath, date, city, mode='hourly'):
    """Look up the name of the HTML file in dirpath for the given date and city.

    :param dirpath: relative path to the data directory
    :param date: date in the format 31-05-2016
    :param city: city as string
    :param mode: 'daily' or 'hourly'
    :return: the matching file name (without dirpath), or None if nothing matches
    """
    path = None
    for f in os.listdir(dirpath):
        # the file name must contain date, city and mode; the last match wins
        if date in f and city in f and mode in f:
            path = f
    return path

Applies to #36
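A short usage sketch for get_filename, repeated here so the example runs on its own. The file names are invented for illustration; the real naming scheme of the saved HTML files is not shown in this thread, and the lookup only assumes that date, city and mode all appear somewhere in the name.

```python
import os
import tempfile


def get_filename(dirpath, date, city, mode='hourly'):
    """Same lookup as above: last file name containing date, city and mode, else None."""
    path = None
    for f in os.listdir(dirpath):
        if date in f and city in f and mode in f:
            path = f
    return path


with tempfile.TemporaryDirectory() as dirpath:
    # hypothetical file names, created empty just for this demonstration
    for name in ('31-05-2016_berlin_hourly.html', '31-05-2016_berlin_daily.html'):
        open(os.path.join(dirpath, name), 'w').close()

    found = get_filename(dirpath, '31-05-2016', 'berlin', mode='daily')
    missing = get_filename(dirpath, '31-05-2016', 'kiel')
    # the helper returns a bare file name, so join it with dirpath before opening
    full_path = os.path.join(dirpath, found)
```

Because the result is None when nothing matches, callers should check for that before joining it with the directory path.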

denisalevi commented 8 years ago

Done from my side. Tests are passing.

clauslang commented 8 years ago

Everyone has implemented this, and almost everyone has implemented the test call; see #31.