ResidentMario / urban-physiology-old

Urban Physiology project metarepository (old; see urban-physiology-toolkit for more up-to-date code).

Resource writer doesn't properly handle non-zipped blobs #1

Closed ResidentMario closed 7 years ago

ResidentMario commented 7 years ago

An assumption that I made going in is that Socrata blobs are zipfiles provided as a link of the form https://data.cityofnewyork.us/download/ft4n-yqee/application%2Fzip (the particularity being the last part: application%2Fzip).

This assumption is incorrect: see for example this dataset, which externalizes as application%2Fvnd.ms-excel (XLSX once downloaded).

The dataset writer (socrata_reducer.write_dataset_representation) is capable of dealing with these filetypes, but it is given the wrong link by the resource writer (socrata_reducer.write_resource_representation). The latter ought to provide a listing of the format:

{
    "endpoint": "vy67-bzq3",
    "resource": "https://data.cityofnewyork.us/download/dja4-zgtf/application%2Fvnd.ms-excel",
    "flags": [
        "processed"
    ]
},

Instead, it provides a listing of the format:

{
    "endpoint": "vy67-bzq3",
    "resource": "https://data.cityofnewyork.us/download/dja4-zgtf/application%2Fzip",
    "flags": [
        "processed"
    ]
},

Fixing this requires examining the endpoint page directly to extract the download URL. This is something that needs to be done for links anyway, at which point it can be extended to blobs as well.
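As a rough illustration of the URL shape involved (not the actual fix), here is a minimal Python sketch that builds and inspects download links from a MIME type instead of hardcoding application%2Fzip. The helper names blob_download_url and blob_mime_type are hypothetical, and the sketch assumes the MIME type has already been read off the endpoint page, which is the part not shown here.

```python
from urllib.parse import quote, unquote, urlsplit


def blob_download_url(domain, endpoint, mime_type):
    """Build a download link of the form seen in this issue (hypothetical helper)."""
    # safe="" forces "/" to be escaped as %2F, matching the observed links.
    return f"https://{domain}/download/{endpoint}/{quote(mime_type, safe='')}"


def blob_mime_type(url):
    """Recover the MIME type from the last path segment of a download link."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return unquote(last_segment)


if __name__ == "__main__":
    url = blob_download_url(
        "data.cityofnewyork.us", "dja4-zgtf", "application/vnd.ms-excel"
    )
    print(url)
    print(blob_mime_type(url))
```

Going the other way (blob_mime_type) might be useful for the dataset writer, since it could pick a parser from the link itself without assuming a zipfile.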

ResidentMario commented 7 years ago

Done, but it still needs to be rerun. See #2.