ConnectedPlacesCatapult / TomboloDigitalConnector

The Tombolo Digital Connector enables users to combine different sources of data in a transparent and reproducible way.
MIT License

Problem executing DigitalConnector recipe #545

Open thomiko opened 6 years ago

thomiko commented 6 years ago

{
  "dataset": {
    "subjects": [
      {
        // The output subjects are all LSOAs
        "provider": "uk.gov.ons",
        "subjectType": "lsoa",
        "matchRule": {
          "attribute": "name",
          "pattern": "London%"
        }
      }
    ],
    "datasources": [
      {
        "importerClass": "uk.org.tombolo.importer.dft.AccessibilityImporter",
        "datasourceId": "acs0507"
      },
      {
        // Importer for LSOA geographies
        "importerClass": "uk.org.tombolo.importer.ons.OaImporter",
        "datasourceId": "lsoa"
      }
    ],
    "fields": [
      {
        // Area of LSOA
        "fieldClass": "uk.org.tombolo.field.value.LatestValueField",
        "label": "component:Travel time",
        "attribute": {
          "provider": "uk.gov.dft",
          "label": "SUPO008"
        }
      }
    ]
  },
  "exporter": "uk.org.tombolo.exporter.GeoJsonExporter"
}

borkurdotnet commented 6 years ago

The LSOA names in London start with the name of the borough, and the LSOA labels do not follow a pattern. However, all boroughs in London have a label starting with the string E090. Hence the way to get all LSOAs in London is:

{
  "subjectType": "lsoa",
  "provider": "uk.gov.ons",
  "geoMatchRule": {
    "geoRelation": "within",
    "subjects": [
      {
        "subjectType": "localAuthority",
        "provider": "uk.gov.ons",
        "matchRule": {
          "attribute": "label",
          "pattern": "E090%"
        }
      }
    ]
  }
}
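To make the matching problem concrete, here is a minimal Java sketch. It assumes the recipe's "pattern" behaves like a SQL LIKE pattern ('%' matches any run of characters), which the thread does not confirm. The class name and the sample name/label values are illustrative only and are not taken from the Digital Connector code base.

```java
import java.util.regex.Pattern;

// Illustrative only: assumes "pattern" follows SQL LIKE semantics,
// where '%' matches any run of characters and '_' matches one character.
final class LikePatternSketch {

    // Translate a LIKE-style pattern into a regular expression.
    static Pattern toRegex(String likePattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : likePattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                // Quote every literal character so it has no regex meaning.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True only if the whole value matches the LIKE-style pattern.
    static boolean matches(String likePattern, String value) {
        return toRegex(likePattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        // Sample LSOA name: it starts with the borough name, not with "London".
        System.out.println(matches("London%", "City of London 001A"));
        // Sample LSOA label: it does not start with E090.
        System.out.println(matches("E090%", "E01000001"));
        // Sample local authority label for a London borough.
        System.out.println(matches("E090%", "E09000001"));
    }
}
```

Under that assumption, only the local authority label matches, which is why the LSOAs are selected indirectly through a geoMatchRule on the boroughs.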

thomiko commented 6 years ago

After having downloaded the external Excel file, the gradle export fails with a Java OutOfMemoryError:

Downloading external resource: https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/357469/acs0507.xls
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.poi.hssf.usermodel.HSSFRow.createCellFromRecord(HSSFRow.java:223)

My laptop has 8GB of RAM, so that's not necessarily the problem. Is there a way for the gradle runExport process to assign more RAM to the process, for example by means of a commandline parameter or from an ini file?

For example in an R script you can do: options(java.parameters = "-Xmx4g" )

This often avoids Java outOfMemory problems because by default an R script only receives 1GB of RAM.

borkurdotnet commented 6 years ago

You could look at changing the value for the runExport process in the build.gradle
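The contents of build.gradle are not shown in this thread, so one way to check whether a changed heap setting actually reaches a JVM is to print the maximum heap from inside it. The following is a small standalone sketch (the class name HeapCheck is made up for illustration and is not part of the Digital Connector):

```java
// Prints the maximum heap size this JVM will attempt to use, so you can
// confirm whether an -Xmx option (or an equivalent build setting) took effect.
final class HeapCheck {

    // Maximum heap in mebibytes, as reported by the running JVM.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Compiling this and running it as `java -Xmx4g HeapCheck` should report a value close to 4096 MiB. If the export process reports far less, the setting has probably not been applied to that process.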

thomiko commented 6 years ago

If I want to retrieve the green areas info for London similar to the 'greenspace-hertfordshire.json' example recipe, what do I have to put in here?

[Screenshot: green-areas-recipe]

If I do it this way, I get the following error message:

-----> TASK FAILED: http://download.geofabrik.de/europe/great-britain/england/london-latest.osm.pbf<-----

java.io.FileNotFoundException: http://download.geofabrik.de/europe/great-britain/england/london-latest.osm.pbf

borkurdotnet commented 6 years ago

The OSM file for London is called: europe/great-britain/england/greater-london

See here for a list of the areas you can use:

https://download.geofabrik.de/europe/great-britain/england.html

thomiko commented 6 years ago

What does this targetCRSCode refer to? Is it a reference to the location being queried? It's from the green areas example recipe.

[Screenshot: green-areas-recipe2]

thomiko commented 6 years ago

Back to the Java OutOfMemory error: I set all the MaxHeapSize values to the maximum value (i.e. the RAM available)

[Screenshot: maxheapsize_2]

[Screenshot: maxheapsize_3]

With these settings the export process ran much longer (~ 40 mins) than before but eventually failed again with an OutOfMemoryError:

2018-03-17 18:10:10.160 [main] INFO u.org.tombolo.importer.DownloadUtils - Fetching local file: C:\tmp\TomboloData\uk.gov.dft\767f0676-d56b-3d2f-ab29-754898185b8e.xls
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hssf.usermodel.HSSFRow.createCellFromRecord(HSSFRow.java:223)

borkurdotnet commented 6 years ago

In this case the target code determines the unit used for the area. With code 4326 (WGS 84) the unit is degrees, which gives hard-to-interpret numbers for area. Code 27700 (British National Grid) uses metres, so the results are easier to interpret.
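To illustrate the difference in units, here is a rough Java sketch that compares the "area" of a small longitude/latitude box in square degrees with an approximate area in square metres. It uses a simple spherical approximation, not the actual 27700 projection, and the class and method names are made up for this example.

```java
// Rough illustration of why areas computed in degrees are hard to interpret.
// The metric figure is a spherical approximation, not a real reprojection.
final class AreaUnitsSketch {

    static final double EARTH_RADIUS_M = 6_371_008.8; // mean Earth radius in metres

    // Area computed naively on raw lon/lat coordinates (square degrees).
    static double areaSquareDegrees(double widthDeg, double heightDeg) {
        return widthDeg * heightDeg;
    }

    // Approximate area in square metres of the same small box at a given latitude.
    static double approxAreaSquareMetres(double widthDeg, double heightDeg, double latitudeDeg) {
        double metresPerDegree = Math.toRadians(1.0) * EARTH_RADIUS_M;
        double heightM = heightDeg * metresPerDegree;
        // A degree of longitude shrinks with the cosine of the latitude.
        double widthM = widthDeg * metresPerDegree * Math.cos(Math.toRadians(latitudeDeg));
        return widthM * heightM;
    }

    public static void main(String[] args) {
        double londonLatitude = 51.5; // approximate latitude of London
        System.out.println("Square degrees: " + areaSquareDegrees(0.01, 0.01));
        System.out.println("Square metres (approx.): "
                + approxAreaSquareMetres(0.01, 0.01, londonLatitude));
    }
}
```

A box of 0.01 by 0.01 degrees near London has an area of about 0.0001 in square degrees but roughly three quarters of a square kilometre in metric terms, which is the kind of number a metre-based CRS such as 27700 gives you directly.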

thomiko commented 6 years ago

So for London I should use the 27700 code as well?

borkurdotnet commented 6 years ago

Yes

borkurdotnet commented 6 years ago

Regarding the out of memory ... did you change the value for runExport as well?

But more generally, I think that we can conclude that we need to look at this importer after the weekend and find a more scalable solution.

thomiko commented 6 years ago

When I 'runExport' the following recipe to get the green areas for London, the build's successful in 10 seconds but the output file is almost empty:

{
  "dataset": {
    "subjects": [
      {
        // The output subjects are all LSOAs
        "provider": "uk.gov.ons",
        "subjectType": "lsoa",
        "matchRule": {
          "attribute": "name",
          "pattern": "E090%"
        }
      }
    ],
    "datasources": [
      {
        // Importer for LSOA geographies
        "importerClass": "uk.org.tombolo.importer.ons.OaImporter",
        "datasourceId": "lsoa"
      },
      {
        // Green space data for the entire UK
        "importerClass": "uk.org.tombolo.importer.osm.OSMImporter",
        "datasourceId": "OSMGreenspace",
        "geographyScope": ["europe/great-britain/england/greater-london"]
      }
    ],
    "fields": [
      {
        // Proportion of green space
        "fieldClass": "uk.org.tombolo.field.transformation.ArithmeticField",
        "label": "index:GreenspaceFraction",
        "operation": "div",
        "field1": {
          // Sum of green space areas
          "fieldClass": "uk.org.tombolo.field.aggregation.GeographicAggregationField",
          "label": "GreenspaceSum",
          "subject": {
            "provider": "org.openstreetmap",
            "subjectType": "OSMEntity"
          },
          "function": "sum",
          "field": {
            "fieldClass": "uk.org.tombolo.field.assertion.OSMBuiltInAttributeMatcherField",
            "label": "AreaGreenspace",
            "attributes": [
              {
                "provider": "org.openstreetmap",
                "label": "built-in-greenspace"
              }
            ],
            "field": {
              // Area of LSOA
              "fieldClass": "uk.org.tombolo.field.transformation.AreaField",
              "label": "AreaLSOA",
              "targetCRSCode": 27700
            }
          }
        },
        "field2": {
          // Area of LSOA
          "fieldClass": "uk.org.tombolo.field.transformation.AreaField",
          "label": "AreaLSOA",
          "targetCRSCode": 27700
        }
      },
      {
        // Sum of green space areas
        "fieldClass": "uk.org.tombolo.field.aggregation.GeographicAggregationField",
        "label": "component:GreenspaceSum",
        "subject": {
          "provider": "org.openstreetmap",
          "subjectType": "OSMEntity"
        },
        "function": "sum",
        "field": {
          "fieldClass": "uk.org.tombolo.field.assertion.OSMBuiltInAttributeMatcherField",
          "label": "AreaGreenspace",
          "attributes": [
            {
              "provider": "org.openstreetmap",
              "label": "built-in-greenspace"
            }
          ],
          "field": {
            // Area of LSOA
            "fieldClass": "uk.org.tombolo.field.transformation.AreaField",
            "label": "AreaLSOA",
            "targetCRSCode": 27700
          }
        }
      },
      {
        // Area of LSOA
        "fieldClass": "uk.org.tombolo.field.transformation.AreaField",
        "label": "component:AreaLSOA",
        "targetCRSCode": 27700
      }
    ]
  },
  "exporter": "uk.org.tombolo.exporter.GeoJsonExporter"
}

What's wrong with it?

Output: {"type":"FeatureCollection","features":[]}

borkurdotnet commented 6 years ago

The subject specification for LSOAs in London is:

{
  "subjectType": "lsoa",
  "provider": "uk.gov.ons",
  "geoMatchRule": {
    "geoRelation": "within",
    "subjects": [
      {
        "subjectType": "localAuthority",
        "provider": "uk.gov.ons",
        "matchRule": {
          "attribute": "label",
          "pattern": "E090%"
        }
      }
    ]
  }
}

instead of

{
  // The output subjects are all LSOAs
  "provider": "uk.gov.ons",
  "subjectType": "lsoa",
  "matchRule": {
    "attribute": "name",
    "pattern": "E090%"
  }
}

thomiko commented 6 years ago

Unfortunately still no full success:

-----> TASK FAILED: Could not compute Field component:AreaLSOA for Subject E01000001(2480), reason: For input string: "590983,03"<----- Caused by null

java.lang.IllegalArgumentException: Could not compute Field component:AreaLSOA for Subject E01000001(2480), reason: For input string: "590983,03"
    at uk.org.tombolo.exporter.GeoJsonExporter.lambda$getPropertiesForSubject$0(GeoJsonExporter.java:71)

borkurdotnet commented 6 years ago

Interesting ... It could be that the Digital Connector is not German-proof :/ (more debugging is needed to be sure)

I.e. it could be that one part of the system outputs numbers using your localised environment (commas for decimals) while another part reads them back without using the locale (expecting dots for decimals).
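As a sketch of how such a mismatch can arise in Java in general (this is not the Digital Connector's actual code, and the class and method names are made up), formatting a number with a German locale and then parsing it with Double.parseDouble reproduces the same kind of "For input string" failure:

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Illustrative only: shows a locale mismatch between formatting and parsing.
final class LocaleDecimalSketch {

    // Format with two decimals using the given locale's decimal separator.
    static String format(Locale locale, double value) {
        return String.format(locale, "%.2f", value);
    }

    // Parse a number that was written with the given locale's conventions.
    static double parseLocalised(Locale locale, String text) throws ParseException {
        return NumberFormat.getInstance(locale).parse(text).doubleValue();
    }

    public static void main(String[] args) throws ParseException {
        String german = format(Locale.GERMANY, 590983.03); // comma as decimal separator
        String neutral = format(Locale.ROOT, 590983.03);   // dot as decimal separator
        System.out.println(german + " vs " + neutral);

        try {
            // Double.parseDouble always expects a dot, whatever the default locale is.
            Double.parseDouble(german);
        } catch (NumberFormatException e) {
            System.out.println("parseDouble failed: " + e.getMessage());
        }

        // Parsing with the same locale that produced the text works.
        System.out.println(parseLocalised(Locale.GERMANY, german));
    }
}
```

If this is indeed the cause, formatting with an explicit locale such as Locale.ROOT on the output side, or parsing with the same locale on the input side, would avoid the error. A possible workaround to try in the meantime is starting the JVM with an English locale (for example -Duser.language=en), though that is untested here.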

Thanks for hanging in there and trying ... sorry for things not working well.