Closed adamreichold closed 2 years ago
One more nice thing about this is that we can manually inspect the raw responses to determine how to properly parse them incrementally:
> zstdcat data/responses/doris-bfs-browse-90
...
I'll approve the PR, but please answer my comments about the fall-back to request the server directly if the file is not found on disk.
Thanks for looking into it! Note that we should not merge this before #51 is reviewed and merged.
This does include a bit of behind-the-scenes work to encapsulate usage of the HTTP client, but this seems reasonable as it ensures that the retry logic is used in any case. As a nice end result, the harvesters themselves do not have to change materially at all.
The only requirement is that they provide a unique key for each request (basically the file name of the stored body on disk). Then we can rerun the harvester on the raw response bodies and change anything about our parsing or translation logic or metadata schema. Of course, we cannot travel in time and this will not work if the actual requests we would have made are changed due to the code changes.
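The keyed store-and-replay idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper name `fetch_or_replay`, its signature, and the closure standing in for the HTTP request are all hypothetical, and compression is left out to keep it dependency-free. It also assumes a fall-back to the server when no stored body exists for a key.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: returns the body for `key`, either replayed from
/// `dir/key` or obtained via `fetch` and then stored under `dir/key`.
fn fetch_or_replay<F>(dir: &Path, key: &str, replay: bool, fetch: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    let path = dir.join(key);

    if replay {
        match fs::read(&path) {
            Ok(body) => return Ok(body),
            // Assumed behaviour: fall back to the server if nothing is stored.
            Err(err) if err.kind() == io::ErrorKind::NotFound => {}
            Err(err) => return Err(err),
        }
    }

    // `fetch` stands in for the real (retrying) HTTP request.
    let body = fetch()?;
    fs::create_dir_all(dir)?;
    fs::write(&path, &body)?;
    Ok(body)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("replay-sketch-{}", std::process::id()));

    // First run: the body comes from the "network" and is stored on disk.
    let first = fetch_or_replay(&dir, "doris-bfs-browse-90", false, || {
        Ok(b"<html>...</html>".to_vec())
    })?;

    // Second run: the body is replayed, so the closure is never called.
    let second = fetch_or_replay(&dir, "doris-bfs-browse-90", true, || {
        Err(io::Error::new(io::ErrorKind::Other, "network must not be used"))
    })?;

    assert_eq!(first, second);
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Because the key alone decides which file is read, changing the parsing code leaves replays intact, while changing which requests are made (and hence which keys are asked for) does not.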
As for storing the responses, they can get large quickly but usually compress well. Since our harvester is, for now, mostly waiting for the network anyway, I added Zstd compression, which reduces the responses from our default harvester configuration from 755M to 48M.
To use this, one just needs to set the `REPLAY_RESPONSES` environment variable when running the harvester.

Below are two examples of how this works, the first one running the harvester against network resources
```console
> RUST_LOG=info DATA_PATH=data time ./target/release/harvester
2022-08-12T16:30:09.192182Z INFO harvester: Harvesting 6 sources
2022-08-12T16:30:09.791575Z INFO harvest{source=Source { name: "stadt-leipzig", type: Ckan, url: "https://opendata.leipzig.de/", filter: None, source_url: Some("https://opendata.leipzig.de/dataset/{{name}}"), concurrency: 3, batch_size: 100 }}: umwelt_info::harvester::ckan: Harvesting 723 datasets
2022-08-12T16:30:09.849769Z INFO harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}: umwelt_info::harvester::doris_bfs: Harvesting 507 datasets
2022-08-12T16:30:10.339620Z INFO harvest{source=Source { name: "uba-gdi", type: Csw, url: "https://gis.uba.de/smartfinder-csw/api", filter: None, source_url: Some("https://gis.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"), concurrency: 1, batch_size: 10 }}: umwelt_info::harvester::csw: Harvesting 180 datasets
2022-08-12T16:30:10.848271Z INFO harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Retrieved 713 documents
2022-08-12T16:30:10.854728Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4397 has no valid entry for 'NAME'
2022-08-12T16:30:10.855552Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4467 has no valid entry for 'NAME'
2022-08-12T16:30:10.857523Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4651 has no valid entry for 'NAME'
2022-08-12T16:30:10.859063Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: harvester: Failed to harvest 3 out of 713 datasets (713 were transmitted)
2022-08-12T16:30:11.822389Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=60}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2021010424644
2022-08-12T16:30:12.579735Z INFO harvest{source=Source { name: "govdata", type: Ckan, url: "https://www.govdata.de/ckan/", filter: None, source_url: Some("https://www.govdata.de/web/guest/suchen/-/details/{{name}}"), concurrency: 5, batch_size: 1000 }}: umwelt_info::harvester::ckan: Harvesting 61432 datasets
2022-08-12T16:30:13.756080Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=220}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2014111011874
2022-08-12T16:30:16.840931Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=380}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009082154
2022-08-12T16:30:17.434641Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=400}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009042313
2022-08-12T16:30:17.497272Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009011228
2022-08-12T16:30:17.962662Z INFO harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}: umwelt_info::harvester::geo_network_q: Harvesting 5315 datasets
2022-08-12T16:30:18.224830Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=470}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201004061230
2022-08-12T16:30:18.833466Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=490}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201006222423
2022-08-12T16:32:50.535827Z WARN harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}:fetch_datasets{summary=false from=4801 to=4900}: umwelt_info::harvester: Overwriting duplicate dataset 094bb2e5-c6fb-451a-bcfd-b52629a7e2ff
12.60user 6.09system 2:50.81elapsed 10%CPU (0avgtext+0avgdata 297584maxresident)k
0inputs+649344outputs (0major+40856minor)pagefaults 0swaps
```

and the second one using the stored responses from disk
```console
> RUST_LOG=info DATA_PATH=data REPLAY_RESPONSES= time ./target/release/harvester
2022-08-12T16:33:30.702428Z INFO harvester: Harvesting 6 sources
2022-08-12T16:33:30.705569Z INFO harvest{source=Source { name: "uba-gdi", type: Csw, url: "https://gis.uba.de/smartfinder-csw/api", filter: None, source_url: Some("https://gis.uba.de/smartfinder-client/?lang=de#/datasets/iso/{{id}}"), concurrency: 1, batch_size: 10 }}: umwelt_info::harvester::csw: Harvesting 180 datasets
2022-08-12T16:33:30.706461Z INFO harvest{source=Source { name: "stadt-leipzig", type: Ckan, url: "https://opendata.leipzig.de/", filter: None, source_url: Some("https://opendata.leipzig.de/dataset/{{name}}"), concurrency: 3, batch_size: 100 }}: umwelt_info::harvester::ckan: Harvesting 723 datasets
2022-08-12T16:33:30.709723Z INFO harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Retrieved 713 documents
2022-08-12T16:33:30.713670Z INFO harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}: umwelt_info::harvester::doris_bfs: Harvesting 507 datasets
2022-08-12T16:33:30.716274Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4397 has no valid entry for 'NAME'
2022-08-12T16:33:30.716831Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4467 has no valid entry for 'NAME'
2022-08-12T16:33:30.718004Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: umwelt_info::harvester::wasser_de: Document 4651 has no valid entry for 'NAME'
2022-08-12T16:33:30.718910Z ERROR harvest{source=Source { name: "wasser-de", type: WasserDe, url: "https://www.wasser-de.de/", filter: None, source_url: None, concurrency: 1, batch_size: 100 }}: harvester: Failed to harvest 3 out of 713 datasets (713 were transmitted)
2022-08-12T16:33:30.733913Z INFO harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}: umwelt_info::harvester::geo_network_q: Harvesting 5315 datasets
2022-08-12T16:33:30.743831Z INFO harvest{source=Source { name: "govdata", type: Ckan, url: "https://www.govdata.de/ckan/", filter: None, source_url: Some("https://www.govdata.de/web/guest/suchen/-/details/{{name}}"), concurrency: 5, batch_size: 1000 }}: umwelt_info::harvester::ckan: Harvesting 61432 datasets
2022-08-12T16:33:30.771718Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=60}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2021010424644
2022-08-12T16:33:30.845551Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=220}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2014111011874
2022-08-12T16:33:30.920961Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=380}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009082154
2022-08-12T16:33:30.925359Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009042313
2022-08-12T16:33:30.942785Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=410}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-2009011228
2022-08-12T16:33:30.964311Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=460}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201004061230
2022-08-12T16:33:30.965476Z WARN harvest{source=Source { name: "doris-bfs", type: DorisBfs, url: "https://doris.bfs.de/", filter: None, source_url: None, concurrency: 5, batch_size: 10 }}:fetch_datasets{rpp=10 offset=490}: umwelt_info::harvester: Overwriting duplicate dataset urn:nbn:de:0221-201006222423
2022-08-12T16:33:31.921309Z WARN harvest{source=Source { name: "geodatenkatalog", type: GeoNetworkQ, url: "http://gdk.gdi-de.org/gdi-de/srv/ger/q", filter: Some("environment"), source_url: Some("http://gdk.gdi-de.org/gdi-de/srv/ger/catalog.search#/metadata/{{id}}"), concurrency: 5, batch_size: 100 }}:fetch_datasets{summary=false from=4801 to=4900}: umwelt_info::harvester: Overwriting duplicate dataset 094bb2e5-c6fb-451a-bcfd-b52629a7e2ff
2.96user 1.68system 0:02.42elapsed 192%CPU (0avgtext+0avgdata 213888maxresident)k
0inputs+553088outputs (0major+50395minor)pagefaults 0swaps
```

Notice how not only is the wall time much lower in the second case, but the CPU utilization is also much higher, as the harvester does not have to wait for the network to respond.
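To put numbers on the comparison, the following small sketch recomputes the speed-up and CPU utilization from the two `time` summaries above. The figures are copied from those summaries; the `Run` struct and its method are only illustrative names.

```rust
/// Figures from one `time` summary, all in seconds.
struct Run {
    user: f64,
    system: f64,
    elapsed: f64,
}

impl Run {
    /// CPU time spent per second of wall time (1.0 means one core fully busy).
    fn cpu_utilization(&self) -> f64 {
        (self.user + self.system) / self.elapsed
    }
}

fn main() {
    // 2:50.81 elapsed is 170.81 seconds.
    let network = Run { user: 12.60, system: 6.09, elapsed: 170.81 };
    let replay = Run { user: 2.96, system: 1.68, elapsed: 2.42 };

    println!("wall-time speed-up: {:.1}x", network.elapsed / replay.elapsed);
    println!("CPU utilization (network): {:.0}%", 100.0 * network.cpu_utilization());
    println!("CPU utilization (replay):  {:.0}%", 100.0 * replay.cpu_utilization());
}
```

This works out to a roughly 70-fold reduction in wall time, with utilization in line with the `10%CPU` and `192%CPU` that `time` itself reports (small differences come from the rounded inputs).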
Having an edit-compile-harvest loop of a few seconds should also be very helpful when developing new harvesters or improving our metadata schema and mapping logic.
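Note that the replay run above sets `REPLAY_RESPONSES=` to an empty value, which suggests that only the presence of the variable matters. A check along those lines could look like the sketch below. The function name is hypothetical and the presence-only semantics are an assumption drawn from that invocation, not taken from the PR's code.

```rust
use std::env;
use std::ffi::OsStr;

/// Hypothetical check: replay is requested whenever the variable is set,
/// even to an empty string (assumed from the `REPLAY_RESPONSES=` invocation).
fn replay_requested(value: Option<&OsStr>) -> bool {
    value.is_some()
}

fn main() {
    // `var_os` returns `Some` for a set-but-empty variable as well.
    let value = env::var_os("REPLAY_RESPONSES");

    if replay_requested(value.as_deref()) {
        println!("replaying stored responses from disk");
    } else {
        println!("harvesting against network resources");
    }
}
```

Taking the looked-up value as a parameter keeps the decision itself free of global state, so it can be exercised without modifying the process environment.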
Closes #56