Crawling takes too long for large (>100 datasets) catalogs

eWaterCycle / jupyterlab_thredds

JupyterLab dataset browser for THREDDS catalog

Apache License 2.0


Crawling takes too long for large (>100 datasets) catalogs #23

Closed by evertrol 5 years ago

evertrol commented 5 years ago

Catalog URLs that I have attempted (with some timing estimates):

# datasets	time (seconds)	URL
1471	-	https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/livneh/catalog.xml
371	469	https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/livneh/metvars/catalog.xml (subset of previous catalog)
456	3051	https://thredds.daac.ornl.gov/thredds/catalog/ornldaac/1345/catalog.xml
18	4.5	https://data.ioos.us/gliders/thredds/catalog/deployments/aoml/catalog.xml
9	0.85	http://tds.maracoos.org/thredds/REALTIME-MODIS.xml
5	0.65	http://www.neracoos.org/thredds/catalog/Tempests/Buoys/MA101/catalog.xml

The timings clearly do not scale linearly with the number of datasets, presumably because of the extra queries for the WMS layers and differences in server response speed.
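To make the non-linearity concrete, here is a quick sketch that divides each measured time by the number of datasets. The numbers are copied from the table above; the short catalog labels and the helper name `seconds_per_dataset` are just made up for this illustration.

```python
# (datasets, seconds) pairs copied from the table above. The 1471-dataset
# catalog is left out because no time was recorded for it.
timings = {
    "livneh/metvars": (371, 469.0),
    "ornldaac/1345": (456, 3051.0),
    "gliders/aoml": (18, 4.5),
    "REALTIME-MODIS": (9, 0.85),
    "Buoys/MA101": (5, 0.65),
}


def seconds_per_dataset(table):
    """Return the average crawl time per dataset for each catalog."""
    return {name: secs / count for name, (count, secs) in table.items()}


per_dataset = seconds_per_dataset(timings)
for name, cost in per_dataset.items():
    print(f"{name:16s} {cost:6.2f} s/dataset")
```

The per-dataset cost differs by well over an order of magnitude between the cheapest and the most expensive catalog, so the dataset count alone does not predict the crawl time.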

sverhoeven commented 5 years ago

Until cancelling (#18) has been implemented, you have to restart Jupyter to stop the crawling.

evertrol commented 5 years ago

Could be good to query for the WMS layer(s) in a dataset only when inserting a new notebook cell, and raise an exception if no (appropriate) layer(s) could be found.
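A rough sketch of what I mean, not the extension's actual code: look up the layers only at the moment a cell is inserted, and fail loudly if there are none. `NoWMSLayerError`, `wms_layer_names`, `layers_for_new_cell` and the `fetch_capabilities` callable are all hypothetical names; the only assumption about the server is that it returns a standard WMS GetCapabilities document with `<Layer><Name>` elements.

```python
import xml.etree.ElementTree as ET


class NoWMSLayerError(LookupError):
    """Raised when a dataset exposes no usable WMS layer (hypothetical)."""


def _local(tag):
    # Drop the "{namespace}" prefix that ElementTree puts in front of tags.
    return tag.rsplit("}", 1)[-1]


def wms_layer_names(capabilities_xml):
    """Return the names of all named <Layer> elements in a capabilities document."""
    root = ET.fromstring(capabilities_xml)
    names = []
    for elem in root.iter():
        if _local(elem.tag) != "Layer":
            continue
        # Only direct <Name> children count; container layers often have none.
        for child in elem:
            if _local(child.tag) == "Name" and child.text:
                names.append(child.text.strip())
    return names


def layers_for_new_cell(dataset_name, fetch_capabilities):
    """Query the WMS layers lazily, at cell-insertion time.

    fetch_capabilities is an assumed callable that takes a dataset name and
    returns the GetCapabilities XML as a string.
    """
    names = wms_layer_names(fetch_capabilities(dataset_name))
    if not names:
        raise NoWMSLayerError(f"no WMS layers found for {dataset_name!r}")
    return names
```

This way the crawl itself needs no per-dataset WMS request, and the cost is only paid for the one dataset the user actually picks.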

sverhoeven commented 5 years ago

Switched to an async crawler and measured the HTTP response time in devtools.

# datasets	time (seconds)	URL
1471	3.6	https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/livneh/catalog.xml
371	1.69	https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/livneh/metvars/catalog.xml (subset of previous catalog)
456	23	https://thredds.daac.ornl.gov/thredds/catalog/ornldaac/1345/catalog.xml
18	1.0	https://data.ioos.us/gliders/thredds/catalog/deployments/aoml/catalog.xml
9	0.2	http://tds.maracoos.org/thredds/REALTIME-MODIS.xml (no services recognized by siphon)
5	0.41	http://www.neracoos.org/thredds/catalog/Tempests/Buoys/MA101/catalog.xml
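For readers wondering what an async crawl looks like in general, below is a minimal sketch using only `asyncio`. It is not the code from the commit. The `crawl` function, the `fetch_refs` coroutine (assumed to return the dataset names and child catalog URLs for one catalog) and the in-memory `FAKE_CATALOGS` tree are all hypothetical and stand in for real HTTP requests.

```python
import asyncio


async def crawl(root_url, fetch_refs, max_concurrent=10):
    """Collect dataset names from a catalog tree, fetching catalogs concurrently.

    fetch_refs(url) is an assumed coroutine returning a tuple
    (dataset_names, child_catalog_urls) for the catalog at url.
    """
    limit = asyncio.Semaphore(max_concurrent)  # cap simultaneous requests
    seen = set()
    datasets = []

    async def visit(url):
        if url in seen:  # a catalog may be referenced more than once
            return
        seen.add(url)
        async with limit:
            names, children = await fetch_refs(url)
        datasets.extend(names)
        # Descend into all child catalogs at the same time.
        await asyncio.gather(*(visit(child) for child in children))

    await visit(root_url)
    return datasets


# Tiny in-memory stand-in for a THREDDS catalog tree (made up for the demo).
FAKE_CATALOGS = {
    "root": ([], ["a", "b"]),
    "a": (["a1", "a2"], []),
    "b": (["b1"], ["a"]),
}


async def fake_fetch(url):
    await asyncio.sleep(0.01)  # simulate network latency
    return FAKE_CATALOGS[url]


found = asyncio.run(crawl("root", fake_fetch))
print(sorted(found))
```

Because sibling catalogs are awaited together, the total time is governed by the depth of the tree and the slowest responses rather than by the sum of all requests, which would be consistent with the much lower numbers in the table above.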

evertrol commented 5 years ago

Those are indeed times of the order that I've seen when using an async/mp crawler stand-alone. But I assume this new table lists times without queries for WMS layers, as per the linked commit? All in all, these new timings make the extension much more usable. Probably can close this issue.

sverhoeven commented 5 years ago

Yes, the WMS layer query has been moved to the notebook cell.