Prep work to refactor the ARCO data source into a form better suited to Apache Beam pipelines, plus async IO to speed up fetches. Higher-level design decisions made:
The atomic function used inside data sources is now fetch_array(time, variable) -> np.ndarray. This allows the single call to be integrated into pipelines other than the asyncio one. It also encourages building the xarray object elsewhere.
The number of concurrent async IO tasks is capped by the property async_process_limit, which prevents people from hammering the API with hundreds of requests at once. I used some code I found online, and tested, to act as a sort of unordered map. I intentionally left this out of the constructor: I don't want it too exposed to users, because I don't want people increasing it a lot. It also keeps the constructor clean.
The async code lives largely in utils.py. I think multiple data sources can reuse it easily.
There is a global timeout for the entire data request, controlled by async_timeout. It's presently 100 s. As with the previous parameter, I didn't put this in the constructor, to steer people away from using it to fetch massive in-memory requests; make multiple data arrays instead.
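A rough sketch of what a single deadline over the whole request could look like, assuming `asyncio.wait_for` wraps the full batch. `fetch_all` and `fetch_with_timeout` are hypothetical names for illustration and the requests are simulated with sleeps; the actual implementation in this PR may differ.

```python
import asyncio


async def fetch_all(delays: list[float]) -> list[float]:
    """Stand-in for a batch of remote reads; each 'request' just sleeps."""

    async def one(delay: float) -> float:
        await asyncio.sleep(delay)
        return delay

    return await asyncio.gather(*(one(d) for d in delays))


def fetch_with_timeout(delays: list[float], timeout: float) -> list[float]:
    """Run the whole batch under one deadline (hypothetical async_timeout).

    Raises asyncio.TimeoutError if the batch as a whole overruns; pending
    requests are cancelled by wait_for when that happens.
    """

    async def main() -> list[float]:
        return await asyncio.wait_for(fetch_all(delays), timeout=timeout)

    return asyncio.run(main())


if __name__ == "__main__":
    print(fetch_with_timeout([0.01, 0.02], timeout=1.0))
    try:
        fetch_with_timeout([0.5], timeout=0.05)
    except asyncio.TimeoutError:
        print("request exceeded the global timeout")
```

The deadline applies to the request as a whole rather than to each read, so a larger request is more likely to hit it; splitting the work into several smaller requests keeps each one under the limit.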
Also:
Removed tp06 from ARCO since it's not officially in the dataset; use WB2 instead. Having tp06 in ARCO was a temporary fix to support the tp06 models.
Earth2Studio Pull Request