databrickslabs / mosaic

An extension to the Apache Spark framework that allows easy and fast processing of very large geospatial datasets.
https://databrickslabs.github.io/mosaic/

Mosaic version that supports Databricks DBR14.x #532

Open jonasmw94 opened 6 months ago

jonasmw94 commented 6 months ago

Background

Our team is using Databricks on GCP and we are dependent on UDFs. As you can see here, they are only available on GCP from DBR14.1 and above with Unity Catalog. We would like to utilize Mosaic, but are unable to do so at this moment because of this restriction.

Proposed solution

An updated version of Mosaic that is compatible with DBR14.1+

Question

What is a realistic timeline for an updated Mosaic version for DBR14.1+?

mjohns-databricks commented 6 months ago

I anticipate on/around the end of March. @jonasmw94 we might be able to get you an early version. Please feel free to reach back out. Lots of transition with Unity Catalog!

jonasmw94 commented 6 months ago

That would be great. We are going to have an organization-wide demo for our geospatial specialists in 1.5 weeks. If we could have it by then, I think it would be a great addition. To have GDAL available for this demo would be perfect. I will contact you on LinkedIn 👍

landlord-matt commented 5 months ago

Is there any update on this or the Unity Catalog and Shared Cluster issue?

mjohns-databricks commented 3 months ago

I wanted to circle back on this question - sorry to mislead ^. We ended up with some re-prioritized work that is just finishing up for DBR 13 (for a customer), see #562 mainly, plus a rush release on 0.4.2 due to a geopandas dependency breaking change. We intend to release 0.4.3 very soon for DBR 13, then turn attention fully to DBR 14. Expect that release to be first available in late JUN. In the meantime, are you using our Spatial SQL functions in DBR 14.3 (preview)?

jonasmw94 commented 3 months ago

I understand, no problem at all. For now, we are utilising Apache Sedona for our spatial transformations/queries until Mosaic supports DBR14. Do you have any documentation about which spatial functions are supported natively? Thank you!

marcelhfm commented 2 months ago

Hey there @mjohns-databricks,

are there any updates concerning the release of 0.4.3? Is there a concrete timeline yet?

Thanks

jcz-trackunit commented 1 month ago

@mjohns-databricks we are considering switching from Mosaic to a different solution because of this. Is there any plan to support DBR14 and if so, what is the timeline?

mjohns-databricks commented 1 month ago

@jcz-trackunit and @marcelhfm (and others on this, umm, aging thread)

Here is a status on where things sit. As you may know, DBLabs are field-led initiatives for the benefit of our customers and are different than Databricks Product (PM + Eng side). We had a couple of redirects in DBLabs Mosaic due to pressing customer needs around improving raster support which has kept us from making the switch to DBR 14 (for longer than we anticipated). It has been a rather hefty set of changes (over 10K lines of code affected). The backstory on all of this is best over beers, but in short our intention is to put out this final release in the DBR 13.3 series as Mosaic 0.4.3, which is now very close to ready. The Mosaic 0.4.3 release will be the basis of DBLabs Spatial-Utils v1 which is the rename / follow-on to DBLabs Mosaic, targeting DBR 14.3 for v1 and is far along in planning -- it borrows very heavily from Mosaic. There is a really good chance we update the Mosaic repo in place for the Spatial-Utils v1 release (project name, artifact names, classpaths, and docs will change accordingly). We need to change the name from Mosaic due to our acquisition of MosaicML last year and our new product line called MosaicAI. So DBLabs Mosaic cannot keep the name, unfortunately.

In the upcoming Spatial-Utils v1 release for DBR 14.3 we are really focused on:

  1. Not conflicting with the existing vector product Spatial SQL APIs (there are ~60 in private preview currently for DBR 14.3+), essentially Spatial-Utils will defer to product APIs where possible. If you are thinking of DBLabs Mosaic for SQL ST_ functions (vector), we would point you to product for that more so going forward. Product is advancing towards public preview for Spatial SQL on/around end of 2024. There is more to it, just trying to be concise.
  2. Relatedly, we do not want to conflict with the product's 30+ H3 APIs (GA status for a long time).
  3. Keep and extend use of Unity Catalog for Tables / Views and such.
  4. Introduce Spatial-Utils Volume support in DBR 14.3 (access to scala fuse mounts, etc.).
  5. Keep and extend performance driving features in Databricks Lakehouse such as spatial data engineering and query patterns that use Liquid Clustering - these will augment product-heavy patterns for DLT and DBSQL patterns.
  6. Focus will be on "Assigned Clusters" in DBR 14.3 (not focusing on "Shared Access" or "Serverless" varieties in v1, may get lucky but that will be coincidental).
  7. Not focusing on adjusting function registration for Unity Catalog specifics too much in v1 (registration mostly the same as currently with DBLabs Mosaic).
  8. Not focusing on adjusting the existing Spark framework structure for Spatial-Utils v1; it will mostly be the same APIs, written primarily in Scala (hence Assigned clusters are the smoothest path first).

These are the highlights. Because the first version of Spatial-Utils will be "close" to this last release of DBLabs Mosaic, we don't have "so much" to do as we round out this final Mosaic release I mentioned. I hesitate to state any dates but as the world sits now, pending any adjustments, we are aiming for AUG for DBLabs Mosaic 0.4.3 and SEP for DBLabs Spatial-Utils v1. I am willing to "live" discuss with any of you if you would like more details. Feel free to hit me up on LinkedIn or otherwise if you have my contact. We value our customers and hope to get much better going forward in helping you navigate this transitional period for geospatial on Databricks. It is going to get really exciting as the year further progresses!

landlord-matt commented 1 month ago

Thanks for the update!

Could you expand on the successor library Spatial-Utils vs. the Vector Product Spatial SQL APIs? Are the Vector Product Spatial APIs what the official documentation refers to as the H3 geospatial functions?

You said that Spatial-Utils will defer to the Vector Product Spatial APIs as much as possible, but what will then remain of Spatial-Utils? Is there a clear difference in scope?

landlord-matt commented 1 month ago

If I have understood things correctly, one difference is that Mosaic could in theory be run in a self-managed environment, while the H3 geospatial functions and Spatial-Utils will require Databricks? In our case self-managing was never an option, so that is not that big of an issue.

One aspect of the Databricks libraries (e.g. dbutils) that does not spark joy, and that I assume will extend to Spatial-Utils, is that they mess up Python linting because they are not available locally and cannot be unit tested. Is there a workaround for this?
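
For illustration only (this is not from the thread, and not an official Databricks or Mosaic recommendation): one pattern sometimes used for this is to hide the runtime-only object behind a small interface, so that linters and unit tests only ever see plain Python. The sketch below assumes the job only needs directory listings from `dbutils.fs.ls`; the names `FileLister`, `DbutilsFileLister`, `FakeFileLister`, and `find_geotiffs` are hypothetical and made up for this example.

```python
from typing import Iterable, List, Protocol


class FileLister(Protocol):
    """Hypothetical interface: the only runtime capability this job needs."""

    def list_paths(self, directory: str) -> Iterable[str]:
        ...


class DbutilsFileLister:
    """Adapter for use on a cluster; `dbutils` is passed in by the notebook.

    Nothing Databricks-specific is imported here, so the module can still be
    imported, linted, and type-checked on a local machine.
    """

    def __init__(self, dbutils) -> None:
        self._dbutils = dbutils

    def list_paths(self, directory: str) -> List[str]:
        # dbutils.fs.ls returns FileInfo entries that expose a `.path` attribute.
        return [info.path for info in self._dbutils.fs.ls(directory)]


class FakeFileLister:
    """In-memory stand-in for local unit tests."""

    def __init__(self, paths: Iterable[str]) -> None:
        self._paths = list(paths)

    def list_paths(self, directory: str) -> List[str]:
        return [p for p in self._paths if p.startswith(directory)]


def find_geotiffs(lister: FileLister, directory: str) -> List[str]:
    """Return the GeoTIFF paths under `directory`, sorted, via any FileLister."""
    return sorted(
        p for p in lister.list_paths(directory)
        if p.lower().endswith((".tif", ".tiff"))
    )
```

In a notebook one would call `find_geotiffs(DbutilsFileLister(dbutils), "/some/dir/")`, while a local test passes a `FakeFileLister` instead. Whether the same approach carries over to Spatial-Utils depends on how that library ends up being packaged, which the thread does not say.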

mjohns-databricks commented 4 weeks ago

@landlord-matt - product has 60+ ST_ functions in private preview (very similar to the ones in Mosaic), starting in DBR 14.3. So, essentially the same calls you can accomplish today in Mosaic, you can accomplish in product. You have to be "opted" in to the preview, so no public docs yet. Is that something you are interested in? If so, please hit me up on LinkedIn.

mjohns-databricks commented 4 weeks ago

^ ST_ functions like ST_Buffer which are different than our product H3 APIs. The ST_ functions are in private preview meaning there are not public docs quite yet. That will happen over the 2nd half of 2024.