astronomy-commons / axs

Astronomy eXtensions for Spark: Fast, Scalable, Analytics of Billion+ row catalogs
https://axs.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
23 stars 12 forks source link

Tracking issue for a new AXS release #26

Open ctslater opened 3 years ago

ctslater commented 3 years ago

Spark 3.0 has been out for a while, and that seems like a good chance to collect all the improvements we've had laying around, merge them, and make a new AXS release. I'm opening this as a sort of meta-issue for keeping track of what we think is most important to include. @mjuric @zecevicp @stevenstetzler, let me know in the comments what your thoughts are and I'll try and keep this post updated with a combined list and status. If there's no significant disagreement on these, I'll start fixing and merging them.

Small fixes/improvements:

Larger improvements:

I'm sure there's other things I've forgotten. We'll also have to pick a new version number 🙂

stargaser commented 3 years ago

I believe there was still some variation in the crossmatch radius towards the poles (might have been an email chain instead of a github issue).

That was in DIRAC Slack with a code snippet for AllWISE and Gaia DR2 matched in Galactic coordinates. Very likely related to #22 and #23. This would definitely be good to fix before making a new release.

A new release would be much appreciated! At IPAC I installed the Spark 3.0 preview from source, but went back to the axs-dist release when I encountered an error when running the crossmatching.

stevenstetzler commented 3 years ago

Automated builds with github actions. @stevenstetzler has PR #18 open to work on that; is it ready to go?

Yeah, the automated builds seem to work still. I recently used it to make a new Spark 3 release with an updated version of Scala (I was running into similar issues as @stargaser using Spark 3 with AXS until I updated Scala).

After learning more about how AXS integrates with Spark (axs from this repository is a drop-in jar you can add to any version of Spark, but won't work unless Spark was compiled from axs-spark with our changes), I'd recommend that we split the build up into two build tasks into the two separate repositories we have. i.e. make an axs-spark release for each major version i.e. Spark 2.4, 3.0 and preview versions. And then in this repository, build and version control the AXS jar separately, pairing it with different versions of Spark. Correct me if this won't work, but I think this would let us for example patch the AxsCatalog.crossmatch functionality and pair it with Spark 2.4 and Spark 3 builds from axs-spark, meaning we don't have to tie specific functionality of AXS to specific versions of Spark.

Either way, I think we can tweak and merge that PR and have automatic releases working out of the box.

stargaser commented 3 years ago

I have a fix for #22 in this commit and a fix for #23 in another commit. I started my branch from @stevenstetzler's fork so that I could build from source with the right Scala version and test at IPAC. Should I make a PR to that fork? Or would it be better to cherry-pick the individual commits?

The second fix involves taking the cosine of Declination columns and that might impact the speed of crossmatching.

ctslater commented 3 years ago

Thanks a ton @stargaser, that's super useful! A PR off of Steven's fork sounds perfect. I'm hoping to find some time in the next few days to start working on these.

stargaser commented 3 years ago

Thanks @ctslater, I will make a PR to Steven's fork. The fixes for the poles are not fully working: I made a 2D histogram of Gaia DR2 matched with CatWISE in Galactic coordinates, just around the north celestial pole, and that looked great; but a 2D histogram of the whole sky is not working at all. I'm still debugging it.

stargaser commented 3 years ago

The PR for the fix for the poles is stevenstetzler/axs#1.

ctslater commented 3 years ago

Quick update on the release process: I've made fixes for all of the issues listed in my initial post and have merged them to an integration branch (on both axs and axs-spark), and produced a beta release available at: https://github.com/astronomy-commons/axs/releases/tag/v1.1-beta1. The most important fix was to get the crossmatch radius consistent over all dec, and I believe that the modifications I made to the crossmatch query do so, but I'm very interested in any cases where it still seems to be misbehaving.

The main hassle for anyone trying to switch to this (beta) release is that the default schema version for the metastore has changed, and newer spark versions don't fallback to reading old metastores without changing spark-defaults.conf to tell spark to load the old versions of Hive JARs. That's feasible but annoyingly complicated (it download the necessary jars at runtime), so I recommend avoiding it if possible. I assume there's some way to do schema migration; I will post an update if I manage to do that successfully.

Any additional bug reports are welcome. I'm going to keep testing this for a while and if nothing new pops up, we'll call it a release.

stargaser commented 3 years ago

@ctslater thanks for the updates. We want to test the beta release at IRSA but it looks it will be about a week before we can do that, after the AAS 238 meeting.