MillionConcepts / lhorizon

lhorizon helps you find things in the Solar System.
Other
6 stars 2 forks source link

Add notebook illustrating performance claims #19

Closed malmans2 closed 3 years ago

malmans2 commented 3 years ago

I would add in the documentation a notebook that illustrates the advantages of using lhorizon rather than astroquery and supports the performance improvements discussed in the paper (line 27-43).

https://github.com/openjournals/joss-reviews/issues/3495

m-stclair commented 3 years ago

Thanks for requesting this...there have been some significant backend improvements to astropy tables since I last compared these packages, and I have to walk my claim back to 10x rather than 10-100x. I have revised relevant text to reflect the current situation and avoid unfairly maligning astropy tables. (Also, while benchmarking, I found a couple of parser bugs my tests were not adequately covering, so, great.)

I've added some comparison benchmarks in the 'benchmark' directory -- they of course require bringing in additional dependencies, so I haven't added them to the main examples directory. Also, the use cases where the performance improvements are important involve pretty big chunks of geometry data, Horizons' serverside response time is so variable that you cannot do apples-to-apples comparisons with live queries, and I didn't want to add an additional 100 MB to the repository size, so I've stashed the sample data needed for these benchmarks on Google Drive. (Link is in the benchmark Notebook.)

Also, I would like to make sure that it is clear that, for many users, the reasons to use lhorizon over jplhorizons would be functionality and interface rather than performance. (And conversely, users who don't need lhorizon's features and want tight integration with the astropy ecosystem would probably prefer to use jplhorizons.) I've tried to make these points a little clearer in the paper.

steo85it commented 3 years ago

Love this! Unfortunately the notebook still fails in binder (at least for me, astropy seems to be missing) image

Also, I would suggest to add some assert checks among calls results, to make sure/prove that on top of a better timing, one also gets the same results.

malmans2 commented 3 years ago

Sorry about the delay, I was on leave. I have a couple of suggestions:

m-stclair commented 3 years ago

Love this! Unfortunately the notebook still fails in binder (at least for me, astropy seems to be missing)

Sorry, yes, I hadn't even intended these to work in Binder, because they use (by the standards of the Binder backend) large files and a lot of RAM, and I think it will be difficult to make them consistently not crash Binder. Do you think that adding local installation instructions as @malmans2 suggested is adequate?

Also, I would suggest to add some assert checks among calls results, to make sure/prove that on top of a better timing, one also gets the same results.

Good idea. Added in https://github.com/MillionConcepts/lhorizon/commit/810afb36c83e4137fd4844a3b9d87d8addff2b2f

m-stclair commented 3 years ago
(I'm not 100% sure, but I think in this way GitHub will show the README in the benchmark directory without having to click on it)

This appears to have worked!

 Use step by step instructions in the readme, including terminal commands such as how to grab the files (e.g., wget/curl), how to install the benchmark environment, and how to install lhorizon in the benchmark environment

Added.

Consider storing the benchmark files in a different repository/cloud (I believe they are currently in a google drive). E.g., you could use [git-lfs](https://git-lfs.github.com/) and store the file in this GitHub repo, or you could use zenodo.

Ok, I will think about this. Do you mean that you think that the files should have their own DOI? Or instead that it is awkward, or might be overly fragile, to distribute them via Google Drive in this way? git-lfs is basically just an automated aliasing system for external storage locations, right? So this would involve, for instance, setting up an S3 bucket and pointing the git-lfs server at credentials for that bucket (+ some addressing scheme for that bucket) in order to create versioned pointers from this repository to that bucket?

I think there is a typo in the environment file. ~`memory-profiler`~ should be `memory_profiler` https://github.com/MillionConcepts/lhorizon/blob/eee7be4ee9e14f72de643bd690163cb0640a657c/benchmark/bench-environment.yml#L16

Yes, that's a typo.

I would add jupyter in the environment, as it is consistent with the environment file provided in the root directory

Added.

m-stclair commented 3 years ago

Although the variable names are self-describing, I would make more obvious which cell is benchmarking lhorizon vs jplhorizon (e.g., add markdowns, comments, or prints)

extra markdowns & comments added in 0012dfe

malmans2 commented 3 years ago

I'd personally prefer zenodo because of the DOI or git-lfs because it seems easier to maintain/more robust. But nothing wrong with google drive, so I'll leave it up to you. Regarding git-lfs, you just have to install it, and then set which files to manage through Git LFS. No need to set up an external bucket. See Getting Started.

malmans2 commented 3 years ago

Final minor comments: In point 1: Typo ~Them~ should be Then In point 2: If you assume that the user is in the root directory the yml path should be benchmark/bench-environment.yml. But maybe it's easier to run all commands from /benchmark? This way there's no need to move the unzipped data in point 1, and no need to change directory later in point 4.

malmans2 commented 3 years ago

I'd also mention the benchmark directory in the main readme, so users know about it.

m-stclair commented 3 years ago

I always have trouble believing things like git-lfs are free...in any case, I switched over to Zenodo, because using git-lfs will require users to install it to successfully clone the repository, and it seems easier to simply instruct them to download the files. I have also addressed your other comments. All changes are present by 89b1b3c . Leaving this issue open pending response from @steo85it on Binder.

steo85it commented 3 years ago

These solutions works for me, thanks. Just a minor issue (not sure if this is somehow related to my setup): when running the timing comparison cells, I need the %%time magic command to be alone on a line of code, even the (currently present) comment causes the cell run to fail with a

UsageError: Can't use statement directly after '%%time'!

error. Simply sending the comment to the following line (or removing it) solves this.

m-stclair commented 3 years ago

These solutions works for me, thanks. great!

error. Simply sending the comment to the following line (or removing it) solves this. oh, sorry, that's a bad copy-paste or something. fixed in 4c982dd