Oxen-AI / Oxen

Oxen.ai's core rust library, server, and CLI
https://oxen.ai
Apache License 2.0

Custom datasets #393

Open ChristopherRabotin opened 3 weeks ago

ChristopherRabotin commented 3 weeks ago

Hi there,

I just learned about Oxen, and it looks very promising. I have a bit of an unusual use case and I'm trying to figure out whether Oxen could be the appropriate solution here.

I deal with planetary ephemeris files, which store, in binary format, the trajectories of planets and spacecraft over possibly hundreds of years. This format was originally created by JPL in the 80s -- the specs are here: https://naif.jpl.nasa.gov/pub/naif/toolkit_docs/C/req/daf.html . The main library that reads these files, SPICE, is maintained by NASA itself (through the NAIF division) ... but I've rewritten it in full in Rust (because the original code is FORTRAN transliterated into C and absolutely not thread safe). This rewrite is called ANISE -- https://github.com/nyx-space/anise.

Every year, NASA releases a new and improved prediction of where the planets will be in the future -- https://naif.jpl.nasa.gov/pub/naif/generic_kernels/spk/planets/. Every day, NASA also releases the Earth orientation parameters, which specify how the Earth is actually aligned with respect to the stars (crazily enough, we can't predict that very well) -- https://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/ (specifically the earth_latest_high_prec.bpc file). These files aren't typically big, ranging from single-digit MB to ~100s of MB.

In spacecraft operations, we need to ensure that the whole team of flight dynamics engineers uses either the latest data (for some computations) or a specific agreed-upon version of these data. The way I've solved this in ANISE is with a "MetaFile" structure, which is pretty simple: it stores the URL to the file and optionally its CRC32, so that the file is redownloaded if the CRC is unspecified or if it does not match the local copy (config file example: https://github.com/nyx-space/anise/blob/master/data/latest.dhall ; basic docs: https://docs.rs/anise/latest/anise/almanac/metaload/struct.MetaFile.html ). Another related use case is that we need to publish new datasets, namely a new ephemeris file whenever we compute a new trajectory. In this case, consumers of this data in other teams need to be sure that they're using the latest version of all our data before starting whatever work they're up to. At the moment, we use Kedro to organize our workflows and also to version our data on AWS S3. It works perfectly fine, but Oxen's visualization of datasets is very appealing (especially as most people in this industry are not tech savvy).
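To make the MetaFile idea concrete, here is a minimal Rust sketch of the concept (a URI plus an optional CRC32 that forces a re-download on mismatch). The names and method below are illustrative only, not ANISE's actual API:

```rust
/// Illustrative sketch of the MetaFile concept: a remote URI plus an
/// optional CRC32 of the expected file contents. Not ANISE's actual API.
struct MetaFile {
    /// Where the file lives (e.g. an Oxen download URL or an S3 link).
    uri: String,
    /// Expected CRC32 of the file; `None` means "always re-download".
    crc32: Option<u32>,
}

impl MetaFile {
    /// Decide whether the local copy must be (re)fetched, given the CRC32
    /// computed over the bytes currently on disk (if any).
    fn needs_download(&self, local_crc32: Option<u32>) -> bool {
        match (self.crc32, local_crc32) {
            // No local copy yet: always fetch.
            (_, None) => true,
            // No expected CRC recorded: fetch to be sure we have the latest.
            (None, Some(_)) => true,
            // Both known: fetch only on mismatch.
            (Some(expected), Some(actual)) => expected != actual,
        }
    }
}

fn main() {
    let eop = MetaFile {
        uri: "https://naif.jpl.nasa.gov/pub/naif/generic_kernels/pck/earth_latest_high_prec.bpc".into(),
        crc32: None, // daily-updated file: no CRC pinned, so always re-fetch
    };
    if eop.needs_download(Some(0xDEAD_BEEF)) {
        println!("re-downloading {}", eop.uri);
    }
}
```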

In other words, versioning of datasets is crucial. Oxen solves dataset versioning and provides visualizations, but it doesn't support NASA's DAF format (and it's such a niche format that it probably shouldn't). Hence my questions:

  1. Is it possible to have extensions to Oxen so that I could upload two ephemeris files and visualize the difference between them in the Oxen web UI (even better if I could plot specific things with them, but that's probably a huge stretch)?
  2. ~Is it possible to upload arbitrary blobs of data to Oxen?~ (This already works!) If so, I could use ANISE to build a "companion" delivery for each new ephemeris file we publish, and users could diff the companion version. Interestingly, Oxen tries to parse these files as text.
  3. ~Is it possible to download these datasets with a generic URL that includes the version, e.g. similar to how AWS S3 can have a unique link? In my experience, it's hard enough to convince the IT teams in my industry to install updates to Python, so I can't imagine the trials and tribulations of convincing them to install a new binary on operational machines, but if all that's needed is curl/wget with a token, that would be fine (especially if it works against a local deployment of Oxen (which you could/should charge for, in my view)).~ (This already works too!)
  4. Would it be possible to separate the Oxen crate into a workspace so that the CLI and lib don't depend on actix, since that's a server requirement? (I'd be more than happy to work on that myself.)

Those are all my questions for now. Again, this is a very exciting project, so I'll be keeping a close eye on it regardless.

Thanks

gschoeni commented 3 weeks ago

Hey Christopher! Thanks for your super detailed post.

1) Data Viz

I actually think the ability to have extensions for custom visualization would be super interesting - and useful in a bunch of industries.

What sort of libraries do you typically use to visualize? Our front end is written in JS and React. We've been thinking about how we could make the renderers more configurable based on the data types on the backend.

2) File types and diff

That's interesting. We sniff the first few bytes of each file to detect whether it is UTF-8 or not, so some of those files may be false positives. Custom diffs would be interesting too; happy to chat more about that.
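In essence the sniffing boils down to something like the following simplified sketch (not our exact implementation), which shows why a binary file with an ASCII header slips through:

```rust
/// Rough sketch of header sniffing (simplified, not the exact Oxen code):
/// treat a file as text if its first few bytes form valid UTF-8. A DAF/SPK
/// file starts with an ASCII ID word such as "DAF/SPK ", so it passes this
/// check even though the bulk of the file is binary -- a false positive.
fn looks_like_text(header: &[u8]) -> bool {
    std::str::from_utf8(header).is_ok()
}

fn main() {
    // First bytes of a .bsp ephemeris: ASCII ID word -> false positive.
    assert!(looks_like_text(b"DAF/SPK "));
    // A PNG header fails immediately, so it is correctly treated as binary.
    assert!(!looks_like_text(&[0x89, b'P', b'N', b'G']));
}
```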

3) Generic URLs

Glad you found the download URL (let us know if it could be documented better). We are also working on direct S3 integrations if that is of interest as well.

4) Oxen Crates

Currently the lib and CLI crates shouldn't depend on Actix; I'd have to double-check the Cargo.toml files, but there's a separate one per module. I actually think it would be awesome if the liboxen crate could eventually be used as a library in other tools. We are working on cleaning up the APIs there as we speak.
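For reference, the kind of split being discussed is a plain Cargo workspace along these lines; the member names below are illustrative only, not necessarily the exact crates in the repo:

```toml
# Illustrative workspace sketch -- member names are hypothetical and may not
# match the actual Oxen repository layout.
[workspace]
members = [
    "liboxen",      # core library: no actix dependency
    "oxen-cli",     # CLI: depends on liboxen only
    "oxen-server",  # server: the only member pulling in actix-web
]
```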

ChristopherRabotin commented 3 weeks ago

Hey Greg,

  1. Concerning data viz, I typically use Plotly from Python because I haven't coded in JavaScript in a decade, and I usually export computations or raw data to a Parquet file.
  2. The first nine bytes are ASCII, which would explain why the platform thinks the file is text-based. To my surprise, `file` magic gets it right: `de440s.bsp: NASA SPICE file (binary format)`. (A minimal sketch of a DAF check is below, after this list.)
  3. I was too eager and didn't read the documentation in full, sorry about that.
  4. I'll have a closer look; my eagerness probably got the best of me here too.
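In case it helps, every DAF file opens with an 8-character ASCII ID word (e.g. `DAF/SPK` for ephemerides, `DAF/PCK` for binary orientation kernels), so a content check could key off that prefix before falling back to a UTF-8 sniff. A minimal sketch, purely illustrative and obviously not Oxen's code:

```rust
/// Sketch only: classify a file as a NASA DAF kernel by its 8-character
/// ID word (e.g. "DAF/SPK " or "DAF/PCK ") before falling back to a
/// UTF-8 sniff.
fn is_daf(header: &[u8]) -> bool {
    header.len() >= 8 && header.starts_with(b"DAF/")
}

fn main() {
    // SPK ephemeris: ASCII ID word, then binary data.
    assert!(is_daf(b"DAF/SPK \x02\x00\x00\x00"));
    // Ordinary JSON has no DAF ID word, so it stays classified as text.
    assert!(!is_daf(b"{\"key\": \"value\"}"));
}
```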

Cheers