canonn-science / CAPIv2-Strapi

Canonn APIv2
https://api.canonn.tech

Create Bodies script for EDSM #88

Closed derrickmehaffy closed 5 years ago

derrickmehaffy commented 6 years ago

Tracking issue for an EDSM Python script that pulls body data, caches it locally, and is scripted for cron updates.

Due to our lack of available JavaScript devs we should build this in Python for now and move to JavaScript later.

Breakdown

So most of the other 3rd parties currently store the body name as systemname bodyname; we differ in that we store the two separately. This will be the biggest challenge, I believe, as you cannot just hit the body table and then query EDSM. You will need to grab the system ID, grab the systemName, and query EDSM using the systemName. Then you will need to join the systemName with the bodyName and search the response for that combined name. There should also be an else-if clause: if you do not find the entry, search the data for the bodyName alone. I think in most cases those two steps will pull the correct data and properly handle special-case systems like Sol, where the body name is a custom-named object such as Mars.
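A minimal sketch of that lookup, assuming EDSM's system bodies endpoint and case-insensitive name matching (the function and matching rules here are illustrative, not final):

import requests

def lookup_body(system_name, body_name):
    # Ask EDSM for every body in the system, then try to match ours.
    r = requests.get(
        "https://www.edsm.net/api-system-v1/bodies",
        params={"systemName": system_name},
        timeout=30,
    )
    bodies = r.json().get("bodies", [])

    # Most 3rd parties store "<systemName> <bodyName>", so try that join first.
    joined = f"{system_name} {body_name}".lower()
    for body in bodies:
        if body.get("name", "").lower() == joined:
            return body

    # Fallback for custom-named objects (e.g. Mars in Sol): match bodyName alone.
    for body in bodies:
        if body.get("name", "").lower() == body_name.lower():
            return body

    return None  # no body data in EDSM yet; the caller should skip it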

Also, as a footnote to the above: it is possible that the body data does not exist, in which case we will need to skip that body. I can add a special boolean column if needed to help with the cron script so we can track how many times it has been skipped (we may need another argument saying don't query if the skip count is greater than some amount, so that the cron script doesn't just keep trying to look up the same missing body over and over; see example arguments below).
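A minimal sketch of that guard; the edsmSkipCount column and max_skips threshold are hypothetical names for the idea above:

def should_query_edsm(body_row, max_skips=3):
    # Skip bodies that EDSM has already failed to return too many times.
    return body_row.get("edsmSkipCount", 0) <= max_skips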

Required arguments

Similar to the systems script, below are some arguments that should be added to allow for ease of use in specific use cases:

Breakdown of our columns vs EDSM

(WIP)

| CAPI | Type | EDSM |
| --- | --- | --- |
| edsmID | float | |
| edsmID64 | float | |
| edsmBodyID | integer | |
| edsmType | string | |
| edsmSubtype | string | |
| edsmOffset | integer | |
| edsmDistanceToArrival | float | |
| edsmIsMainStar | boolean | |
| edsmIsScoopable | boolean | |
| edsmIsLandable | boolean | |
| edsmAge | integer | |
| edsmLuminosity | string | |
| edsmAbsoluteMagnitude | float | |
| edsmSolarMasses | float | |
| edsmSolarRadius | float | |
| edsmGravity | float | |
| edsmEarthMasses | float | |
| edsmRadius | float | |
| edsmSurfaceTemperature | float | |
| edsmSurfacePressure | float | |
| edsmVolcanismType | string | |
| edsmAtmosphereType | string | |
| edsmTerraformingState | string | |
| edsmOrbitalPeriod | float | |
| edsmSemiMajorAxis | float | |
| edsmOrbitalEccentricity | float | |
| edsmOrbitalInclination | float | |
| edsmArgOfPeriapsis | float | |
| edsmRotationalPeriod | float | |
| edsmRotationalPeriodTidallyLocked | boolean | |
| edsmAxialTilt | float | |
| edsmSolidComposition | json | |
| edsmAtmosphere | json | |
| edsmMaterial | json | |
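Once the EDSM column of the table is filled in, the mapping can live in one dict; a small illustrative sketch (the EDSM field names below are assumptions until the table above is finished):

# Illustrative only: EDSM field names are assumptions until the mapping is final.
EDSM_TO_CAPI = {
    "subType": "edsmSubtype",
    "distanceToArrival": "edsmDistanceToArrival",
    "isLandable": "edsmIsLandable",
    "surfaceTemperature": "edsmSurfaceTemperature",
}

def to_capi_body(edsm_body):
    # Translate one EDSM body record into our edsm* column names.
    return {capi: edsm_body.get(edsm) for edsm, capi in EDSM_TO_CAPI.items()}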
NoFoolLikeOne commented 6 years ago

When you query on a system, EDSM gives you data on all bodies in that system. We should then be able to match on name relatively easily by converting the names to a common format. We can use bodyId for matching in the event of a name change, e.g. I had a planet renamed to Garibaldi. Did I mention that before?

derrickmehaffy commented 6 years ago

Yeah, renames are a problem. Anthor keeps a file of known "Special Systems"; not sure if that includes renames.

https://github.com/EDSM-NET/Alias/blob/master/Body/Name.php

derrickmehaffy commented 6 years ago

We may consider parsing this PHP file into a common format (or asking Anthor to provide a JSON file for ease of use), but that could be referenced if needed.

NoFoolLikeOne commented 5 years ago

Started work on it but will end up significantly refactoring.

It will look something like this:

for each system that needs updating:
    get bodies from EDSM
    for each body:
        if body in database: update
        else: insert new body

NoFoolLikeOne commented 5 years ago

I have some doubts about how we are approaching this. My concern is that we will keep hitting EDSM and as our data set grows we will put a heavier load on EDSM.

I think the best way of doing this is to use the EDSM nightly downloads to get only data that has changed in the last 7 days. We would look at this and if it contains any systems that we have in our database then we can update them.

Every system that we store should have at least one body. So if we add a system and it doesn't have a body, then we can fetch the bodies from EDSM at that point.

When we get the Celestial bodies update dump, we can update any existing systems with the latest body data.

If we think we are out of sync, we can either hit up EDSM with individual API calls or get a full body dump.

derrickmehaffy commented 5 years ago

Do we really need a body on your good 4k USS systems? :P

I've considered the dump before, but we need to make Strapi as lightweight as possible: grabbing only the data we need makes it easier and faster to sync, and keeping the size low allows us to run multiple instances in many places.

NoFoolLikeOne commented 5 years ago

Good point about the USS systems; we need a way of excluding systems from body data.

I'm not proposing that we mirror EDSM, just that we get EDSM to tell us what has changed. So we could download the bodies update and not actually update anything at all.

NoFoolLikeOne commented 5 years ago

Here is how we could exclude USS: we maintain a list of models for which we would like to have body data, e.g. bmsites, tgsites, etc., but not USS sites.

For each of these models we can build up a list of systems to check for body data. Next, download the bodies update. If any of our systems are in the bodies update, then we update them.

We can run this process once per day or every few days if you prefer. Most of the time we wouldn't update any sites at all.
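A rough sketch of building that whitelist, assuming hypothetical CAPI endpoints of the form /<model> that return records with a nested system id64 (the real endpoint and field names may differ):

import requests

WATCHED_MODELS = ["bmsites", "tgsites"]  # models that should carry body data

def get_systems(models, base_url="https://api.canonn.tech"):
    # Collect the id64 of every system referenced by the whitelisted models.
    systems = set()
    for model in models:
        r = requests.get(f"{base_url}/{model}", params={"_limit": -1}, timeout=60)
        for record in r.json():
            system = record.get("system") or {}
            if system.get("id64"):
                systems.add(system["id64"])
    return systems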

NoFoolLikeOne commented 5 years ago

It would work something like this.

models = getModelist()
systems = getSystems(models)
r = requests.get(url, stream=True)
for line in r.iter_lines():
    j = json.loads(line)
    if j["systemId64"] in systems:
        updateSystem(j)

Here is a little proof of concept

import requests
import json

# Stream the EDSM "bodies changed in the last 7 days" dump line by line.
url = "https://www.edsm.net/dump/bodies7days.json"
r = requests.get(url, stream=True)

for raw in r.iter_lines():
    line = raw.decode("utf-8")
    # The dump is one big JSON array with one body object per line, so skip
    # the surrounding brackets and strip each line's trailing comma.
    if not line or line in ("[", "]"):
        continue
    try:
        d = json.loads(line.rstrip(","))
        # Only act on systems we care about (example id64s).
        if d["systemId64"] in (224644818084, 626171727272, 828281818282, 828282828):
            print(json.dumps(d, sort_keys=True, indent=4))
    except (ValueError, KeyError):
        print("ERROR")
        print(line)

The only thing we might want to consider is chunking the list of id64s that we are searching. But how much RAM will they take up anyway? Let's say we had 100,000 bodies and each id64 took 64 bytes, allowing for internal Python gubbins; that's only about 6 MB of RAM.
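A quick CPython sanity check of that estimate (rough numbers only):

import sys

ids = {224644818084 + i for i in range(100_000)}  # 100k example id64s in a set
print(sys.getsizeof(ids))              # the set's hash table alone is a few MB
print(sys.getsizeof(224644818084))     # each Python int is roughly 32 bytes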

NoFoolLikeOne commented 5 years ago

So I'm doing two scripts:

body_insert_edsm.py

This will find systems that have no bodies recorded against them and look them up in EDSM. If there are a large number of bodies to look up, this is potentially quite slow. I performed a load using systems from faction kills and hyperdictions, and it was interesting to note that there were two systems that were in EDSM but didn't have primary stars. I flew out to visit them, re-ran the script, and it populated them with the data.

It would be a nice idea to generate a list of systems that have no body and use that as the basis of a patrol so that we can automatically ask people to update EDSM.

A bit of fine tuning still to do on the inserts.
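A rough outline of that flow; the three callables are hypothetical stand-ins for the real CAPI queries and the EDSM lookup:

def insert_missing_bodies(find_systems_without_bodies, fetch_edsm_bodies, insert_body):
    # Walk every system with no recorded bodies and backfill it from EDSM.
    for system in find_systems_without_bodies():
        bodies = fetch_edsm_bodies(system["systemName"])
        if not bodies:
            # Nothing in EDSM yet; these systems could feed the patrol idea above.
            continue
        for body in bodies:
            insert_body(system["id"], body)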

body_update_edsm.py

This will run once per day and will download a list of all the bodies that changed in the last 7 days. It will stream the file and update or insert the data only when the system matches one of the systems we hold. The limitation of this script is that it has to be run at least weekly, and updates are no more frequent than once daily. We can also set a parameter on this that will allow us to use the full dump to do a complete refresh if we think we need to.
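A minimal sketch of that switch; the full-dump URL is an assumption, since only bodies7days.json appears above:

import argparse

parser = argparse.ArgumentParser(description="Update bodies from the EDSM dumps")
parser.add_argument("--full", action="store_true",
                    help="refresh from the full body dump instead of the 7-day delta")
args = parser.parse_args()

# Assumed full-dump location; swap in whatever EDSM actually publishes.
url = ("https://www.edsm.net/dump/bodies.json" if args.full
       else "https://www.edsm.net/dump/bodies7days.json")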

derrickmehaffy commented 5 years ago

This is now being handled by the following Node-based tool: https://github.com/canonn-science/Canonn-EDSM-Updater

Marking this as closed for now