aliss / ALISS

ALISS (A Local Information System for Scotland) is a service to help you find help and support close to you when you need it most.
https://aliss.org

Different results for locations, organisations, categories on each request #126

Closed · mig5 closed 2 years ago

mig5 commented 2 years ago

Hi,

I am trying to use your 'http-import' API route to fetch all services.

I am then trying to build separate lists of unique locations, organisations, and categories, since these do not appear to have endpoints of their own (e.g. many services reference the same organisation or location).

Here is my example script, which:

1) fetches all the services
2) sorts them by id
3) iterates over all the services to collect any 'locations'
4) appends any location that doesn't already exist in the outer 'locations' list
5) sorts the locations by id

#!/usr/bin/env python3

import requests

from pprint import pprint

def main():
    # Standard header for API calls
    headers = {"Content-Type": "application/json"}
    # URL of the ALISS import API
    url = "https://www.aliss.org"

    print(f"Fetching the first page of {url}")
    r = requests.get(f"{url}/api/v4/import/", headers=headers)
    aliss_data = r.json()["data"]
    while r.json()["next"]:
        next_url = r.json()["next"]
        if url not in next_url:
            next_url = url + next_url
        print(f"Fetching the next page {next_url}")
        try:
            r = requests.get(next_url, headers=headers)
            r.raise_for_status()
            aliss_data.extend(r.json()["data"])
        except requests.exceptions.HTTPError as err:
            print(err)
            break

    # Sort Services by ID
    aliss_data.sort(key=lambda x: x['id'], reverse=False)

    # Locations
    locations = []
    for item in aliss_data:
        if item["locations"]:
            for loc in item["locations"]:
                if loc not in locations:
                    locations.append(loc)

    # Sort Locations by ID
    locations.sort(key=lambda x: x['id'], reverse=False)
    pprint(locations)

if __name__ == "__main__":
    main()

What is odd is that although I get the same number of services every time (5560), I can run the same script several times in succession and get different results each time:

root@stage01:~# ./aliss_fetch_locations > loc1.txt
root@stage01:~# ./aliss_fetch_locations > loc2.txt
root@stage01:~# ./aliss_fetch_locations > loc3.txt
root@stage01:~# grep "'id':" loc1.txt | wc -l
3217
root@stage01:~# grep "'id':" loc2.txt | wc -l
3330
root@stage01:~# grep "'id':" loc3.txt | wc -l
3280

# This shows the count of values only in the first list but not in the second
root@stage01:~# comm -23 loc1-ids.txt loc2-ids.txt  | wc -l
205
# This shows the count of values only in the second list but not in the first
root@stage01:~# comm -13 loc1-ids.txt loc2-ids.txt  | wc -l
318
# This shows the count of values common to both files
root@stage01:~# comm -12 loc1-ids.txt loc2-ids.txt  | wc -l
3012
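
For reference, the same three counts can be computed in Python with set operations. A sketch, where locations_run1 and locations_run2 are hypothetical names for the location lists produced by two separate runs:

# Sketch: locations_run1/locations_run2 stand for the 'locations'
# lists collected by two separate runs of the script above
ids1 = {loc["id"] for loc in locations_run1}
ids2 = {loc["id"] for loc in locations_run2}

print(len(ids1 - ids2))  # only in the first run  (comm -23)
print(len(ids2 - ids1))  # only in the second run (comm -13)
print(len(ids1 & ids2))  # common to both runs    (comm -12)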

Any idea what is going on here?

A slightly modified version of the script just counts the number of locations (it doesn't skip locations that already exist in the list):

    # Sort Services by ID
    aliss_data.sort(key=lambda x: x['id'], reverse=False)

    locations = []

    for item in aliss_data:
        if item.get("locations"):
            for loc in item["locations"]:
                locations.append(loc)

    locations_count = len(locations)
    print(f"Num of locations in list is {locations_count}")

root@stage01:~# ./aliss_fetch_locations
Num of locations in list is 5333
root@stage01:~# ./aliss_fetch_locations 
Num of locations in list is 5299
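
Counting unique ids separately from total references makes the gap easier to see; a small sketch along the same lines, reusing aliss_data from the script above:

# Sketch: compare total location references with unique location ids,
# using the aliss_data list fetched by the script above
all_locs = [loc for item in aliss_data for loc in item.get("locations") or []]
unique_ids = {loc["id"] for loc in all_locs}
print(f"Total location references: {len(all_locs)}, unique ids: {len(unique_ids)}")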

I am getting similar issues with the 'categories' and the 'organisations' lists within each Service.

As far as I can tell, there's nothing I can do about this; the data returned from your API is simply inconsistent each time.
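
At best, deduplicating by id after fetching makes repeated runs comparable, although it can't recover services the API never returned. A sketch, applied to aliss_data from the scripts above:

def dedupe_by_id(items):
    # Keep the first occurrence of each id; later duplicates
    # (e.g. the same record appearing on two pages) are dropped
    seen = set()
    unique = []
    for item in items:
        if item["id"] not in seen:
            seen.add(item["id"])
            unique.append(item)
    return unique

aliss_data = dedupe_by_id(aliss_data)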

Appreciate any help!

mig5 commented 2 years ago

Here's another example, this time with categories:

#!/usr/bin/env python3

import requests

from pprint import pprint

def main():
    # Standard header for API calls
    headers = {"Content-Type": "application/json"}
    # URL of the ALISS import API
    url = "https://www.aliss.org"

    r = requests.get(f"{url}/api/v4/import/", headers=headers)
    aliss_data = r.json()["data"]
    while r.json()["next"]:
        next_url = r.json()["next"]
        if url not in next_url:
            next_url = url + next_url
        try:
            r = requests.get(next_url, headers=headers)
            r.raise_for_status()
            aliss_data.extend(r.json()["data"])
        except requests.exceptions.HTTPError as err:
            print(err)
            break

    # Sort Services by ID
    aliss_data.sort(key=lambda x: x['id'], reverse=False)

    for item in aliss_data:
        if item.get("categories"):
            for cat in item["categories"]:
                print(f"Item {item['id']} has category {cat['slug']}")
        else:
            print(f"Item {item['id']} has no categories")
            pprint(item)

root@stage01:~# ./aliss_fetch_categories > categories-1.txt
root@stage01:~# ./aliss_fetch_categories > categories-2.txt

If I grep for service 0014b87f-e3b7-49c2-b857-40eb3383e33a in categories-1.txt I get:

root@stage01:~# grep 0014b87f-e3b7-49c2-b857-40eb3383e33a categories-1.txt
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category social-activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category children-families
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category parent-toddler-group
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category social-activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category children-families
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category parent-toddler-group

If I grep for it in categories-2.txt I get half as many entries (side note: 'activity' always seems to be duplicated...):

root@stage01:~# grep 0014b87f-e3b7-49c2-b857-40eb3383e33a categories-2.txt 
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category social-activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category activity
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category children-families
Item 0014b87f-e3b7-49c2-b857-40eb3383e33a has category parent-toddler-group

If I grep for ffa6216b-4274-4826-910b-be342b51f262 in categories-1.txt I get double the results:

root@stage01:~# grep ffa6216b-4274-4826-910b-be342b51f262 categories-*
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-and-homelessness
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category disability
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category conditions
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category sensory-disability
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category conditions
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-support
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-adaptations
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-and-homelessness
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category disability
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category conditions
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category sensory-disability
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category conditions
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-support
categories-1.txt:Item ffa6216b-4274-4826-910b-be342b51f262 has category housing-adaptations

If I grep for it in categories-2.txt I get no results, as though the service wasn't returned at all! The same happens with plenty of other services.
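
To spot the duplicated slugs without eyeballing grep output, this sketch (appended after the fetching loop in the script above) flags any service whose category list repeats a slug:

from collections import Counter

for item in aliss_data:
    slug_counts = Counter(cat["slug"] for cat in item.get("categories") or [])
    dupes = {slug: n for slug, n in slug_counts.items() if n > 1}
    if dupes:
        print(f"Item {item['id']} has duplicated categories: {dupes}")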

mig5 commented 2 years ago

Here's one more example.

This Python script fetches all the results from the import route and then writes the 'name' attribute of each service to a text file.

#!/usr/bin/env python3

import requests
import time

def main():
    # Standard header for API calls
    headers = {"Content-Type": "application/json"}
    # URL of the ALISS import API
    url = "https://api.aliss.org/"

    aliss_data = []

    r = requests.get(f"{url}v4/import/", headers=headers)
    raw = r.json()["data"]
    for i in raw:
        aliss_data.append(i)
    while r.json()["next"]:
        next_url = r.json()["next"]
        if "/api" in next_url:
            next_url = next_url.strip("/api")
        next_url = url + next_url
        try:
            r = requests.get(next_url, headers=headers)
            r.raise_for_status()
            raw = r.json()["data"]
            for i in raw:
                aliss_data.append(i)
        except requests.exceptions.HTTPError as err:
            print(err)
            break

    timestamp = int(time.time())
    with open(f"aliss-names-{timestamp}.txt", "w") as outfile:
        for item in aliss_data:
            outfile.write(item["name"] + "\n")

if __name__ == "__main__":
    main()

Running it twice on the same machine, one right after the other, I get differently ordered results, but I also get names in one fetch that don't exist in the other (for example: Deeside Stroke Group, Nemo Arts Embroidery).

I have attached 2 outputs of this script so you can compare them to see what I mean.

It feels to me like each request to the ALISS API is perhaps hitting a different backend server or database, returning different results depending on which one it reaches.

Even aside from the script, if I go to your page https://api.aliss.org/v4/import?page=278 in my browser and refresh it several times, eventually 'Deeside Stroke Group' disappears from the results. So I know it's not my script, at least :)
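
The browser check can be scripted too; fetching the same page several times and diffing the returned ids shows the inconsistency is server-side (a sketch, using the page URL above):

import requests

url = "https://api.aliss.org/v4/import?page=278"
runs = []
for _ in range(5):
    r = requests.get(url, headers={"Content-Type": "application/json"})
    r.raise_for_status()
    runs.append({item["id"] for item in r.json()["data"]})

# If the same page returns different id sets across fetches, the
# problem is in the API, not in any pagination logic on my side
for i, ids in enumerate(runs[1:], start=1):
    print(f"fetch {i} differs from fetch 0 by {len(ids ^ runs[0])} ids")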

Attachments: aliss-names-1661383126.txt, aliss-names-1661383167.txt
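
To compare the two attachments, a quick set diff of the names shows what appears in only one run:

with open("aliss-names-1661383126.txt") as f1, open("aliss-names-1661383167.txt") as f2:
    names1 = set(f1.read().splitlines())
    names2 = set(f2.read().splitlines())

print("Only in the first run:", sorted(names1 - names2))
print("Only in the second run:", sorted(names2 - names1))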

mig5 commented 2 years ago

Everything is now working since your fix went live. Thanks!