Hipo / university-domains-list

University Domains and Names Data List & API
MIT License
1.29k stars 429 forks

Assigning a UUID to each entry #200

Open omcandido opened 5 years ago

omcandido commented 5 years ago

Hello,

I was considering using this database to perform automatic email validation for my users.

There is, however, a big issue when I think about keeping my database up to date: the lack of an identifier with which I can update my universities when the JSON gets updated.

That is, if I generate some content from one of the entries in the JSON, I also want to store the identifier of that entry with my content, so I can update my data periodically. With an ID, I can detect which universities have been modified and periodically update the content of my website.
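A minimal sketch of that update check, assuming each entry carried a stable `id` field. No such field exists in the JSON today; `id`, `index_by_id` and `diff_snapshots` are hypothetical names used only for illustration.

```python
def index_by_id(entries, id_field="id"):
    """Map each entry's identifier to the entry itself."""
    return {entry[id_field]: entry for entry in entries}


def diff_snapshots(old_entries, new_entries, id_field="id"):
    """Compare two snapshots of the list; return (added, removed, modified) IDs."""
    old = index_by_id(old_entries, id_field)
    new = index_by_id(new_entries, id_field)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # An entry counts as modified when its ID survives but any field differs.
    modified = sorted(key for key in old.keys() & new.keys() if old[key] != new[key])
    return added, removed, modified


if __name__ == "__main__":
    old = [
        {"id": 1, "name": "Example University", "domains": ["example.edu"]},
        {"id": 2, "name": "Sample College", "domains": ["sample.edu"]},
    ]
    new = [
        {"id": 1, "name": "Example University", "domains": ["example.ac.uk"]},
        {"id": 3, "name": "New Institute", "domains": ["new.edu"]},
    ]
    print(diff_snapshots(old, new))  # ([3], [2], [1])
```

With an ID, the domain change of entry 1 shows up as a modification rather than as an unrelated removal and addition.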

Apart from an ID used within this project, having some sort of external ID that universities use (it would have to be per country, as I don't think there is a universal identifier for all universities in the world) would allow cross-referencing the data with other sources.

Please let me know what your plans for enhancing this database are. Thanks for creating and maintaining this project!

yigitguler commented 5 years ago

Hi,

Thank you for your interest. I have some questions about your proposition:

Best

omcandido commented 5 years ago

Hi,

I intend to create university "entities" on my website, whose fields are (partially) extracted from the JSON. I don't necessarily have to save the JSON, just parse it programmatically to ensure that those entities are up to date.

Regarding the identifier, why do you think this seems hard with the JSON format? We cannot guarantee 100% that the UUIDs are globally unique (although the probability of a collision should be small enough not to matter), but that shouldn't be a problem.

Even an identifier as simple as an incrementing integer would work for the purpose of keeping track of changes to the university entries. It would still be useful to have an official local ID for each university, according to its country, for cross-referencing purposes, but that seems much harder.

Cheers

zsubzwary commented 5 years ago

How can we add an identifier and make sure that it is unique (and will remain unique)?

I think a GUID would be the solution: every time someone creates or updates a record, he/she would have to assign a new GUID. (Heck, we could maybe even check this in a Python test or something like that, which would verify that a new GUID has been assigned to the created/updated record.)

But the downside is that the file size will grow larger...
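To get a feel for that trade-off, here is a rough sketch that adds a random version 4 UUID to every entry lacking one and compares the serialized size before and after. The `id` field name and both helper functions are assumptions for illustration, not part of the project.

```python
import json
import uuid


def assign_uuids(entries, id_field="id"):
    """Return copies of the entries, adding a random UUID where none exists."""
    result = []
    for entry in entries:
        updated = dict(entry)
        # Keep an existing identifier untouched; only fill in missing ones.
        if id_field not in updated:
            updated[id_field] = str(uuid.uuid4())
        result.append(updated)
    return result


def serialized_size(entries):
    """Size in bytes of the entries when dumped as UTF-8 JSON."""
    return len(json.dumps(entries, ensure_ascii=False).encode("utf-8"))


if __name__ == "__main__":
    sample = [
        {"name": "Example University", "alpha_two_code": "TR"},
        {"name": "Sample College", "alpha_two_code": "PK"},
    ]
    with_ids = assign_uuids(sample)
    growth = serialized_size(with_ids) - serialized_size(sample)
    print("extra bytes per entry:", growth / len(sample))
```

A UUID in its canonical text form is 36 characters, so the overhead is a few dozen bytes per entry plus the key name and JSON punctuation.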

yigitguler commented 5 years ago

@omcandido I see your point. Adding an id is a very easy technical challenge. Ensuring its uniqueness is also quite easy by writing a test. The difficulty comes on the data integrity (mostly political) side.

Right now, you can use the domain names of the universities as identifiers; this will be enough to detect additions/removals. However, I agree that when a university changes its domain, it will be very hard for you to detect.
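As an illustration of the domain-based approach and its limitation, the sketch below compares the sets of domains in two snapshots. It relies on the `domains` list each entry carries; the helper names are hypothetical.

```python
def domain_set(entries):
    """Collect every domain in a snapshot, lower-cased for comparison."""
    return {
        domain.lower()
        for entry in entries
        for domain in entry.get("domains", [])
    }


def diff_domains(old_entries, new_entries):
    """Return (added domains, removed domains) between two snapshots."""
    old, new = domain_set(old_entries), domain_set(new_entries)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    old = [{"name": "Example University", "domains": ["old-name.edu"]}]
    new = [{"name": "Example University", "domains": ["new-name.edu"]}]
    # A domain change is indistinguishable from one removal plus one addition.
    print(diff_domains(old, new))  # (['new-name.edu'], ['old-name.edu'])
```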

If we add IDs, this responsibility will be in the hands of the PR owner. Sometimes we receive PRs that change both the name and the domain of a university, and it is very difficult to know whether that is an update or a deletion plus an addition. Sometimes authorities divide a university into two, or two universities are merged into a new one, yet both keep their old domains, etc. It is a difficult problem.

I propose to check if there is an accepted tracking code of educational institutions. If there is one, we can use that identification number in our list. Otherwise, it can be very difficult.

zsubzwary commented 5 years ago

Well, the problem is that every university is registered in its own country, and each country has its own rules. Here in Pakistan, for example, the HEC (Higher Education Commission) is responsible for the verification, validation, and almost everything else concerning every university. I suspect that they have a unique ID for each of them, but what about the global scale?

I don't know of any organization that has such capabilities or keeps such records.

And by the way, if such an organization did exist, what would we do about the records of colleges? (I have seen that this repo also contains some records of colleges; I opened issue #174 about it.)

yigitguler commented 5 years ago

Maybe we can add country codes as a prefix. Ex: TR-242341. I will investigate

zsubzwary commented 5 years ago

Okay, but what is TR-242341? TR stands for Turkey, but what is the rest of it?

omcandido commented 5 years ago

I was having a look at issue #151 and looked a bit into the kind of university data that Wikidata stores. The ISNI code was a common identifier for all the universities I checked. Out of curiosity, I ran this naive script to check how many universities have a match in the ISNI database:

import json
import xml.etree.ElementTree as ET

import requests

# Tally of how many ISNI records match each university name exactly.
count = {'none': 0, 'success': 0, 'several': 0}

with open("../world_universities_and_domains.json", encoding='utf-8') as json_file:
    valid_json = json.load(json_file)

for university in valid_json:
    # Exact-name search against the ISNI SRU endpoint.
    response = requests.get(
        'http://isni.oclc.nl/sru',
        params={'query': 'pica.na="{}"'.format(university['name']),
                'operation': 'searchRetrieve',
                'recordSchema': 'isni-b'}
    )

    root = ET.fromstring(response.content)
    for res in root.findall('{http://www.loc.gov/zing/srw/}numberOfRecords'):
        if res.text == '0':
            count['none'] += 1
        elif res.text == '1':
            count['success'] += 1
        else:
            count['several'] += 1

        # Progress: exact matches so far / universities processed so far.
        print('{}/{}'.format(count['success'], sum(count.values())))

print('None: {}'.format(count['none']))
print('Success: {}'.format(count['success']))
print('Several: {}'.format(count['several']))
print('Total: {}'.format(sum(count.values())))

Output:

None: 2767
Success: 5851
Several: 1067
Total: 9685

Meaning that more than half (5851) of the universities in this .json have a direct match in the ISNI database, and 1067 potentially also have an ISNI code (but it would need to be assigned manually, since there are several search results).

As I said, it's a very naive script: I'm only looking for exact matches. The ISNI search engine also has an option to perform approximate searches, which might help reduce the number of searches without any results.

Also, to automatically match as many universities as possible, we could perform the same kind of search using the Wikidata API and fetch the ISNI number.
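One possible shape for that Wikidata lookup, sketched against the public SPARQL endpoint and the ISNI property (P213). This is untested against the full list and makes several assumptions: it matches English labels exactly, the helper names are invented, and the escaping of the name (borrowed from `json.dumps`) is a simplification.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"


def build_isni_query(name, language="en"):
    """Build a SPARQL query for items whose label equals `name` and that have an ISNI."""
    # json.dumps yields a quoted, backslash-escaped literal (a simplification).
    literal = json.dumps(name, ensure_ascii=False)
    return (
        "SELECT ?item ?isni WHERE { "
        "?item rdfs:label " + literal + "@" + language + " ; "
        "wdt:P213 ?isni . }"
    )


def extract_isnis(sparql_json):
    """Pull the distinct ISNI values out of a SPARQL JSON result, without spaces."""
    bindings = sparql_json.get("results", {}).get("bindings", [])
    return sorted(
        {row["isni"]["value"].replace(" ", "") for row in bindings if "isni" in row}
    )


def fetch_isnis(name):
    """Query Wikidata over the network and return candidate ISNIs for `name`."""
    params = urllib.parse.urlencode({"query": build_isni_query(name), "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_SPARQL_URL + "?" + params,
        headers={"User-Agent": "isni-lookup-sketch/0.1"},  # placeholder agent string
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_isnis(json.load(response))
```

As with the ISNI search, one result would be a direct match, while zero or several results would still need manual review.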

It would be great to cross-reference this data with the data in Wikidata through the ISNI code.

It would still be valuable to have an internal ID, especially in the absence of a global one. Something like TR-242341 would do. I guess 242341 would be the ID within the country. Since we already have the alpha_two_code, maybe an integer ID would suffice?
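A small sketch of how such IDs could be generated from the existing alpha_two_code plus a per-country counter. The `id` field, the zero-padded width and the function name are assumptions, not anything the project has decided.

```python
from collections import defaultdict


def assign_country_ids(entries, id_field="id", width=6):
    """Return copies of the entries with IDs like 'TR-000001', numbered per country."""
    counters = defaultdict(int)
    result = []
    for entry in entries:
        code = entry["alpha_two_code"]
        counters[code] += 1
        new_id = "{}-{:0{}d}".format(code, counters[code], width)
        result.append({**entry, id_field: new_id})
    return result


if __name__ == "__main__":
    sample = [
        {"name": "First University", "alpha_two_code": "TR"},
        {"name": "Second University", "alpha_two_code": "TR"},
        {"name": "Third University", "alpha_two_code": "GB"},
    ]
    for entry in assign_country_ids(sample):
        print(entry["id"], entry["name"])
```

The numbering here depends on file order, so it would only be suitable for a one-off initial assignment; afterwards the IDs would have to be stored in the file and never recomputed.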

Thanks for looking into this, guys!

LeonnardoVerol commented 3 years ago

Initially, each entry could have a UUID.

Besides the global UUID, it should also be possible to have country-internal identifiers. We could work them out like "zip codes". (No need to prefix with the "country" code.)

About updating the name/domain

I guess this would help with some other issues created as well.

PS: It is probably good to think about what to do with old entries: remove them in favor of the new ones, or keep them for historical reasons.