TechNote-ai / osdg

A tool to assign Sustainable Development Goals to a scientific abstract
GNU Lesser General Public License v3.0

Is the api source code available? #2

Open caseyfitz opened 3 years ago

caseyfitz commented 3 years ago

I'm interested in the source for the tool itself, e.g., the Dockerfile and the scripts run by the container.

caseyfitz commented 3 years ago

More specifically, inside the container we find many files that aren't in this repository:

└── ubuntu
    ├── application.py
    ├── config.py
    ├── data
    │   ├── CombinedDictionaryMap.json
    │   ├── CombinedNGRAMMatrixCSR.pkl
    │   ├── FOSIndex.json
    │   ├── FOSMAP.json
    │   ├── OSDG-Ontology.json
    │   ├── SdgThresholds.json
    │   ├── Spacy_bigram_th1.md
    │   ├── spacy_idf_th1.json
    │   └── spacy_trigram_th1.md
    ├── Dockerfile
    ├── exceptions.py
    ├── get_data.py
    ├── index_html
    ├── LICENSE
    ├── __pycache__
    │   ├── config.cpython-37.pyc
    │   ├── exceptions.cpython-37.pyc
    │   ├── sdgFinder.cpython-37.pyc
    │   └── utils.cpython-37.pyc
    ├── README.md
    ├── requirements.txt
    ├── sampleAPICall.py
    ├── sdgFinder.py
    ├── setup.sh
    └── utils.py

Are these maintained in a public repository?

lukas-pkl commented 3 years ago

@caseyfitz Thanks for your question! The answer is: not yet, but we will put these in the public repo by the end of the month, so they should be online from 1st February 2021. However, we will move the repository to a new address (https://github.com/osdg-ai/osdg-tool), and the full source code will be posted there.
We are currently cleaning and refactoring the code so that it is more readable and user-friendly.

caseyfitz commented 3 years ago

@lukas-pkl, looking forward to it, thanks!

caseyfitz commented 3 years ago

@lukas-pkl a quick related question (then I'll make sure to close).

I'm wondering how to interpret the "quota_9" field in the file SdgThresholds.json, which has the form

{
    "SDG_1":
        {"LowerTh": 2, "UpperTh": 4, "quota_9": 6},
    "SDG_2":
        {"LowerTh": 2, "UpperTh": 6, "quota_9": 20},
    "SDG_3":
    ....

which is used in sdgFinder.py to divide the relevance scores for each SDG:

            sdg_res_raw_fosNames[key] = plh3

        # Applying .9 quota
        self.sdg_res = sorted(sdg_res_raw_n.items(), key=lambda kv: kv[1] / self.sdgThresholds[kv[0]]['quota_9'], reverse=True)

        self.sdg_res_det = {}

I couldn't find this term referenced in the main repo or the arXiv paper.

Thanks!

lukas-pkl commented 3 years ago

@caseyfitz - we are addressing issues like this in our current refactoring.

Basically, quota_9 is a parameter we use to sort the SDGs before producing the output. One of the issues we faced was that the API sometimes produces too many SDG labels even with thresholds applied. As such, we have decided to limit the API output to three SDG labels. We select the top three labels using the quota_9 parameter, which we set by assigning SDG tags to a pool of publications and analyzing the distribution of SDG-FOSes.
The parameter corresponds to the 90th percentile of that distribution for each SDG, which means that we rank a publication's SDGs by how close they come to this mark.
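
To illustrate the ranking described above, here is a minimal Python sketch, not the code from sdgFinder.py. The function names `percentile_90` and `top_sdgs`, the nearest-rank percentile method, and the raw scores are assumptions made for the example. Only the SDG_1 and SDG_2 threshold values come from the SdgThresholds.json excerpt quoted earlier in this thread.

```python
import math


def percentile_90(counts):
    """Nearest-rank 90th percentile of per-publication counts.

    Assumption: the thread does not say which percentile method was used;
    nearest-rank is chosen here only because it is simple to follow.
    """
    ordered = sorted(counts)
    rank = math.ceil(9 * len(ordered) / 10)  # 1-based nearest-rank index
    return ordered[rank - 1]


def top_sdgs(raw_scores, thresholds, limit=3):
    """Rank SDGs by raw score divided by quota_9 and keep the top `limit`."""
    ranked = sorted(
        raw_scores.items(),
        key=lambda kv: kv[1] / thresholds[kv[0]]["quota_9"],
        reverse=True,
    )
    return [sdg for sdg, _ in ranked[:limit]]


# Threshold values as quoted from SdgThresholds.json in this thread.
thresholds = {
    "SDG_1": {"LowerTh": 2, "UpperTh": 4, "quota_9": 6},
    "SDG_2": {"LowerTh": 2, "UpperTh": 6, "quota_9": 20},
}

# Hypothetical raw scores for a single publication.
raw_scores = {"SDG_1": 5, "SDG_2": 10}

print(top_sdgs(raw_scores, thresholds))
```

With these made-up scores, SDG_1 ranks ahead of SDG_2 (5/6 versus 10/20) even though its raw score is lower, which is the effect of normalizing by quota_9.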

We are preparing an update to the arXiv paper, which we will present at a conference in July. We will update the arXiv version after the event.

Let me know if anything else comes up!

caseyfitz commented 3 years ago

Thank you!