WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

Build a dictionary for software mining #2002

Open Daniel-Mietchen opened 2 years ago

Daniel-Mietchen commented 2 years ago

The basic idea would be to have a list of strings that are frequently used to refer to particular pieces of software. This list can then be enhanced, e.g. by mapping each string to its corresponding Wikidata item.
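
A minimal sketch of what such a dictionary could look like, assuming a plain mapping from normalized strings to Wikidata QIDs. The names `SOFTWARE_DICTIONARY`, `normalize` and `lookup` are hypothetical, and the two QIDs are included for illustration only and should be checked against Wikidata before anyone relies on them.

```python
from typing import Optional

# Hypothetical dictionary: normalized surface string -> Wikidata QID.
# The QIDs are illustrative and should be verified on Wikidata.
SOFTWARE_DICTIONARY = {
    "mysql": "Q850",
    "python": "Q28865",
}


def normalize(mention: str) -> str:
    """Case-fold a mention and collapse runs of whitespace."""
    return " ".join(mention.casefold().split())


def lookup(mention: str) -> Optional[str]:
    """Return the QID for a software mention, or None if it is not listed."""
    return SOFTWARE_DICTIONARY.get(normalize(mention))
```

Normalizing before lookup means that variants such as "MySQL" and "mysql" resolve to the same entry. Real mentions will likely need more than this (aliases, version suffixes, ambiguous names).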

Daniel-Mietchen commented 2 years ago

Some possible starting points:

  1. the Softcite dataset available from https://github.com/jameshowison/softcite (older) and https://github.com/howisonlab/softcite-dataset (newer): https://github.com/jameshowison/softcite/blob/743db9c7a431486d1ab5aaacb91f1d0a32e161f1/output/analysis_output.txt#L38 and https://github.com/howisonlab/softcite-dataset/blob/bd08f4f8801341e3a631008aa560c554a7076b4b/data/software_lists/go/softcite_software_names.csv .
  2. Wikidata queries for software already tagged as being used or discussed in publications, as in https://w.wiki/5F9m .

Daniel-Mietchen commented 2 years ago

That Wikidata query times out sometimes, so here is a slightly simplified version that is quicker: https://w.wiki/5F9v .

WolfgangFahl commented 2 years ago

Can't get this running on qlever https://qlever.cs.uni-freiburg.de/wikidata/6PYy2P

ShweataNHegde commented 2 years ago

Create a software dictionary

Daniel-Mietchen commented 2 years ago

Here is a query for Wikidata-known papers with "climate change" in their title AND a PMCID AND a "full text available" tag: https://w.wiki/5FAi

Daniel-Mietchen commented 2 years ago

> Can't get this running on qlever https://qlever.cs.uni-freiburg.de/wikidata/6PYy2P

I think this is down to QLever not being feature-complete yet.

ShweataNHegde commented 2 years ago

I've mined the full text for software mentions. Here's the CSV output: https://github.com/petermr/docanalysis/blob/main/resources/software_mentions.csv

You can find the Notebook I created here: https://colab.research.google.com/drive/1Kw7XDSce2s_EfzZIEkaIVM4yfunmKZvp?usp=sharing

I haven't added documentation yet.

ShweataNHegde commented 2 years ago

I realized that the CSV format was limiting, because the lists in the has_terms column were written out as strings. I have now added a method to output the results in JSON format. This should make downstream analysis easier and the results more usable.

  "20955": {
        "file_path": "/content/software_related_papers/PMC4184317/sections/1_body/1_implementation/1_software_architecture/3_p.xml",
        "paragraph": "For cancer data analysis, we imported the cancer gene index (CGI,  https://wiki.nci.nih.gov/display/cageneindex ) data into a MySQL database and then developed a hibernate API for the server-side application. The CGI data contains annotations for cancer-related genes. These annotations were extracted by using text-mining technologies and then validated by human curators (  https://wiki.nci.nih.gov/display/cageneindex/Creation+of+the+Cancer+Gene+Index ).",
        "sentence": "For cancer data analysis, we imported the cancer gene index (CGI,  https://wiki.nci.nih.gov/display/cageneindex ) data into a MySQL database and then developed a hibernate API for the server-side application.",
        "section": "ALL",
        "has_terms": [
            "MySQL"
        ],
        "weight_terms": 1
    },
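
To illustrate the CSV limitation and one possible downstream use of the JSON output, here is a rough sketch, not the code from the notebook. It assumes entries shaped like the one above (abbreviated here); the second entry and the helper names `csv_roundtrip` and `count_terms` are made up for the example.

```python
import ast
import csv
import io
import json
from collections import Counter

# Abbreviated entries. The first mirrors the sample above;
# the second is hypothetical and only there to make the counts non-trivial.
results = {
    "20955": {"section": "ALL", "has_terms": ["MySQL"], "weight_terms": 1},
    "hypothetical-1": {
        "section": "ALL",
        "has_terms": ["MySQL", "Hibernate"],
        "weight_terms": 2,
    },
}


def csv_roundtrip(entry):
    """Write one entry as a CSV row and read it back as a dict of strings."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(entry))
    writer.writeheader()
    writer.writerow(entry)
    buffer.seek(0)
    return next(csv.DictReader(buffer))


def count_terms(entries):
    """Count how often each term appears across all has_terms lists."""
    counts = Counter()
    for entry in entries.values():
        counts.update(entry.get("has_terms", []))
    return counts


# CSV: the list comes back as its string representation.
csv_row = csv_roundtrip(results["20955"])
recovered = ast.literal_eval(csv_row["has_terms"])  # extra parsing step needed

# JSON: the list survives the round trip and can be counted directly.
json_results = json.loads(json.dumps(results))
term_counts = count_terms(json_results)
```

After the CSV round trip, `has_terms` is a string that has to be parsed again (here with `ast.literal_eval`), whereas the JSON round trip hands back real lists that `count_terms` can consume as they are.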