Resolve texts to ArXiv IDs where possible, and augment our metadata with it

JonathanReeve / data-ethics-literature-review

An automated survey of literature and curricula surrounding ethics in data science. WIP.

http://data-ethics.tech

GNU General Public License v3.0

Resolve texts to ArXiv IDs where possible, and augment our metadata with it #28

Open JonathanReeve opened 3 years ago

JonathanReeve commented 3 years ago

This is a sub-task of #21, and a sibling task of #27.

Let's use the ArXiv API to resolve texts, and use the data retrieved from that API to augment our own bibliographic data.

The easy ones will be ones that have ArXiv IDs:

<https://data-ethics.tech/text/#item_28> a z:UserItem ;
    res:resource [ a bibo:Note ;
            z:note "<p>ArXiv Preprint ArXiv:1609.07236.</p>" ],
        [ a bibo:Note ;
            z:note "<p>very short, read first</p>" ],
        [ a bibo:Note ;
            z:note "<p>What Is Technology:</p>" ],
        [ a bibo:Note ;
            z:note "<p>available online at</p>" ],
        [ a bibo:Note ;
            z:note "<p>from</p>" ] .

(That example is very messy.)
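
For items like the one above, the ArXiv ID is buried in the HTML of a z:note. A minimal sketch of pulling it out with a regular expression follows. The function name `find_arxiv_id` is hypothetical, and the pattern only covers new-style IDs (such as 1609.07236); old-style IDs like hep-th/9901001 would need a second pattern.

```python
import re
from html import unescape

# New-style arXiv IDs look like YYMM.NNNN or YYMM.NNNNN, optionally followed
# by a version suffix such as v2. Old-style IDs are not handled here.
ARXIV_ID = re.compile(r"arxiv\s*:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def find_arxiv_id(note_html):
    """Return the first arXiv ID found in a note's HTML, or None."""
    # Strip tags and decode entities so the pattern sees plain text.
    text = unescape(TAG.sub(" ", note_html))
    match = ARXIV_ID.search(text)
    return match.group(1) if match else None


notes = [
    "<p>ArXiv Preprint ArXiv:1609.07236.</p>",
    "<p>very short, read first</p>",
]
print([find_arxiv_id(note) for note in notes])
```

The version suffix is matched but dropped, so the returned ID always refers to the paper rather than to one specific revision.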

This will make #14 a lot easier.

JonathanReeve commented 3 years ago

@sy2657, can you take this one?

JonathanReeve commented 3 years ago

Here is the beginning of an example where I'm querying the CrossRef API for bibliographic data that I'm then using to augment the graph: https://github.com/JonathanReeve/data-ethics-literature-review/blob/1ed14be780f74c2c04a006e037145320516893df/turtleize/enhanceBibliography.py

sy2657 commented 3 years ago

Ok, sure.

JonathanReeve commented 3 years ago

@sy2657, just leave a note here if you have any questions about this issue, or how best to approach it.

JonathanReeve commented 3 years ago

Here's a breakdown of how I imagine this would go:

For each reading in the coursesAndTexts.ttl graph:
- See if it has an ArXiv ID first.
  - If it does, just get all of ArXiv's metadata for that paper.
  - If it doesn't, query the API for its title and author, or whatever other fields return the most accurate results.
  - Maybe do some kind of check using Levenshtein distance to make sure the search result has the same title as the queried title.
- Merge the search result metadata with our metadata for that article.
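
The steps above could be sketched roughly as follows, using only the standard library. This is an illustration under assumptions rather than working project code: the helper names (`query_url`, `parse_feed`, `levenshtein`, `same_title`, `fetch`) and the 10% tolerance are made up for the example, and it relies on the public arXiv API endpoint at export.arxiv.org, which returns an Atom feed.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def query_url(arxiv_id=None, title=None, author=None, max_results=5):
    """Build an arXiv API URL: an ID lookup if possible, else a search."""
    if arxiv_id:
        params = {"id_list": arxiv_id}
    else:
        terms = []
        if title:
            terms.append('ti:"%s"' % title)
        if author:
            terms.append('au:"%s"' % author)
        if not terms:
            raise ValueError("need an ID, a title, or an author")
        params = {"search_query": " AND ".join(terms),
                  "max_results": max_results}
    return API + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Turn the Atom feed returned by the API into a list of dicts."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        results.append({
            "id": entry.findtext(ATOM + "id", default=""),
            "title": " ".join(title.split()),  # collapse line breaks
            "authors": [a.findtext(ATOM + "name", default="")
                        for a in entry.iter(ATOM + "author")],
            "abstract": entry.findtext(ATOM + "summary", default="").strip(),
        })
    return results


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def same_title(ours, theirs, tolerance=0.1):
    """True if the titles differ by at most `tolerance` of their length."""
    a, b = ours.casefold().strip(), theirs.casefold().strip()
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest <= tolerance


def fetch(url):
    """Download and parse one API response (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_feed(response.read())
```

A caller would try `fetch(query_url(arxiv_id=...))` first, fall back to `fetch(query_url(title=..., author=...))`, and keep only results for which `same_title` returns True before merging anything into the graph.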

@Zhuohan-Amber, the process is pretty much the same for #27, only without the Semantic Scholar IDs. There, the DOI is the thing that gets us the furthest.

ssy248 commented 3 years ago

Hi, I ran your code for queryCrossRef on a small subset of coursesAndTexts.ttl and it did not print out any data...

Do I do this task for the items of the category z:UserItem?

Also, how do I save the metadata? Should I save the fields (Title, Authors, Abstract, Comments, Report-no, Category, Journal-ref, DOI, MSC-class, ACM-class) in some particular format?

JonathanReeve commented 3 years ago

Yep, that code isn't working yet. It's just the beginnings of an example. The idea is just: go through each text (z:UserItem), find more metadata about the text, and then add it to the graph.

The text metadata format is bibliontology. Here's an example for an article, and here's one for a book. Bibliontology uses Dublin Core (dcterms) for things like titles and authors.

So if you have variables title and author, and others, to add to the graph, you can add them with something like:

from rdflib import Literal
from rdflib.namespace import DCTERMS
g.add((item, DCTERMS.title, Literal(title)))
g.add((item, DCTERMS.creator, Literal(author)))

where item is the z:UserItem.