athalhammer / danker

Compute PageRank on >3 billion Wikipedia links on off-the-shelf hardware.
GNU General Public License v3.0

Suggestion: add a link to "current" dump file #26

Closed by prunacomar 1 week ago

prunacomar commented 1 month ago

Hi there, congrats on a great project! I got into it while asking about rank info on Wikidata.

A small tip: it might be useful to add a link to the current file versions to your publication page (https://danker.s3.amazonaws.com/index.html). Perhaps something like https://danker.s3.amazonaws.com/current.allwiki.links.rank.bz2

Regards,

athalhammer commented 1 month ago

> Hi there, congrats on a great project! I got into it while asking about rank info on Wikidata.

Thank you! Nobody has ever bought me a coffee for it :,-)

> A small tip: it might be useful to add a link to the current file versions to your publication page (https://danker.s3.amazonaws.com/index.html). Perhaps something like https://danker.s3.amazonaws.com/current.allwiki.links.rank.bz2

Thanks @prunacomar for the suggestion - I will look into it. At the moment I'm thinking about some additional metadata in the schema.org descriptions.

athalhammer commented 1 week ago

Probably the conclusion of this:

So I guess we will need to learn how to sort by version (or modified dates) that are in the markup of the website
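
A minimal sketch of that sorting step, assuming the version strings are ISO dates such as YYYY-MM-DD. The helper name `latest_version` is hypothetical and the sample values are made up for illustration:

```python
from datetime import date


def latest_version(versions):
    """Return the most recent version string, comparing them as calendar dates."""
    # date.fromisoformat raises ValueError on strings it cannot parse as a date,
    # so malformed entries fail loudly instead of being sorted as plain text.
    return max(versions, key=date.fromisoformat)


# Made-up sample values, not actual dump versions.
sample = ["2024-03-11", "2023-12-25", "2024-10-28", "2024-03-11"]
print(latest_version(sample))
```

Sorting by a parsed date rather than by raw text keeps the comparison correct even if the markup ever lists versions out of order or with duplicates.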

athalhammer commented 1 week ago

Something like this would work:

 wget "https://danker.s3.amazonaws.com/$(curl -s https://danker.s3.amazonaws.com/index.html | grep 'version\":' | sed -s 's/      \"version\": \"//' | sed 's/\",//' | sort -u | tail -n1).allwiki.links.rank.bz2"

But instead of ugly-dissecting the HTML with the best that Linux has to offer, you can also SPARQL it (will work with rdflib==7.1.0):

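from rdflib import Graph
from rdflib.parser import URLInputSource
from rdflib.plugins.parsers.jsonld import JsonLDParser

uri = "https://danker.s3.amazonaws.com/index.html"
g = Graph()

# Fetch the index page as the input source for the parser.
src = URLInputSource(uri)

# Load every JSON-LD <script> block embedded in the HTML into the graph.
p = JsonLDParser()
p.parse(src, g, extract_all_scripts=True)

# Cast each version string to xsd:dateTime so that max() compares dates,
# then cut the time part off again to get the plain version string back.
q = """
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT (strbefore(str(max(xsd:dateTime(CONCAT(?version, "T00:00:00Z")))), "T") as ?versiond)
WHERE {
    ?x <http://schema.org/version> ?version.
}"""

# The aggregate returns a single row holding the latest version.
r = g.query(q)
for row in r:
    print(f"{row.versiond}")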
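
For anyone who would rather avoid the rdflib dependency, here is a rough standard-library sketch of the same idea. It assumes the page embeds its metadata in `<script type="application/ld+json">` blocks with plain `"version"` keys, and that versions are zero-padded ISO dates, so the lexicographic maximum is also the latest one. The names `JsonLdCollector`, `find_versions` and `latest_rank_url` are hypothetical, and the sample HTML is made up rather than copied from the real index page; the file naming is taken from the wget command above.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def find_versions(node):
    """Recursively yield every string stored under a "version" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "version" and isinstance(value, str):
                yield value
            else:
                yield from find_versions(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_versions(item)


def latest_rank_url(html, base="https://danker.s3.amazonaws.com"):
    """Build the rank-file URL for the newest version found in the HTML text."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    versions = [v for block in collector.blocks
                for v in find_versions(json.loads(block))]
    # Assumption: zero-padded ISO dates, so string order equals date order.
    return f"{base}/{max(versions)}.allwiki.links.rank.bz2"


# Made-up sample markup, only to show the expected shape of the input.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "version": "2024-01-01"}
</script>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Dataset", "version": "2024-02-01"}
</script>
</head></html>
"""
print(latest_rank_url(sample_html))
```

To run it against the live page, the HTML text could be fetched with `urllib.request.urlopen(uri).read().decode()` and passed in; the function itself stays free of network access so it is easy to try offline.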