Example line 1:
https://ir.canterbury.ac.nz/bitstream/10092/3347/1/12619269_ch-IntegAdEdSys.pdf,https://doi.org/10.1007/978-3-642-05039-8_8,10.1007/978-3-642-05039-8_8,,http://hdl.handle.net/10092/3347,doi,,False,Semantic integration of adaptive educational systems,5623263,https://openalex.org/W1904444722,False,Semantic Integration of Adaptive Educational Systems,True,
https://portal.mardi4nfdi.de/wiki/Item:Q5623263 (full data: https://api.openalex.org/works/W1904444722)
Example line 3:
,https://doi.org/10.1002/rnc.2973,10.1002/rnc.2973,,,doi,"['1049-8923', '1099-1239']",False,New results in robust functional state estimation using two sliding mode observers in cascade,6413175,https://openalex.org/W1941814427,False,New results in robust functional state estimation using two sliding mode observers in cascade,True,
@Daniel-Mietchen can you look at the different URLs and suggest how you would import them?
I talked to the author of the CSV file. He suggested using prime_landing_page_url, best_landing_page, and best_pdf_url, in that order.
The CSV parser runs without any problems on the entire dataset.
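For illustration, a minimal sketch of that priority logic, assuming the CSV has a header row and the columns are named exactly as above (the actual ImportOpenAlex.php may do this differently):

<?php
// Sketch only: pick the first non-empty URL in the suggested order.
// Column names are taken from the comment above and may not match the
// actual header of open_alex_data.csv.
function pickUrl( array $row ): ?string {
	foreach ( [ 'prime_landing_page_url', 'best_landing_page', 'best_pdf_url' ] as $col ) {
		if ( !empty( $row[$col] ) ) {
			return $row[$col];
		}
	}
	return null;
}

$handle = fopen( '/open_alex_data.csv', 'r' );
$header = fgetcsv( $handle );
while ( ( $fields = fgetcsv( $handle ) ) !== false ) {
	if ( count( $fields ) !== count( $header ) ) {
		continue; // skip malformed rows
	}
	$row = array_combine( $header, $fields );
	$url = pickUrl( $row );
	// ... import the row using $url ...
}
fclose( $handle );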
root@91e78119efac:/var/www/html/extensions/MathSearch/maintenance# time php ImportOpenAlex.php /open_alex_data.csv
*******************************************************************************
NOTE: Do not run maintenance scripts directly, use maintenance/run.php instead!
Running scripts directly has been deprecated in MediaWiki 1.40.
It may not work for some (or any) scripts in the future.
*******************************************************************************
real 1m25.514s
user 1m15.850s
sys 0m2.000s
PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
PREFIX wd: <https://portal.mardi4nfdi.de/entity/>
SELECT ?qid WHERE {
  ?item wdt:P1451 ?de .
  FILTER (?de IN ("2636744", "6895265"))
  BIND (REPLACE(STR(?item), "^.*/Q([^/]*)$", "$1") AS ?qid)
}
LIMIT 2
This gets DEs but is very inefficient (46 s).
Using haswbstatement (cf. #432) for individual items, this can be reduced to 130 ms:
SELECT ?qid WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "portal.mardi4nfdi.de" ;
                    wikibase:api "Generator" ;
                    mwapi:generator "search" ;
                    mwapi:gsrsearch "haswbstatement:P1451=2559697" ;
                    mwapi:gsrnamespace "120" .
    ?wbItemTitle wikibase:apiOutput mwapi:title .
  }
  BIND (REPLACE(STR(?wbItemTitle), "^Item:Q(.*)$", "$1") AS ?qid)
}
This uses the MediaWiki API, cf. https://portal.mardi4nfdi.de/w/api.php?action=query&format=xml&generator=search&gsrsearch=mardi&gsrnamespace=120
It works only if the CirrusSearch document of the item (e.g., https://portal.mardi4nfdi.de/wiki/Item:Q576712?action=cirrusdump) includes the statement with the DE number:
"statement_keywords": [
"P1451=2559697"
],
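To check that precondition for a given item, something like the following sketch could be used (the exact JSON shape of the cirrusdump output is an assumption here):

<?php
// Sketch: verify that the CirrusSearch document of an item contains the
// statement keyword that haswbstatement matches against.
$qid = 'Q576712';
$de = '2559697';
$json = file_get_contents( "https://portal.mardi4nfdi.de/wiki/Item:$qid?action=cirrusdump" );
$docs = json_decode( $json, true );
// cirrusdump returns the documents as they would be indexed; assumed
// here to be an array of entries with a _source field.
$keywords = $docs[0]['_source']['statement_keywords'] ?? [];
echo in_array( "P1451=$de", $keywords, true )
	? "haswbstatement:P1451=$de is searchable\n"
	: "statement keyword not indexed yet\n";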
Here is another variant (88 ms):
PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
PREFIX wd: <https://portal.mardi4nfdi.de/entity/>
SELECT (REPLACE(STR(?item), ".*Q", "Q") AS ?qid)
WHERE {
  VALUES ?de { "2636744" "6895265" }
  ?item wdt:P1451 ?de .
}
LIMIT 2
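When many DE numbers have to be resolved, the VALUES clause of this variant can be generated from a batch, e.g. with a helper along these lines (a sketch; the function name and the escaping are illustrative):

<?php
// Sketch: build the VALUES-based QID lookup for a batch of DE numbers.
function buildDeLookupQuery( array $deNumbers ): string {
	$values = implode( ' ', array_map(
		static fn ( string $de ): string => '"' . addslashes( $de ) . '"',
		$deNumbers
	) );
	return <<<SPARQL
PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
SELECT ?de (REPLACE(STR(?item), ".*Q", "Q") AS ?qid) WHERE {
  VALUES ?de { $values }
  ?item wdt:P1451 ?de .
}
SPARQL;
}

echo buildDeLookupQuery( [ '2636744', '6895265' ] );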
The first 100 imported items look good.
root@91e78119efac:/var/www/html/maintenance# ./run ../extensions/MathSearch/maintenance/ImportOpenAlex.php oa.csv && ./run runJobs
2024-03-19 10:01:12 OpenAlex Special: jobname=openalex240319110312 rows=array(101) segment=0 requestId=a454d3e3cea7ae04cdca07fe namespace=-1 title= (uuid=2aa53c5dee784e449d495d10a5daae4a,timestamp=1710842472) STARTING
2024-03-19 10:01:49 OpenAlex Special: jobname=openalex240319110312 rows=array(101) segment=0 requestId=a454d3e3cea7ae04cdca07fe namespace=-1 title= (uuid=2aa53c5dee784e449d495d10a5daae4a,timestamp=1710842472) t=36813 good
https://portal.mardi4nfdi.de/wiki/Special:Contributions/Openalex240319110312
The script is now running.
root@c46a4d101259:/var/www/html/maintenance# ./run ../extensions/MathSearch/maintenance/ImportOpenAlex.php /open_alex_data.csv && echo "done"
...
Push jobs to segment 24575.
Push jobs to segment 24576.
Pushed last 24577.
done
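For context, the segment numbers above suggest the rows are pushed to the job queue in chunks of roughly 100 (≈ 2,486,267 rows / 24,577 segments); a minimal sketch of such chunking, with hypothetical names:

<?php
// Sketch: split parsed CSV rows into fixed-size segments, one job per
// segment, mirroring the "Push jobs to segment N" log lines above.
function pushImportJob( array $rows, int $segment ): void {
	// Stand-in for the actual job-queue call of the import script.
	echo "Push jobs to segment $segment.\n";
}

$rows = range( 1, 1000 ); // placeholder for the parsed CSV rows
$segmentSize = 101;       // matches rows=array(101) in the job log
$segment = 0;
foreach ( array_chunk( $rows, $segmentSize ) as $chunk ) {
	pushImportJob( $chunk, $segment );
	$segment++;
}
echo "Pushed last $segment.\n";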
Progress can be tracked at https://portal.mardi4nfdi.de/wiki/Special:Contributions/Openalex240319020357; currently about 1% (20k of 2M) is done.
ETA ~12 h, so it might be done by tomorrow morning.
@physikerwelt What do you think of documenting such user accounts in a way that allows others to trace the edits to a specific ticket or pull request or some such?
See https://portal.mardi4nfdi.de/w/index.php?title=User:Openalex240319020357&oldid=31230691 for an example.
@Daniel-Mietchen yes, I did that for some already. One could also add a bit of automation that records which command was entered in the terminal, and when... I am not sure whether that would help much?
It would be better than having no documentation. If you see a way to automate that even just a bit, great.
@Daniel-Mietchen it depends. The username already says which script was executed and when. If the username is a red link, there is no additional information; if it is blue, one sees that someone entered information. Generating user profile pages automatically would destroy that blue/red-link logic.
The job was completed. However, not everything could be imported.
root@4dd0d8795c52:/var/www/html/maintenance# wc -l /open_alex_data.csv
2486267 /open_alex_data.csv
vs. 1,841,457 edits, i.e., roughly 0.64M rows (2,486,267 − 1,841,457 = 644,810) are unaccounted for.
The problems with the first run, which were reverted, are on the order of 1% and thus do not explain the missing entries. I suggest we close this for now and re-run the script when @LizzAlice reimports the new data from zbMATH Open.
@timconrad I don't know exactly how many new titles have been inserted. Can you give an estimate of how many titles are missing?
Describe the issue
Import OpenAlex data.
Sample data
zb_title (copyright unclear)