MaRDI4NFDI / portal-compose

docker-composer repo for mardi
https://portal.mardi4nfdi.de
GNU General Public License v3.0

Import OpenAlex data #506

Closed physikerwelt closed 3 months ago

physikerwelt commented 3 months ago

Describe the issue: Import OpenAlex data

Sample data

best_pdf_url,prime_landing_page_url,doi,best_oa_status,best_landing_page_url,type,prime_source_issn,excluded,zb_title,document,openalex_id,prime_is_oa,openalex_title,best_is_oa,prime_license
https://ir.canterbury.ac.nz/bitstream/10092/3347/1/12619269_ch-IntegAdEdSys.pdf,https://doi.org/10.1007/978-3-642-05039-8_8,10.1007/978-3-642-05039-8_8,,http://hdl.handle.net/10092/3347,doi,,False,Semantic integration of adaptive educational systems,5623263,https://openalex.org/W1904444722,False,Semantic Integration of Adaptive Educational Systems,True,
https://arxiv.org/pdf/1507.03817,https://doi.org/10.1016/j.aop.2015.06.012,10.1016/j.aop.2015.06.012,,https://arxiv.org/abs/1507.03817,doi,"['1096-035X', '0003-4916']",False,An angular frequency dependence on the Aharonov-Casher geometric phase,6714651,https://openalex.org/W1941093984,False,An angular frequency dependence on the Aharonov–Casher geometric phase,True,
,https://doi.org/10.1002/rnc.2973,10.1002/rnc.2973,,,doi,"['1049-8923', '1099-1239']",False,New results in robust functional state estimation using two sliding mode observers in cascade,6413175,https://openalex.org/W1941814427,False,New results in robust functional state estimation using two sliding mode observers in cascade,True,
http://arxiv.org/pdf/1508.01191,https://doi.org/10.1007/978-3-642-15034-0_5,10.1007/978-3-642-15034-0_5,,http://arxiv.org/abs/1508.01191,doi,,False,A different perspective on a scale for pairwise comparisons,5809240,https://openalex.org/W1942171147,False,A Different Perspective on a Scale for Pairwise Comparisons,True,
,https://doi.org/10.1007/978-3-642-16567-2_14,10.1007/978-3-642-16567-2_14,,,doi,,False,Exponential ranking: taking into account negative links,5827801,https://openalex.org/W1956304782,False,Exponential Ranking: Taking into Account Negative Links,True,
,https://doi.org/10.1007/s10958-015-2590-3,10.1007/s10958-015-2590-3,,,doi,"['1072-3374', '1573-8795']",False,On constants in Maxwell inequalities for bounded and convex domains,6534362,https://openalex.org/W1959711015,False,On Constants in Maxwell Inequalities for Bounded and Convex Domains,True,
,https://doi.org/10.1006/jdeq.1999.3653,10.1006/jdeq.1999.3653,,https://doi.org/10.1006/jdeq.1999.3653,doi,"['1090-2732', '0022-0396']",False,Relaxation limit for piecewise smooth solutions to systems of conservation laws,1437341,https://openalex.org/W1963644386,True,Relaxation Limit for Piecewise Smooth Solutions to Systems of Conservation Laws,True,publisher-specific-oa
,https://doi.org/10.1016/j.apm.2009.08.028,10.1016/j.apm.2009.08.028,,,doi,"['1872-8480', '0307-904X']",False,A direct updating method for damped gyroscopic systems using measured modal data,5775123,https://openalex.org/W1963776315,False,A direct updating method for damped gyroscopic systems using measured modal data,True,
,https://doi.org/10.1016/j.cnsns.2014.01.002,10.1016/j.cnsns.2014.01.002,,,doi,"['1878-7274', '1007-5704']",False,Modulational instability analysis of the Peregrine soliton,7175079,https://openalex.org/W1964232009,False,Modulational instability analysis of the Peregrine soliton,True,
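A minimal sketch of how rows in this format could be read, using the header and the third sample row from above. This is not the MathSearch importer; the helper name `parse_row` is hypothetical, and treating `prime_source_issn` as a Python-style list literal is an assumption based only on how the sample rows look.

```python
import ast
import csv
import io

BOOLEAN_COLUMNS = ("excluded", "prime_is_oa", "best_is_oa")


def parse_row(row):
    """Convert the string fields of one CSV row into Python values."""
    parsed = dict(row)
    # Boolean columns appear as the strings "True" / "False" in the sample.
    for column in BOOLEAN_COLUMNS:
        parsed[column] = row[column] == "True"
    # Assumption: the ISSN column is either empty or a list literal
    # such as "['1049-8923', '1099-1239']".
    issn = row["prime_source_issn"]
    parsed["prime_source_issn"] = ast.literal_eval(issn) if issn else []
    return parsed


# Header and third data row, copied from the sample data above.
SAMPLE = """best_pdf_url,prime_landing_page_url,doi,best_oa_status,best_landing_page_url,type,prime_source_issn,excluded,zb_title,document,openalex_id,prime_is_oa,openalex_title,best_is_oa,prime_license
,https://doi.org/10.1002/rnc.2973,10.1002/rnc.2973,,,doi,"['1049-8923', '1099-1239']",False,New results in robust functional state estimation using two sliding mode observers in cascade,6413175,https://openalex.org/W1941814427,False,New results in robust functional state estimation using two sliding mode observers in cascade,True,
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0]["document"], rows[0]["prime_source_issn"])
```

`ast.literal_eval` is used instead of `eval` so that only literals are accepted from the file.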

physikerwelt commented 3 months ago

Example line 1: https://ir.canterbury.ac.nz/bitstream/10092/3347/1/12619269_ch-IntegAdEdSys.pdf,https://doi.org/10.1007/978-3-642-05039-8_8,10.1007/978-3-642-05039-8_8,,http://hdl.handle.net/10092/3347,doi,,False,Semantic integration of adaptive educational systems,5623263,https://openalex.org/W1904444722,False,Semantic Integration of Adaptive Educational Systems,True,

Portal item: https://portal.mardi4nfdi.de/wiki/Item:Q5623263
Full data: https://api.openalex.org/works/W1904444722

physikerwelt commented 3 months ago

Example line 3

,https://doi.org/10.1002/rnc.2973,10.1002/rnc.2973,,,doi,"['1049-8923', '1099-1239']",False,New results in robust functional state estimation using two sliding mode observers in cascade,6413175,https://openalex.org/W1941814427,False,New results in robust functional state estimation using two sliding mode observers in cascade,True,

https://portal.mardi4nfdi.de/wiki/Item:Q5177168

physikerwelt commented 3 months ago

@Daniel-Mietchen can you look at the different URLs and suggest how you would import them?

physikerwelt commented 3 months ago

I talked to the author of the CSV file. He suggested using prime_landing_page_url, best_landing_page_url, and best_pdf_url, in that order.
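The suggested priority could be sketched as follows. The function name `pick_url` is hypothetical, and the rows are assumed to be dictionaries keyed by the CSV column names.

```python
# Column names from the CSV header, in the suggested order of preference.
URL_PRIORITY = (
    "prime_landing_page_url",
    "best_landing_page_url",
    "best_pdf_url",
)


def pick_url(row):
    """Return the first non-empty URL in priority order, or None."""
    for column in URL_PRIORITY:
        value = (row.get(column) or "").strip()
        if value:
            return value
    return None


# Values taken from example line 1 of the sample data.
example = {
    "best_pdf_url": "https://ir.canterbury.ac.nz/bitstream/10092/3347/1/12619269_ch-IntegAdEdSys.pdf",
    "prime_landing_page_url": "https://doi.org/10.1007/978-3-642-05039-8_8",
    "best_landing_page_url": "http://hdl.handle.net/10092/3347",
}
print(pick_url(example))
```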

physikerwelt commented 3 months ago

CSV parser https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MathSearch/+/1011327

physikerwelt commented 3 months ago

The CSV parser runs without any problems on the entire dataset.


root@91e78119efac:/var/www/html/extensions/MathSearch/maintenance# time php ImportOpenAlex.php /open_alex_data.csv

*******************************************************************************
NOTE: Do not run maintenance scripts directly, use maintenance/run.php instead!
      Running scripts directly has been deprecated in MediaWiki 1.40.
      It may not work for some (or any) scripts in the future.
*******************************************************************************

real    1m25.514s
user    1m15.850s
sys     0m2.000s

physikerwelt commented 3 months ago

PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
PREFIX wd: <https://portal.mardi4nfdi.de/entity/>
SELECT ?qid WHERE {
  BIND (REPLACE(STR(?item), "^.*/Q([^/]*)$", "$1") AS ?qid)
  ?item wdt:P1451 ?de
  FILTER (?de IN ("2636744", "6895265"))
}
LIMIT 2

This gets the DEs but is very inefficient (46 s).

physikerwelt commented 3 months ago

Using haswbstatement (cf. #432) for individual items, this can be reduced to 130 ms:

SELECT ?qid WHERE {
  BIND (REPLACE(STR(?wbItemTitle), "^Item:Q(.*)$", "$1") AS ?qid)
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "portal.mardi4nfdi.de" ;
      wikibase:api "Generator" ;
      mwapi:generator "search" ;
      mwapi:gsrsearch "haswbstatement:P1451=2559697" ;
      mwapi:gsrnamespace "120" .
    ?wbItemTitle wikibase:apiOutput mwapi:title
  }
}

This uses the MediaWiki search API, cf. https://portal.mardi4nfdi.de/w/api.php?action=query&format=xml&generator=search&gsrsearch=mardi&gsrnamespace=120
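The same lookup can also be made against the API directly, without SPARQL. The sketch below only builds the request URL and extracts QIDs from a JSON response; it makes no network call. The function names are hypothetical, `format=json` is used instead of the `format=xml` in the link above, and the sample response is hand-written for illustration (only the title is taken from this thread).

```python
import re
from urllib.parse import urlencode

API = "https://portal.mardi4nfdi.de/w/api.php"


def search_url(prop, value, namespace=120):
    """Build a search-generator URL for a haswbstatement lookup."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "search",
        "gsrsearch": f"haswbstatement:{prop}={value}",
        "gsrnamespace": namespace,
    }
    return f"{API}?{urlencode(params)}"


def qids_from_response(data):
    """Extract QIDs such as 'Q576712' from page titles like 'Item:Q576712'."""
    pages = data.get("query", {}).get("pages", {})
    qids = []
    for page in pages.values():
        match = re.fullmatch(r"Item:(Q\d+)", page.get("title", ""))
        if match:
            qids.append(match.group(1))
    return qids


# Hand-written illustration of the response shape; the pageid is made up.
sample_response = {
    "query": {"pages": {"1": {"pageid": 1, "ns": 120, "title": "Item:Q576712"}}}
}

print(search_url("P1451", "2559697"))
print(qids_from_response(sample_response))
```

Unlike the SPARQL query, which strips the `Item:Q` prefix, this sketch keeps the leading `Q`.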

physikerwelt commented 3 months ago

This works only if the CirrusSearch dump of the item (e.g. https://portal.mardi4nfdi.de/wiki/Item:Q576712?action=cirrusdump) includes the statement with the DE number:


      "statement_keywords": [
        "P1451=2559697"
      ],
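A small sketch of that check. It assumes `source_doc` is the already-parsed JSON object that carries the `statement_keywords` field; how that object is nested inside the cirrusdump output is not shown here, and the function name is hypothetical.

```python
def has_statement(source_doc, prop, value):
    """Return True if the indexed document lists the statement prop=value."""
    return f"{prop}={value}" in source_doc.get("statement_keywords", [])


# Fragment from the dump shown above.
doc = {"statement_keywords": ["P1451=2559697"]}
print(has_statement(doc, "P1451", "2559697"))
```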

Daniel-Mietchen commented 3 months ago

Here is another variant (88ms):

PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
PREFIX wd: <https://portal.mardi4nfdi.de/entity/>
SELECT 
(REPLACE(STR(?item), ".*Q", "Q") AS ?qid) 

WHERE {
  VALUES ?de  { "2636744" "6895265" }
  ?item wdt:P1451 ?de
}
LIMIT 2
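
For more than a handful of DE numbers, the VALUES block could be generated in batches. This is a sketch, not part of the importer: the function names are hypothetical, `?de` is added to the SELECT clause so that results can be matched back to their input, and the LIMIT is dropped.

```python
QUERY_TEMPLATE = """PREFIX wdt: <https://portal.mardi4nfdi.de/prop/direct/>
SELECT (REPLACE(STR(?item), ".*Q", "Q") AS ?qid) ?de
WHERE {{
  VALUES ?de {{ {values} }}
  ?item wdt:P1451 ?de
}}"""


def build_query(de_numbers):
    """Return a SPARQL query that looks up the items for the given DE numbers."""
    for de in de_numbers:
        # Accept plain ASCII digits only, so nothing else ends up in the query.
        if not (de.isascii() and de.isdigit()):
            raise ValueError(f"not a DE number: {de!r}")
    values = " ".join(f'"{de}"' for de in de_numbers)
    return QUERY_TEMPLATE.format(values=values)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(build_query(["2636744", "6895265"]))
```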

physikerwelt commented 3 months ago

The first 100 imported items look good

root@91e78119efac:/var/www/html/maintenance# ./run ../extensions/MathSearch/maintenance/ImportOpenAlex.php oa.csv && ./run runJobs
2024-03-19 10:01:12 OpenAlex Special: jobname=openalex240319110312 rows=array(101) segment=0 requestId=a454d3e3cea7ae04cdca07fe namespace=-1 title= (uuid=2aa53c5dee784e449d495d10a5daae4a,timestamp=1710842472) STARTING
2024-03-19 10:01:49 OpenAlex Special: jobname=openalex240319110312 rows=array(101) segment=0 requestId=a454d3e3cea7ae04cdca07fe namespace=-1 title= (uuid=2aa53c5dee784e449d495d10a5daae4a,timestamp=1710842472) t=36813 good

https://portal.mardi4nfdi.de/wiki/Special:Contributions/Openalex240319110312

physikerwelt commented 3 months ago

Script is now running.

root@c46a4d101259:/var/www/html/maintenance# ./run ../extensions/MathSearch/maintenance/ImportOpenAlex.php /open_alex_data.csv && echo "done"
...
Push jobs to segment 24575.
Push jobs to segment 24576.
Pushed last 24577.
done

Progress can be tracked here: https://portal.mardi4nfdi.de/wiki/Special:Contributions/Openalex240319020357. Currently about 1% (20k of 2M) is done.

physikerwelt commented 3 months ago

ETA 12h ... might be done tomorrow morning

Daniel-Mietchen commented 3 months ago

@physikerwelt What do you think of documenting such user accounts in a way that allows others to trace the edits to a specific ticket or pull request or some such?

See https://portal.mardi4nfdi.de/w/index.php?title=User:Openalex240319020357&oldid=31230691 for an example.

physikerwelt commented 3 months ago

@Daniel-Mietchen yes. I did that for some already. One could also add a bit of automation that would describe which command had been entered and when in the terminal... I am not sure if that would help much?

Daniel-Mietchen commented 3 months ago

> @Daniel-Mietchen yes. I did that for some already. One could also add a bit of automation that would describe which command had been entered and when in the terminal... I am not sure if that would help much?

It would be better than having no documentation. If you see a way to automate that even just a bit, great.

physikerwelt commented 3 months ago

> It would be better than having no documentation. If you see a way to automate that even just a bit, great.

@Daniel-Mietchen it depends. The username already says which script was executed and when. If the username is red, it means there is no additional information; if it is blue, one sees that someone entered information. Generating user profile pages automatically would destroy the blue/red logic.
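
A sketch of how such a username could be decoded. It assumes the name is a script prefix followed by twelve digits whose first six are the date as yymmdd; this is inferred from the account names in this thread, not from the script itself. The last six digits are returned undecoded, because the job name suffix 110312 in the log above does not match the logged time 10:01:12 digit by digit. The function name is hypothetical.

```python
import re
from datetime import datetime


def decode_import_account(username):
    """Split an import account name into (prefix, run date, raw time suffix)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d{6})(\d{6})", username)
    if match is None:
        raise ValueError(f"unexpected account name: {username!r}")
    prefix, date_part, raw_suffix = match.groups()
    # Assumption: the first six digits are yymmdd.
    run_date = datetime.strptime(date_part, "%y%m%d").date()
    return prefix, run_date, raw_suffix


print(decode_import_account("Openalex240319020357"))
```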

physikerwelt commented 3 months ago

The job was completed. However, not everything could be imported.

root@4dd0d8795c52:/var/www/html/maintenance# wc -l /open_alex_data.csv 
2486267 /open_alex_data.csv

vs 1,841,457 edits.
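
A rough check of the gap, as a sketch. It assumes that the file has one header line, that every remaining line is one record, and that each imported row produced exactly one edit; none of these is confirmed here.

```python
csv_lines = 2_486_267   # from `wc -l /open_alex_data.csv`
edits = 1_841_457       # edits made by the import account

data_rows = csv_lines - 1          # assumption: one header line
gap = data_rows - edits            # rows without a matching edit
share = gap / data_rows

print(f"{gap:,} rows without a matching edit ({share:.1%})")
# prints "644,809 rows without a matching edit (25.9%)"
```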

The problems with the first run, which were reverted, are on the order of 1%. Thus they do not explain the more than 0.5M missing entries. I suggest we close this for now and re-run the script when @LizzAlice reimports the new data from zbMATH Open.

@timconrad I don't know exactly how many new titles have been inserted. Can you give an estimate of how many titles are missing?