BioComputingUP / IDP-KG

Scripts and notebooks for generating and analysing the IDP-KG.
https://biocomputingup.github.io/IDP-KG/
Apache License 2.0

Full scrape data includes some other markup and junk pages #10

AlasdairGray closed this issue 3 years ago

AlasdairGray commented 3 years ago

At the moment, pages without a schema:Protein type are ignored. It would be good to check what other types are in the full scrape and to grab some of that data. In particular, there is data about Dataset and DataCatalog.

To Do:

AlasdairGray commented 3 years ago

Only the file for the homepage contains markup, which is about the DataCatalog, Dataset, Citation, Organization, and contact person.

roqet -i sparql11 -e 'SELECT * WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}' -D 1.nq
roqet: Running query 'SELECT * WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o}'
roqet: Query has a variable bindings result
row: [s=uri<https://disprot.org/#DataCatalog>, o=uri<https://schema.org/DataCatalog>]
row: [s=uri<https://bioschemas.org/profiles/DataCatalog/0.3-RELEASE-2019_07_01>, o=uri<https://schema.org/CreativeWork>]
row: [s=uri<https://doi.org/10.1093/nar/gkz975>, o=uri<https://schema.org/ScholarlyArticle>]
row: [s=uri<https://disprot.org/#2020-12>, o=uri<https://schema.org/Dataset>]
row: [s=uri<https://bioschemas.org/profiles/Dataset/0.3-RELEASE-2019_06_14>, o=uri<https://schema.org/CreativeWork>]
row: [s=uri<https://biocomputingup.it/#Organization>, o=uri<https://schema.org/Organization>]
row: [s=uri<https://creativecommons.org/licensEs/by/4.0/>, o=uri<https://schema.org/CreativeWork>]
row: [s=uri<https://bioschemas.org/crawl/v1/disprot/disprot/20210813/1/disprot.org/748790195>, o=uri<https://schema.org/Person>]
row: [s=uri<https://bioschemas.org/profiles/Organization/0.2-DRAFT-2019_07_19>, o=uri<https://schema.org/CreativeWork>]
roqet: Query returned 9 results
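
To run the same type check over every file in the full scrape rather than one file at a time, something like the following could work. It is only a sketch: the function names `count_types` and `count_types_in_dir` and the directory layout are assumptions, not part of this repository, and the regular expression handles only plain `rdf:type` statements with an IRI object. A real RDF library such as rdflib would be more robust than this hand-rolled matching.

```python
import re
from collections import Counter
from pathlib import Path

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Matches "<subject IRI or blank node> rdf:type <object IRI>" at the start
# of an N-Quads line. This is a simplification, not a full N-Quads parser.
TYPE_QUAD = re.compile(
    r"^\s*(?:<[^>]*>|_:\S+)\s+" + re.escape(RDF_TYPE) + r"\s+<([^>]*)>"
)


def count_types(lines):
    """Tally the object IRI of every rdf:type statement in N-Quads lines."""
    counts = Counter()
    for line in lines:
        match = TYPE_QUAD.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_types_in_dir(scrape_dir):
    """Tally types across all *.nq files below scrape_dir (assumed layout)."""
    totals = Counter()
    for path in Path(scrape_dir).rglob("*.nq"):
        with path.open(encoding="utf-8") as handle:
            totals.update(count_types(handle))
    return totals


# Example usage (the directory name is hypothetical):
# for type_iri, n in count_types_in_dir("scrape/").most_common():
#     print(f"{n}\t{type_iri}")
```

Sorting the tally by frequency should make it easy to see which types other than schema:Protein appear in the scrape, and which files are worth a closer look with roqet.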

AlasdairGray commented 3 years ago

The MobiDB and PED homepages are affected by the BMUSE bug reported at https://github.com/HW-SWeL/BMUSE/issues/79. Their scraped files contain no content.

AlasdairGray commented 3 years ago

Need to add Dataset and DataCatalog queries: