BioComputingUP / IDP-KG

Scripts and notebooks for generating and analysing the IDP-KG.
https://biocomputingup.github.io/IDP-KG/
Apache License 2.0

Compare IDP-KG stats with numbers from Ivan #12

Closed AlasdairGray closed 3 years ago

AlasdairGray commented 3 years ago

Message received from Ivan with details of overlap between the three datasets (summarised in attached figure):

For the numbers, I noted something like this:

  • common entries (overlapping entries) in DisProt and PED: 40; unique to DisProt: 1414; unique to PED: 42
  • common in MobiDB and DisProt: 1421; unique to MobiDB: 653; unique to DisProt: 33
  • common in MobiDB and PED: 44; unique to MobiDB: 2030; unique to PED: 38
  • common across all 3 datasets: 38
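
As a quick sanity check on the figures above, each pairwise line implies a size for both datasets it mentions (common plus unique), and the three lines should agree with one another. The following is a minimal sketch in Python; the constant and function names are illustrative and not part of the IDP-KG repository.

```python
# Pairwise figures as quoted in the message:
# (A, B) -> (common to both, unique to A, unique to B)
PAIRWISE = {
    ("DisProt", "PED"): (40, 1414, 42),
    ("MobiDB", "DisProt"): (1421, 653, 33),
    ("MobiDB", "PED"): (44, 2030, 38),
}
COMMON_TO_ALL = 38


def implied_sizes(pairwise):
    """Collect every dataset size implied by the pairwise figures."""
    sizes = {}
    for (a, b), (common, only_a, only_b) in pairwise.items():
        sizes.setdefault(a, set()).add(common + only_a)
        sizes.setdefault(b, set()).add(common + only_b)
    return sizes


def implied_union(pairwise, common_to_all):
    """Inclusion-exclusion over three sets, given consistent sizes."""
    sizes = implied_sizes(pairwise)
    if any(len(values) != 1 for values in sizes.values()):
        raise ValueError("pairwise figures disagree on a dataset size")
    total = sum(next(iter(values)) for values in sizes.values())
    pair_overlap = sum(common for common, _, _ in pairwise.values())
    return total - pair_overlap + common_to_all


sizes = implied_sizes(PAIRWISE)
union = implied_union(PAIRWISE, COMMON_TO_ALL)
for name in sorted(sizes):
    print(name, sorted(sizes[name]))
print("implied union:", union)
```

With the quoted figures, all three lines agree on 1454 DisProt, 2074 MobiDB and 82 PED entries.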

This was before the assignment of a list of UniProt proteins to each PED entry, so I'll redo the numbers when I get into the lab. As I recall, only PED used isoform identifiers for some of its entries, so this may cause problems when matching data across datasets.
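
One possible way to handle the isoform issue is to reduce every identifier to its canonical accession before comparing datasets. The sketch below assumes that isoform identifiers follow the usual UniProt pattern of an accession followed by a hyphen and a number (for example `P12345-2`). The function name and the sample identifiers are illustrative only and are not taken from PED or DisProt.

```python
import re

# Assumed isoform pattern: "<accession>-<isoform number>".
_ISOFORM_SUFFIX = re.compile(r"-\d+$")


def canonical_accession(accession: str) -> str:
    """Strip a trailing isoform suffix, if any, from a UniProt accession."""
    return _ISOFORM_SUFFIX.sub("", accession.strip())


# Illustrative identifiers, not real dataset contents.
ped_ids = {"P12345-2", "Q00001"}
disprot_ids = {"P12345", "Q00001"}

raw_overlap = ped_ids & disprot_ids
normalised_overlap = {canonical_accession(i) for i in ped_ids} & {
    canonical_accession(i) for i in disprot_ids
}

print(sorted(raw_overlap))
print(sorted(normalised_overlap))
```

Comparing raw identifiers misses the isoform entry, while the normalised comparison matches it. Whether isoforms should be collapsed in this way is a modelling decision for the knowledge graph.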

AlasdairGray commented 3 years ago

Current response over IDP-KG (Full) based on commit ac30059046932f00c94f9f301d0f08d29a89143d

|   | description | count |
|---|---|---|
| 1 | Distinct Proteins (Union) | "2289"^^xsd:integer |
| 2 | DisProt Proteins | "1615"^^xsd:integer |
| 3 | MobiDB Proteins | "2073"^^xsd:integer |
| 4 | PED Proteins | "83"^^xsd:integer |
| 5 | DisProt \ (MobiDB U PED) | "179"^^xsd:integer |
| 6 | MobiDB \ (DisProt U PED) | "637"^^xsd:integer |
| 7 | PED \ (DisProt U MobiDB) | "33"^^xsd:integer |
| 8 | (DisProt U MobiDB) | "2256"^^xsd:integer |
| 9 | (DisProt U PED) | "1652"^^xsd:integer |
| 10 | (MobiDB U PED) | "2110"^^xsd:integer |
| 11 | DisProt n MobiDB | "1432"^^xsd:integer |
| 12 | DisProt n PED | "46"^^xsd:integer |
| 13 | MobiDB n PED | "46"^^xsd:integer |
| 14 | (DisProt n MobiDB) \ PED | "1390"^^xsd:integer |
| 15 | (DisProt n PED) \ MobiDB | "4"^^xsd:integer |
| 16 | (MobiDB n PED) \ DisProt | "4"^^xsd:integer |
| 17 | DisProt n MobiDB n PED | "42"^^xsd:integer |
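
For comparison with numbers computed outside the knowledge graph, the same 17 rows can be reproduced from three plain sets of protein identifiers. This is a rough sketch in Python rather than the SPARQL query used for the response above; `overlap_report` and the toy identifiers are hypothetical.

```python
def overlap_report(disprot, mobidb, ped):
    """Return (description, count) pairs in the same order as the table."""
    d, m, p = set(disprot), set(mobidb), set(ped)
    return [
        ("Distinct Proteins (Union)", len(d | m | p)),
        ("DisProt Proteins", len(d)),
        ("MobiDB Proteins", len(m)),
        ("PED Proteins", len(p)),
        ("DisProt \\ (MobiDB U PED)", len(d - (m | p))),
        ("MobiDB \\ (DisProt U PED)", len(m - (d | p))),
        ("PED \\ (DisProt U MobiDB)", len(p - (d | m))),
        ("(DisProt U MobiDB)", len(d | m)),
        ("(DisProt U PED)", len(d | p)),
        ("(MobiDB U PED)", len(m | p)),
        ("DisProt n MobiDB", len(d & m)),
        ("DisProt n PED", len(d & p)),
        ("MobiDB n PED", len(m & p)),
        ("(DisProt n MobiDB) \\ PED", len((d & m) - p)),
        ("(DisProt n PED) \\ MobiDB", len((d & p) - m)),
        ("(MobiDB n PED) \\ DisProt", len((m & p) - d)),
        ("DisProt n MobiDB n PED", len(d & m & p)),
    ]


# Toy identifiers, used only to exercise the function.
report = overlap_report(
    disprot={"A", "B", "C", "D"},
    mobidb={"B", "C", "E"},
    ped={"C", "D", "F"},
)
counts = [count for _, count in report]
for number, (description, count) in enumerate(report, start=1):
    print(number, description, count)
```

Feeding in identifier lists exported from each source would give a table that can be compared directly against the query response.
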

AlasdairGray commented 3 years ago

Running query over full scrape dataset in commit 4b4c7ae5db27e2bc39267140080f55c3b8ebb9de

| x | description | count |
|---|---|---|
| 1 | Distinct Proteins (Union) | "2718"^^xsd:integer |
| 2 | DisProt Proteins | "2062"^^xsd:integer |
| 3 | MobiDB Proteins | "2073"^^xsd:integer |
| 4 | PED Proteins | "90"^^xsd:integer |
| 5 | DisProt \ (MobiDB U PED) | "604"^^xsd:integer |
| 6 | MobiDB \ (DisProt U PED) | "617"^^xsd:integer |
| 7 | PED \ (DisProt U MobiDB) | "34"^^xsd:integer |
| 8 | (DisProt U MobiDB) | "2684"^^xsd:integer |
| 9 | (DisProt U PED) | "2101"^^xsd:integer |
| 10 | (MobiDB U PED) | "2114"^^xsd:integer |
| 11 | DisProt n MobiDB | "1451"^^xsd:integer |
| 12 | DisProt n PED | "51"^^xsd:integer |
| 13 | MobiDB n PED | "49"^^xsd:integer |
| 14 | (DisProt n MobiDB) \ PED | "1407"^^xsd:integer |
| 15 | (DisProt n PED) \ MobiDB | "7"^^xsd:integer |
| 16 | (MobiDB n PED) \ DisProt | "5"^^xsd:integer |
| 17 | DisProt n MobiDB n PED | "44"^^xsd:integer |
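
The two responses can be checked for internal consistency and compared row by row. The sketch below copies the counts from the two tables in this thread, keyed by the first seven characters of each commit hash. The helper names are illustrative, and the checks are ordinary set identities rather than anything specific to the IDP-KG query.

```python
# Counts for rows 1-17, copied from the two query responses.
RUNS = {
    "ac30059": [2289, 1615, 2073, 83, 179, 637, 33, 2256, 1652, 2110,
                1432, 46, 46, 1390, 4, 4, 42],
    "4b4c7ae": [2718, 2062, 2073, 90, 604, 617, 34, 2684, 2101, 2114,
                1451, 51, 49, 1407, 7, 5, 44],
}


def is_consistent(rows):
    """Check the totals against the seven disjoint regions (rows 5-7, 14-17)."""
    r = dict(enumerate(rows, start=1))  # 1-based, matching the table
    return (
        r[1] == r[5] + r[6] + r[7] + r[14] + r[15] + r[16] + r[17]
        and r[2] == r[5] + r[14] + r[15] + r[17]
        and r[3] == r[6] + r[14] + r[16] + r[17]
        and r[4] == r[7] + r[15] + r[16] + r[17]
        and r[8] == r[2] + r[3] - r[11]
        and r[9] == r[2] + r[4] - r[12]
        and r[10] == r[3] + r[4] - r[13]
    )


def deltas(old, new):
    """Row-by-row change between two runs."""
    return [b - a for a, b in zip(old, new)]


consistent = {commit: is_consistent(rows) for commit, rows in RUNS.items()}
change = deltas(RUNS["ac30059"], RUNS["4b4c7ae"])
print(consistent)
print(change)
```

Both tables pass these checks. The change between the runs is concentrated in DisProt (1615 to 2062 proteins), while the MobiDB count is unchanged at 2073.
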

AlasdairGray commented 3 years ago

Analysis now taking place in GSheet.

Found that there are 147 deprecated proteins in DisProt, and one MobiDB entry that did not scrape properly.

AlasdairGray commented 3 years ago

The 2021-09-28 version is now in sync with what is available on the websites of the data sources. The discrepancies had two causes: named graphs were being used as proxies for pages, and deprecated proteins were included in the sitemaps.