JJAlmagro / subcellular_localization


DeepLoc vs. current SwissProt presents several inconsistencies #4

Open sacdallago opened 3 years ago

sacdallago commented 3 years ago

Hi,

I'm trying to replicate your dataset (downloadable from http://www.cbs.dtu.dk/services/DeepLoc/data.php), but using the current SwissProt release instead.

  1. I download the most recent SwissProt version
  2. I filter by ECO:0000269 -- experimental evidence used in manual assertion -- EXP
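For reference, step 2 can be sketched roughly as follows. This assumes the SUBCELLULAR LOCATION comment of an entry has already been joined into a single string, that evidence tags appear in braces after each term, and that topology/orientation follows a semicolon. The function name `experimental_locations` and the sample comment are made up for illustration; this is not necessarily how DeepLoc did its filtering.

```python
import re

EXPERIMENTAL = "ECO:0000269"  # experimental evidence used in manual assertion

def experimental_locations(comment):
    """Return location names tagged with ECO:0000269, in order of appearance."""
    # Drop the comment header and any trailing free-text note.
    text = comment.split("SUBCELLULAR LOCATION:", 1)[-1]
    text = text.split("Note=")[0]
    found = []
    # Each location statement ends with "}." once its evidence tag closes.
    for statement in text.split("}."):
        # Keep only the location itself, not topology/orientation after ';'.
        location_part = statement.split(";")[0]
        match = re.match(r"\s*([^{]+?)\s*\{([^}]*)", location_part)
        if match is None:
            continue
        name, evidence = match.group(1), match.group(2)
        if EXPERIMENTAL in evidence and name not in found:
            found.append(name)
    return found

# Hypothetical comment text, for illustration only.
sample = (
    "SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|PubMed:12345}. "
    "Nucleus {ECO:0000250|UniProtKB:P12345}. "
    "Cell membrane {ECO:0000269|PubMed:678}; "
    "Single-pass membrane protein {ECO:0000255}."
)
print(experimental_locations(sample))
```

Statements without an evidence tag, or with isoform prefixes, are not handled here and would need extra care.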

Now the issues start:

There is no convenient way to map SwissProt "sublocations" to your "locations", as in Table 1 of https://academic.oup.com/bioinformatics/article/33/21/3387/3931857

In addition to this table, a CSV (or Excel) file would have been nice. Something like:

| DeepLoc | SwissProt | SwissProt Ontology |
|---|---|---|
| Cell.membrane | Apical cell membrane | SL-0015 |
| Cell.membrane | Apicolateral cell membrane | SL-0017 |
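With such a file, the mapping would become a simple lookup. A minimal sketch, assuming a comma-separated file with the three column headers above; the two rows are the ones from the example, and `load_mapping` and `to_deeploc` are hypothetical helper names:

```python
import csv
import io

# Stand-in for the hypothetical mapping file; only the two example rows.
MAPPING_CSV = """DeepLoc,SwissProt,SwissProt Ontology
Cell.membrane,Apical cell membrane,SL-0015
Cell.membrane,Apicolateral cell membrane,SL-0017
"""

def load_mapping(handle):
    """Build a SwissProt-name -> DeepLoc-name dictionary from a CSV handle."""
    return {row["SwissProt"]: row["DeepLoc"] for row in csv.DictReader(handle)}

mapping = load_mapping(io.StringIO(MAPPING_CSV))

def to_deeploc(swissprot_location):
    """Return the DeepLoc class, or None for a term the table does not cover."""
    return mapping.get(swissprot_location)

print(to_deeploc("Apical cell membrane"))
print(to_deeploc("Glycosome"))
```

Returning `None` for uncovered terms makes it easy to list every SwissProt term that still needs a manual decision.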

I instead tried 2 things:

Merge SwissProt and DeepLoc annotations by accession number:

Since some proteins have multiple SwissProt and/or multiple DeepLoc annotations, this results in something like:

| swissprot | deeploc_0 | deeploc_1 | deeploc_2 | deeploc_3 | deeploc_4 | deeploc_5 | deeploc_6 | deeploc_7 | deeploc_8 |
|---|---|---|---|---|---|---|---|---|---|
| Cytoplasm | Cytoplasm | Nucleus | Peroxisome | Cell.membrane | Mitochondrion | Extracellular | Endoplasmic.reticulum | Lysosome/Vacuole | Golgi.apparatus |

where I then have to manually select which "deeploc_X" is the correct mapping from SwissProt. Unfortunately, this procedure highlighted some inconsistencies, for example SwissProt localizations never mentioned in Table 1 but associated with one or more DeepLoc localizations. An excerpt of things that didn't quite look right:

| SwissProt | DeepLoc(s) |
|---|---|
| Cleavage furrow | Extracellular |
| Cytoplasmic granule lumen | Extracellular |
| Glycosome | Peroxisome |
| Sarcoplasmic reticulum lumen | Extracellular |
| Recycling endosome | Cell.membrane, Cytoplasm |
| Recycling endosome membrane | Cell.membrane |
| Cell surface | Extracellular |
| Cytoplasmic granule | Plastid, Nucleus, Cytoplasm |
| Cytoplasmic granule lumen | Extracellular |

While things like "Glycosome" being mapped to "Peroxisome" are not a big deal, this was never mentioned in Table 1. It could derive from the version difference between SwissProt today and in 2016, but it is worth mentioning. Other mappings seem far-fetched (e.g. "Cytoplasmic granule lumen" == "Extracellular").
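The merge in this first approach amounts to roughly the following. This is a simplified sketch, not the exact code from my repository: both inputs are assumed to be dictionaries from accession number to a set of location names, and the accessions `ACC1` to `ACC3` are placeholders.

```python
from collections import defaultdict

def cooccurrence(swissprot, deeploc):
    """For each SwissProt location, collect every DeepLoc location that
    occurs on a protein (shared accession) carrying that SwissProt location."""
    seen = defaultdict(set)
    for accession in swissprot.keys() & deeploc.keys():
        for sp_location in swissprot[accession]:
            seen[sp_location] |= deeploc[accession]
    return dict(seen)

# Placeholder data: accession -> set of location names.
swissprot = {
    "ACC1": {"Glycosome"},
    "ACC2": {"Recycling endosome", "Cytoplasm"},
    "ACC3": {"Nucleus"},  # absent from DeepLoc, so it is ignored
}
deeploc = {
    "ACC1": {"Peroxisome"},
    "ACC2": {"Cell.membrane", "Cytoplasm"},
}

table = cooccurrence(swissprot, deeploc)
for sp_location in sorted(table):
    print(sp_location, "->", ", ".join(sorted(table[sp_location])))
```

Because a multi-location protein pairs every SwissProt term with every DeepLoc term, the result over-generates candidates, which is why manual selection is still needed afterwards.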

Filter SwissProt and DeepLoc for proteins with single localizations (each set separately), then merge by means of accession numbers

The idea behind this was to have a single, unequivocal mapping from SwissProt to DeepLoc. Unfortunately, this highlighted some other inconsistencies:

| SwissProt | DeepLoc(s) |
|---|---|
| Cytoplasm | Cytoplasm, Nucleus, Peroxisome |
| Endoplasmic reticulum | Endoplasmic.reticulum, Nucleus |
| Mitochondrion | Mitochondrion, Extracellular |
| Nucleus | Nucleus, Cytoplasm |
| Peroxisome | Peroxisome, Cytoplasm |
| Plastid | Plastid, Cytoplasm, Endoplasmic.reticulum |

In this case, there shouldn't be more than one mapping. What this suggests is that there are proteins marked as "Plastid" in SwissProt that are marked as "Plastid", "Cytoplasm" or "Endoplasmic.reticulum" in DeepLoc. While this might be the natural result of better curation in SwissProt over time, it shows that the DeepLoc set on the webpage is no longer up to date. And in the absence of a clear, unequivocal mapping from SwissProt location names to DeepLoc classes, it is virtually impossible to build a new "DeepLoc" training set.
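The second approach boils down to something like the sketch below. Again this is simplified: inputs are assumed to be dictionaries from accession number to a set of location names, `single_label` and `ambiguous_mappings` are illustrative names, and the accessions are placeholders.

```python
def single_label(annotations):
    """Keep only proteins with exactly one location: accession -> location."""
    return {
        accession: next(iter(locations))
        for accession, locations in annotations.items()
        if len(locations) == 1
    }

def ambiguous_mappings(swissprot, deeploc):
    """Return SwissProt locations that map to more than one DeepLoc class
    among proteins that are single-location in both sets."""
    sp = single_label(swissprot)
    dl = single_label(deeploc)
    mapping = {}
    for accession in sp.keys() & dl.keys():
        mapping.setdefault(sp[accession], set()).add(dl[accession])
    return {loc: targets for loc, targets in mapping.items() if len(targets) > 1}

# Placeholder data: accession -> set of location names.
swissprot = {
    "ACC1": {"Plastid"},
    "ACC2": {"Plastid"},
    "ACC3": {"Nucleus"},
    "ACC4": {"Nucleus", "Cytoplasm"},  # multi-location, filtered out
}
deeploc = {
    "ACC1": {"Plastid"},
    "ACC2": {"Cytoplasm"},
    "ACC3": {"Nucleus"},
    "ACC4": {"Nucleus"},
}

conflicts = ambiguous_mappings(swissprot, deeploc)
print(conflicts)
```

Any SwissProt location that shows up in `conflicts` has at least one protein whose annotation disagrees between the two sets.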

EDIT: Procedure up until now detailed: https://github.com/sacdallago/deeploc_redo

JJAlmagro commented 3 years ago

Hi Christian,

Yes, you are right that, due to the evolution of the annotations in UniProt, a protein that had one annotation in 2016 might have a different one now. For example, more experimental annotations might have been added to a protein, so that what used to be a single-localization protein now has more than one experimental localization.

What do you want to achieve exactly? The same DeepLoc training set with up-to-date annotations?

sacdallago commented 3 years ago

Hi @JJAlmagro , thanks for getting back to me so quickly :) Hope you are doing well!

> What do you want to achieve exactly? The same DeepLoc training set with up-to-date annotations?

Yes. The goal would be to re-create a DeepLoc-style training (and testing) set from current SwissProt. I just realized from the statistics I got out over the weekend that, while the distributions look similar to what you have in the paper, I get much higher numbers, probably because I don't know how you programmatically removed incomplete sequences (and because the set is not yet redundancy reduced!). It would be great if you had any scripts lying around that you used for filtering SwissProt before redundancy reduction, or before splitting into train/test :)

FYI, my current numbers (again: no sequence length filter, no incomplete sequence filter, no redundancy reduction; all of SwissProt, but with the mapping as MANUAL_MAP here):

Nucleus,10653
Cytoplasm,10335
Extracellular,6725
Cell.membrane,6119
Mitochondrion,2740
Endoplasmic.reticulum,2185
Lysosome/Vacuole,1482
Golgi.apparatus,1328
Plastid,1197
Peroxisome,300
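For context, a tally in this "location,count" format can be obtained along these lines. This is a simplified sketch with placeholder accessions, assuming a dictionary from accession number to a set of DeepLoc classes and assuming that a multi-location protein is counted once per location:

```python
from collections import Counter

def location_counts(annotations):
    """Count how many proteins carry each location (one count per location)."""
    counts = Counter()
    for locations in annotations.values():
        counts.update(locations)
    return counts

# Placeholder data: accession -> set of DeepLoc classes.
toy = {
    "ACC1": {"Nucleus", "Cytoplasm"},
    "ACC2": {"Nucleus"},
    "ACC3": {"Peroxisome"},
}

for location, n in location_counts(toy).most_common():
    print(f"{location},{n}")
```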