fmicompbio / einprot

Proteomics analysis workflows
https://fmicompbio.github.io/einprot/

additional & multi species support #16

Open tobiasko opened 9 months ago

tobiasko commented 9 months ago

Dear einprot developers,

Currently einprot supports the following species

> getSupportedSpecies()
    taxId                         species        speciesCommon
1   10090                    Mus musculus                mouse
2    9606                    Homo sapiens                human
3    6239          Caenorhabditis elegans            roundworm
4    7955                     Danio rerio            zebrafish
5    7227         Drosophila melanogaster             fruitfly
6    4932        Saccharomyces cerevisiae        baker's yeast
7  284812 Schizosaccharomyces pombe 972h-        fission yeast
8   28377             Anolis carolinensis          green anole
9    9913                      Bos taurus               bovine
10   9615          Canis lupus familiaris                  dog
11   9796                  Equus caballus                horse
12   9685                     Felis catus                  cat
13   9031                   Gallus gallus              chicken
14   9544                  Macaca mulatta       rhesus macaque
15  13616           Monodelphis domestica              opossum
16   9258        Ornithorhynchus anatinus             platypus
17   9598                 Pan troglodytes           chimpanzee
18  10116               Rattus norvegicus           Norway rat
19   9823                      Sus scrofa                  pig
20   8364              Xenopus tropicalis tropical clawed frog

In order to analyse hybrid proteome samples like PXD028735, it would be nice to include:

Escherichia coli; NCBI TaxID: 562
Candida albicans (Yeast); NCBI TaxID: 5476

Would that be possible?

Related to this, the description of the species argument in the function runDIANNAnalysis() suggests that multi-species analysis is in general not supported:

Character scalar providing the species. Must be one of the supported species (see getSupportedSpecies()). Either the common or the scientific name can be used.
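To illustrate what this description implies, here is a minimal base R sketch of a single-species lookup that accepts either name. The `supported` table is a hand-copied subset of the `getSupportedSpecies()` output above, and `resolveSpecies()` is a hypothetical helper written for this example, not an einprot function; exact (case-sensitive) matching is an assumption.

```r
# Subset of the getSupportedSpecies() output shown above (copied by hand).
supported <- data.frame(
    taxId = c(10090, 9606, 4932),
    species = c("Mus musculus", "Homo sapiens", "Saccharomyces cerevisiae"),
    speciesCommon = c("mouse", "human", "baker's yeast"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: map a common or scientific name to one table row.
resolveSpecies <- function(species, tbl = supported) {
    # A "character scalar" is a character vector of length one,
    # so a vector of several species is rejected here.
    stopifnot(is.character(species), length(species) == 1)
    # Assumption: names must match exactly (case-sensitive).
    hit <- which(tbl$species == species | tbl$speciesCommon == species)
    if (length(hit) != 1) {
        stop("Unsupported species: ", species)
    }
    tbl[hit, , drop = FALSE]
}

resolveSpecies("human")$taxId
resolveSpecies("Mus musculus")$speciesCommon
```

With a check like this, both `c("human", "mouse")` and `"Escherichia coli"` raise an error, which is why hybrid samples cannot be described through this argument alone.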

I think that multi-species analysis is actually rather common, since many samples also contain host-factor contamination, or people submit samples derived from virus-infected cells. For all of these cases a multi-species analysis could really help. I am also missing a species annotation for the feature/row data in the SummarizedExperiment container.

What do you think?

csoneson commented 9 months ago

Hi @tobiasko - thanks for the suggestion! In general I agree with you. The reason behind the list of supported species above is that those are the species supported by msigdbr, which we use to get GO annotations for the enrichment analysis. If no enrichment analysis is required, any species can be specified (as long as it's a single string). In that sense, analysis of multi-species experiments (or experiments with species other than those in the list) is supported for most of the workflow. I think the only other places it uses the species information are for the STRING annotations and for potentially adding species-specific columns to the output tables; here we can expand the list.

Now, for adding a column to the rowData, I think we would need to get the information about the species from the quantification file (unless we want to make too many assumptions). I'm not sure to what extent this information can be assumed to be available.

I'm trying to make a list of analyses that one might want to do specifically with multi-species experiments in einprot - I guess one would like to be able to annotate/explore each set of features separately, and maybe get separate summaries?

tobiasko commented 9 months ago

> Now, for adding a column to the rowData, I think we would need to get the information about the species from the quantification file (unless we want to make too many assumptions). I'm not sure to what extent this information can be assumed to be available.

There is one way to do this if people stick to UniProt FASTA databases:

The column Protein.Names in the main report contains the UniProt entry name, which can be split at the underscore into two fields. The second field is always the species mnemonic, for example HUMAN in RL4_HUMAN.

https://rest.uniprot.org/uniprotkb/P36578.txt
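As a minimal sketch of this idea in base R, the species field can be taken as everything after the last underscore. `extractSpeciesCode()` is a hypothetical helper name, and the sketch assumes each value holds exactly one UniProt entry name.

```r
# Hypothetical helper: return the part after the last underscore of a
# UniProt entry name such as "RL4_HUMAN".
extractSpeciesCode <- function(entryName) {
    # Entries without an underscore carry no species field, so return NA.
    ifelse(grepl("_", entryName, fixed = TRUE),
           sub("^.*_", "", entryName),
           NA_character_)
}

extractSpeciesCode(c("RL4_HUMAN", "H7C2J9_HUMAN", "P36578"))
```

Returning `NA` for values without an underscore avoids silently treating a bare accession as a species.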

csoneson commented 9 months ago

Right. We could potentially have an additional argument (speciesCol), similar to the arguments to define the IDs and labels, where the user would provide a function to extract the species information from the imported columns.

tobiasko commented 9 months ago

Ahhhh there is a problem: The column might also contain a list like:

H7C2J9_HUMAN;Q5VWC4_HUMAN

and it could happen that the group shows something like

XXX_speciesA;YYY_speciesB

A multi-species protein group!

csoneson commented 9 months ago

🙈 Yeah, in this case I think the function would have to either choose one species, or give back speciesA;speciesB (and then perhaps this protein group would not be highlighted as belonging to any of the species specifically, but that's probably also the most conservative way of treating it).
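The second option could be sketched in base R as follows. `groupSpecies()` is a hypothetical helper, not part of einprot, and the sketch assumes semicolon-separated UniProt entry names as in the examples above. It returns a single species when all members of a group agree, and the combined `speciesA;speciesB` form otherwise.

```r
# Hypothetical helper: summarise the species of each protein group, given
# as a semicolon-separated string of UniProt entry names.
groupSpecies <- function(proteinNames) {
    vapply(strsplit(proteinNames, ";", fixed = TRUE), function(entries) {
        # Keep the part after the last underscore of every group member.
        codes <- unique(sub("^.*_", "", entries))
        # One code if all members agree, otherwise all codes joined by ";".
        paste(sort(codes), collapse = ";")
    }, character(1))
}

proteinNames <- c("H7C2J9_HUMAN;Q5VWC4_HUMAN", "XXX_speciesA;YYY_speciesB")
species <- groupSpecies(proteinNames)

# Groups spanning several species contain a ";" and can be left unassigned.
isAmbiguous <- grepl(";", species, fixed = TRUE)
```

A function of this shape is the kind of thing a user might supply through the proposed speciesCol argument. Keeping the combined label, rather than picking one species, leaves the decision about ambiguous groups to the downstream analysis.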

csoneson commented 6 months ago

Hi, just to say that in the current version (0.9.4), the user can specify additional feature columns to extract (via the extraFeatureCols argument) that will be added to the rowData of the SummarizedExperiment and can e.g. be added to the feature table or used to highlight points in the interactive volcano plots. While it doesn't really add multi-species analysis support (I'm still not sure exactly what users would like to do here), it at least provides the option to visualize the information.