Open tobiasko opened 9 months ago
Hi @tobiasko - thanks for the suggestion! In general I agree with you - the reason behind the list of supported species above is that those are the species supported by msigdbr, which we use to get GO annotations for the enrichment analysis. If no enrichment analysis is required, any species can be specified (as long as it's a single string), so in that sense analysis of multi-species experiments (or experiments with other species than those in the list) is supported for most of the workflow (I think the only other places it uses the species information are for the STRING annotations and for potentially adding species-specific columns to the output tables; here we can expand the list).
Now, for adding a column to the rowData, I think we would need to get the information about the species from the quantification file (unless we want to make too many assumptions). I'm not sure to what extent this information can be assumed to be available.
I'm trying to make a list of analyses that one might want to do specifically with multi-species experiments in einprot - I guess one would like to be able to annotate/explore each set of features separately, and maybe get separate summaries?
There is one way to get the species information if people stick to UniProt FASTA databases:
The column Protein.Names in the main report contains the UniProt ID, which can be split into two fields at the underscore (_); the second component is always the species, for example RL4_HUMAN.
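A minimal sketch of that split in R, assuming each entry holds a single UniProt-style name of the form NAME_SPECIES. The helper name extractSpecies is hypothetical and not part of einprot; it only illustrates the idea.

```r
# Hypothetical helper (not part of einprot): return the part after the last
# underscore of a UniProt-style name such as "RL4_HUMAN".
extractSpecies <- function(proteinNames) {
    # Entries without an underscore get NA rather than a guessed species
    hasSpecies <- grepl("_", proteinNames, fixed = TRUE)
    species <- sub("^.*_", "", proteinNames)
    species[!hasSpecies] <- NA_character_
    species
}

extractSpecies(c("RL4_HUMAN", "ADH1_YEAST"))
```

Returning NA for names that do not match the pattern avoids assigning a species to entries from non-UniProt databases.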
Right. We could potentially have an additional argument (speciesCol), similar to the arguments to define the IDs and labels, where the user would provide a function to extract the species information from the imported columns.
Ahhhh there is a problem: The column might also contain a list like:
H7C2J9_HUMAN;Q5VWC4_HUMAN
and it could happen that the group shows something like
XXX_speciesA;YYY_speciesB
A multi-species protein group!
🙈 Yeah, in this case I think the function would have to either choose one species, or give back speciesA;speciesB (and then perhaps this protein group would not be highlighted as belonging to any of the species specifically, but that's probably also the most conservative way of treating it).
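The conservative option could look roughly like the following R sketch, which reports every distinct species found in a protein group. It assumes semicolon-separated UniProt-style names; the function name groupSpecies is hypothetical and not part of einprot.

```r
# Hypothetical helper (not part of einprot): summarise the species of a
# protein group given as a ";"-separated string of UniProt-style names.
groupSpecies <- function(proteinGroups, sep = ";") {
    vapply(strsplit(proteinGroups, sep, fixed = TRUE), function(ids) {
        # Keep only IDs with an underscore, then take the part after the last one
        ids <- ids[grepl("_", ids, fixed = TRUE)]
        species <- unique(sub("^.*_", "", ids))
        if (length(species) == 0) {
            return(NA_character_)
        }
        # One species gives e.g. "HUMAN"; mixed groups keep all species,
        # sorted and joined with the same separator
        paste(sort(species), collapse = sep)
    }, character(1))
}

groupSpecies(c("H7C2J9_HUMAN;Q5VWC4_HUMAN", "XXX_speciesA;YYY_speciesB"))
```

With this output, a group could be treated as species-specific only when the result contains no separator, so mixed groups are never highlighted for a single species.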
Hi, just to say that in the current version (0.9.4), the user can specify additional feature columns to extract (via the extraFeatureCols argument) that will be added to the rowData of the SummarizedExperiment and can e.g. be added to the feature table or used to highlight points in the interactive volcano plots. While it doesn't really add multi-species analysis support (I'm still not sure exactly what users would like to do here), it at least provides the option to visualize the information.
Dear einprot developers,
Currently einprot supports the following species
In order to analyse hybrid proteome samples like PXD028735, it would be nice to include
Escherichia coli; NCBI TaxID: 562
Candida albicans (Yeast); NCBI TaxID: 5476
Would that be possible?
Related to this, the description of the species argument in the function runDIANNAnalysis() suggests that multi-species analysis is in general not supported.
I think that multi-species analysis is actually rather common, since many samples also contain host factor contaminations, or people submit samples derived from virus-infected cells. For all of these cases a multi-species analysis could really help. I am also missing a species annotation for the feature/row data in the SummarizedExperiment container.
What do you think?