ftwkoopmans / msdap

MS-DAP: downstream analysis pipeline for quantitative proteomics
GNU General Public License v3.0

remove_proteins_by_name #43

Closed szabzola closed 6 days ago

szabzola commented 6 days ago

Hi,

I am using DIA-NN data with the default contaminants added. This way DIA-NN uses a fasta file of cRAP proteins, in which protein accessions, uniprot_ids and gene names start with "cRAP-". When I use remove_proteins_by_name() with 'gene_symbols = "cRAP"', nothing is removed; with 'regular_expression = "cRAP"' it works (but only if I append the DIA-NN cRAP fasta with these IDs to the main fasta used for the search). This was performed just after dataset = import_fasta().

My problem is that after running analysis_quickstart() there are still cRAP proteins in the dataset, including in differential_abundance_analysis.xlsx, and there are valid fold changes and p-values for those proteins. These can of course be removed from the results, but my main concern is that I want the normalizations within analysis_quickstart() to be performed after removal of abundant contaminants, since these may affect normalization. Contaminants can also serve as QC, but I perform those checks (e.g. total ion intensity assigned to contaminants) outside MS-DAP.

Am I doing or understanding something wrong with the remove_proteins_by_name() function, or should I make my own filtering function before the analysis?

Best regards, Zoltan

szabzola commented 6 days ago

Sorry, I have just noticed your suggestion in Issue #39. Using that method (separate fasta files, and filtering for "sp\|cRAP-") removes all cRAP proteins. Thanks,

ftwkoopmans commented 6 days ago

> When I use remove_proteins_by_name() with 'gene_symbols = "cRAP"', nothing is removed

This works as expected, because these "cRAP proteins" do not have a fasta header that specifies the gene symbol as "cRAP".
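To illustrate the difference between the two matching modes, here is a minimal Python sketch. It is not MS-DAP code: it assumes that gene symbols are compared as whole strings (which is what the reply above implies), and the fasta headers and the `gene_symbol` helper are hypothetical.

```python
import re

# Hypothetical UniProt-style fasta headers; the first one mimics the
# "cRAP-" prefix on accession, uniprot_id and gene name described above.
headers = [
    "sp|cRAP-P02768|cRAP-ALBU_HUMAN Serum albumin OS=Homo sapiens GN=cRAP-ALB",
    "sp|P60709|ACTB_HUMAN Actin, cytoplasmic 1 OS=Homo sapiens GN=ACTB",
]


def gene_symbol(header):
    """Return the GN= field of a header, or "" if absent (hypothetical helper)."""
    match = re.search(r"\bGN=(\S+)", header)
    return match.group(1) if match else ""


symbols = [gene_symbol(h) for h in headers]

# Whole-string comparison (assumed behaviour): "cRAP-ALB" is not equal
# to "cRAP", so no gene symbol matches and nothing would be removed.
exact_hits = [s for s in symbols if s == "cRAP"]

# Pattern search over the full header: the cRAP entry is found.
regex_hits = [h for h in headers if re.search(r"cRAP", h)]

print(symbols)
print(exact_hits)
print(regex_hits)
```

Under these assumptions, a gene symbol filter would only remove an entry whose gene symbol is literally "cRAP", whereas a pattern search also catches the prefix.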

> with 'regular_expression = "cRAP"' it works. (Only if I append the DIA-NN cRAP fasta with these IDs to my main fasta used for search.) This was performed just after dataset = import_fasta().

If you use this regular expression, every protein group whose fasta header contains this sequence of characters anywhere will be removed. Consider a more specific regular expression to avoid undesirable matches. Indeed, the example code shown in https://github.com/ftwkoopmans/msdap/issues/39#issuecomment-2331945469 should work well.
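As a rough illustration of why the more specific pattern is safer, the Python sketch below compares the broad pattern "cRAP" with the "sp\|cRAP-" pattern mentioned earlier in this thread. The headers are invented for the example (the second one deliberately mentions "cRAP" in its description), and Python's `re` module stands in for whatever regex engine and case-sensitivity settings MS-DAP uses, which may differ.

```python
import re

# Invented fasta headers for illustration only.
headers = [
    # A contaminant entry with the "cRAP-" prefix on its accession.
    "sp|cRAP-P00761|cRAP-TRYP_PIG Trypsin OS=Sus scrofa GN=cRAP-TRYP",
    # A made-up non-contaminant whose description happens to mention cRAP.
    "sp|X00000|FAKE_HUMAN Example protein, homolog of a cRAP entry OS=Homo sapiens GN=FAKE",
    # An ordinary protein with no mention of cRAP.
    "sp|P60709|ACTB_HUMAN Actin, cytoplasmic 1 OS=Homo sapiens GN=ACTB",
]

broad = re.compile(r"cRAP")          # matches the characters anywhere in the header
specific = re.compile(r"sp\|cRAP-")  # matches only the prefixed accession

broad_removed = [h for h in headers if broad.search(h)]
specific_removed = [h for h in headers if specific.search(h)]

print(len(broad_removed), "headers match the broad pattern")
print(len(specific_removed), "headers match the specific pattern")
```

With these invented headers the broad pattern also flags the non-contaminant entry, while the anchored pattern flags only the entry whose accession carries the prefix.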

Thanks for making this GitHub issue (and searching the repo for the answer); I will add an elaborate example to the documentation of the remove_proteins_by_name() function in the next MS-DAP release.