Nesvilab / FragPipe

A cross-platform proteomics data analysis suite
http://fragpipe.nesvilab.org

Mark contaminant proteins #297

Closed christophgil closed 2 years ago

christophgil commented 3 years ago

Dear Dr. Fengchao,

A minor problem with the GUI: the file selector box for the FASTA file does not remember the last directory.

I also have a question regarding contaminants. I generated the .fas file with contaminants in FragPipe. In the output files the reversed decoys are recognized by a prefix and can easily be filtered out in R. I was expecting a similar mechanism for filtering out the contaminants, but looking at an output line with a sheep keratin as an example, I see no indication that it is a contaminant other than that it is not human.

In MaxQuant the contaminants have a leading "CONT_".

Thanks,
Christoph

fcyu commented 3 years ago

Hi Christoph,

The path is written to a cache file and is reused the next time FragPipe is opened. Please check whether you have write permission for the cache folders. The folders can be found by clicking 'clear cache and close'.

You are right, we don't mark contaminant proteins.

Best,

Fengchao

anesvi commented 3 years ago

We used to add Contam_ to contaminants.

Felipe, why did we stop doing it?

prvst commented 3 years ago

Repeating here what we discussed internally. We dropped the automatic tagging and cleaning of the report tables because, at that time (2017 - 2018), we had collaborators studying some of the proteins that we normally call contaminants. Since the concept of "contaminant" can change from experiment to experiment, people normally remove them while doing their statistical and functional analysis.

christophgil commented 3 years ago

Dear Felipe,

That is no problem - I made myself an sed script as a workaround.

...
s/^>sp|O76013|KRT36_HUMAN/>sp_cont|O76013|KRT36_HUMAN/1
s/^>sp|O76014|KRT37_HUMAN/>sp_cont|O76014|KRT37_HUMAN/1
s/^>sp|O76015|KRT38_HUMAN/>sp_cont|O76015|KRT38_HUMAN/1
s/^>sp|O77727|K1C15_SHEEP/>sp_cont|O77727|K1C15_SHEEP/1
...
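For anyone who prefers Python over sed, the same header rewrite might look like the sketch below. It is only an illustration: the function name `tag_contaminants`, the `_cont` suffix default, the handling of `tr` entries, and the placeholder sequences are assumptions, not anything FragPipe or Philosopher provides.

```python
import re

# Matches the start of a UniProt-style header such as ">sp|O76013|KRT36_HUMAN".
# Handling "tr" (TrEMBL) entries as well is an assumption; the sed script only touches "sp".
HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|")

def tag_contaminants(lines, accessions, suffix="_cont"):
    """Yield FASTA lines, rewriting ">sp|ACC|" to ">sp_cont|ACC|" for listed accessions."""
    for line in lines:
        m = HEADER.match(line)
        if m and m.group(2) in accessions:
            line = f">{m.group(1)}{suffix}|{m.group(2)}|" + line[m.end():]
        yield line

# Toy input; the sequences are placeholders, not real protein sequences.
fasta = [
    ">sp|O76013|KRT36_HUMAN",
    "MAAAK",
    ">sp|P04637|P53_HUMAN",
    "MEEEK",
]
tagged = list(tag_contaminants(fasta, {"O76013"}))
```

Compared with one sed rule per protein, this keeps the accession list in a single set, which is easier to regenerate when the contaminant list changes.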

Is there a good way to obtain the list of contaminants? I did it rather clumsily, by processing a minimal artificial FASTA and extracting the entries.

I also put the sequence body of each UniProt FASTA entry on one single line to allow for "grep -A 1 _HUMAN". Are these long lines compatible with the software, or should I fold them? From the logs it seems to have run smoothly.
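The single-line layout described here can be produced with a few lines of Python. This is a minimal sketch; `unfold_fasta` is a hypothetical helper name and the sequences are placeholders.

```python
def unfold_fasta(lines):
    """Yield (header, sequence) pairs with each sequence joined onto one line."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # flush the last record

# Toy input with a folded sequence; the residues are placeholders.
folded = [">sp|P00921|CAH2_BOVIN", "MSHH", "WGYG", ">sp|O76013|KRT36_HUMAN", "MATQ"]
records = list(unfold_fasta(folded))
```

Writing each pair back as two lines gives a file in which every header is followed by exactly one sequence line, which is what makes "grep -A 1" usable.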

In the log panel of the GUI window I cannot search with Ctrl-F. This can be worked around by pasting everything into a text editor.

Best regards Christoph

prvst commented 3 years ago

Your changes should be fine. You can check here for the whole list as well. https://www.thegpm.org/crap/

christophgil commented 3 years ago

Great, thanks!

prvst commented 3 years ago

Added to v3.6.0

GianArauz commented 2 years ago

I'm a bit confused about the contaminant annotation in FP output files. According to this closed issue, it seems that, from Philosopher v3.6.0 onward, contaminant proteins should be tagged with Contam_ (just as @anesvi said on Feb 8).

Running the LFQ-MBR workflow in FP v16.0 with Philosopher v4.0.0, there is no clean way to filter out contaminant proteins from the combined_protein.tsv output file. Am I right? Is this still deliberate (as @prvst stated on Feb 8)?

fcyu commented 2 years ago

Yes, I can reproduce your observation. There is no tag for the contaminant proteins. I will reopen this issue. Felipe @prvst, could you please help to resolve this puzzle?

Thanks,

Fengchao

prvst commented 2 years ago

@GianArauz you are saying that the contaminant sequences are there, and the program is not removing them, is that right?

GianArauz commented 2 years ago

@prvst, exactly. For example:

sp|P00921|CAH2_BOVIN P00921 CAH2_BOVIN CA2 260 18.80 Bos taurus

One would expect a boolean column called Contaminant or some kind of tag like:

con_sp|P00921|CAH2_BOVIN P00921 CAH2_BOVIN CA2 260 18.80 Bos taurus
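If the protein identifiers did carry such a tag, a boolean Contaminant column could be derived downstream with very little code. The sketch below assumes a tab-separated table with a column named "Protein" and a prefix of "contam_"; both are assumptions for illustration and should be adjusted to the actual report file.

```python
import csv
import io

def flag_contaminants(tsv_text, column="Protein", prefix="contam_"):
    """Return the rows as dicts, each with an added boolean 'Contaminant' key."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Case-insensitive so that "Contam_" and "contam_" are treated alike.
        row["Contaminant"] = row[column].lower().startswith(prefix.lower())
        rows.append(row)
    return rows

# Toy table with an assumed layout, not a real FragPipe report.
table = (
    "Protein\tGene\n"
    "contam_sp|P00921|CAH2_BOVIN\tCA2\n"
    "sp|P04637|P53_HUMAN\tTP53\n"
)
rows = flag_contaminants(table)
kept = [r["Protein"] for r in rows if not r["Contaminant"]]
```

Keeping the flag as a column, rather than deleting rows, leaves the contaminant hits visible for anyone who wants to inspect them.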

prvst commented 2 years ago

The contaminant tag is optional, as is the addition of such sequences to the database. The reason is that the contaminants are added in batch, and there are people who actually study some of those proteins. To help deal with these different cases, the tagging can be done by adding the flag --contamprefix to the database command when annotating the file. This is also why we don't remove them automatically: people might actually want to see what type of contaminant they are hitting.

prvst commented 2 years ago

Also, @GianArauz, if you already have your results and you are working with human samples, something you can do quickly is remove the hits to non-human organisms, plus the keratins.
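A rough sketch of that quick filter is given here. It assumes each hit exposes a UniProt entry name (ending in `_HUMAN` for human proteins) and a gene symbol, and that keratins can be recognized by gene symbols starting with "KRT". The keys "Entry Name" and "Gene" and the helper name are assumptions, and the KRT rule is a heuristic that also drops keratin-associated (KRTAP) genes.

```python
def is_likely_sample_protein(entry_name, gene, species_suffix="_HUMAN"):
    """Heuristic: keep entries of the target species that are not keratins."""
    is_target_species = entry_name.upper().endswith(species_suffix)
    is_keratin = gene.upper().startswith("KRT")
    return is_target_species and not is_keratin

# Toy hits; the dictionary keys are assumed column names.
hits = [
    {"Entry Name": "CAH2_BOVIN", "Gene": "CA2"},
    {"Entry Name": "KRT36_HUMAN", "Gene": "KRT36"},
    {"Entry Name": "P53_HUMAN", "Gene": "TP53"},
]
kept = [
    h["Entry Name"]
    for h in hits
    if is_likely_sample_protein(h["Entry Name"], h["Gene"])
]
```

This only approximates a contaminant list: human contaminants other than keratins would pass the filter, so an explicit tag remains the more reliable route.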

GianArauz commented 2 years ago

I managed to get the contam_ tag in the fasta using the --contamprefix flag. Thanks!

I think it would be nice to have this flag enabled by default when getting the FASTA through the FP GUI: Database --> Download --> Add common contaminants (TRUE) --> OK.

I agree with @prvst that having the contaminants explicitly listed in the output file is a must (either because one could be interested in some protein that is usually tagged as a contaminant, or just because one needs to track how the "wet" part of the workflow is going).

In any case, I think it would be useful to be able to drop them in a more elegant way (by using the contam_ tag) instead of cherry-picking HUMAN entries except keratins.

Many thanks for your efficient response! And also for developing/sharing/maintaining FP!

fcyu commented 2 years ago

Thanks. We will add an option to FragPipe.

Best,

Fengchao

-- Dr. Fengchao Yu Research Investigator University of Michigan

prvst commented 2 years ago

Thanks for the kind words. I'll discuss some possible changes with my colleagues. Cheers

fcyu commented 2 years ago

Added in version 17.1: https://github.com/Nesvilab/FragPipe/releases/tag/17.1