remove contaminants - Githubissues

xiaoyezao commented 1 year ago

Any advice for how to remove contaminants identified by Omark?

YanNevers commented 1 year ago

Hello @xiaoyezao

I can suggest three ways to remove contaminants detected by OMArk from its outputs.

1. If you only have the annotation, you can remove all sequences that map to the contaminant or to any clade on its phylogenetic path (that is, any protein that maps to a node in the species tree that is closer to the detected contaminant than to the original species). The list of these proteins is in the file with the .ump extension in the OMArk output, under the headers ">Contamination_Full", ">Contamination_Partial", and ">Contamination_Fragment". If you used the '-of' option (which instructs OMArk to output the original sequences as FASTA files split by category), you will find the same proteins in the FASTA file whose file name contains the name of the contaminant clade. You can then filter these proteins out of the original file. Note that this method may lack resolution because of possible mapping errors: it may miss some contaminant proteins and remove a few non-contaminant ones. However, it should filter out the majority of contaminants before further analysis.
2. I am working on a companion script that will map the detected contaminant proteins to their corresponding contigs in the genome assembly. In principle, this will identify contigs in the assembly that come from contaminant species and help filter out the proteins encoded in these regions. Since it integrates additional information from the assembly, we believe it may have better resolution than solution 1. I hope to complete this script in the coming week and will post an update here when it is done.
3. Finally, if you can go back to the assembly itself, I would recommend using a dedicated tool (for example, BlobToolKit) to identify all contaminant contigs and filter them out at that step. Since OMArk only considers protein-coding genes, it may miss contaminant genomic sequences that contain no coding sequence, which I believe such tools would find.
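As a rough illustration of solution 1, here is a minimal Python sketch that collects the protein identifiers listed under the contamination headers of the .ump file and drops them from a proteome FASTA. It is not part of OMArk. It assumes that each category in the .ump file starts with a line beginning with ">" and is followed by one protein identifier per line, and that the first word of each FASTA header is the protein identifier. The function names `read_contaminant_ids` and `filter_fasta` are hypothetical.

```python
def read_contaminant_ids(ump_text):
    """Collect protein IDs listed under any header starting with '>Contamination'.

    Assumption: each category starts with a '>' header line, followed by
    one protein identifier per line.
    """
    ids = set()
    in_contamination = False
    for line in ump_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new category begins; remember whether it is a contamination one.
            in_contamination = line.startswith(">Contamination")
            continue
        if in_contamination:
            ids.add(line.split()[0])
    return ids


def filter_fasta(fasta_text, ids_to_remove):
    """Return FASTA text without the records whose ID is in ids_to_remove.

    Assumption: the record ID is the first word of the header line.
    """
    kept = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            keep = not header or header[0] not in ids_to_remove
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

In practice you would read both files from disk, call `filter_fasta(fasta_text, read_contaminant_ids(ump_text))`, and write the result to a new file. If the identifiers in your .ump file do not match the first word of your FASTA headers, the matching rule needs to be adapted.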

I hope this will help you, Best, Yannis

xiaoyezao commented 1 year ago

Hi Yannis,

Thank you very much. I am waiting for your script. We will see how it works.

Bests,

Xiaoyezao

xiaoyezao commented 1 year ago

> I am working on a companion script that will allow you to map detected contaminant proteins to their corresponding contigs in the genome assembly. In principle, this would help identify contigs in the genome assembly from contaminant species and will help filter out the proteins coming from these regions. Since it would integrate additional information from the assembly, we believe this may have better resolution than solution 1. I hope to complete this script in the coming week and will post an update here when it is done.

Hello @YanNevers , any updates on this script that you have been working on?

YanNevers commented 1 year ago

Hello @xiaoyezao

I apologize for the delay in coming back to you. Yes, I think I have completed a satisfactory version of the script. It has been made available in commit b3e9ea6511de7f057774532734cbcd14158fe5b9 and can be found in the "utils" folder of this repo.

The script itself is named contamination_chromosome_filtering.py, and there is a Jupyter Notebook that provides an interactive interface, Contamination_chromosome_filtering.ipynb.

It needs the gffutils library to run, in addition to the usual libraries required by OMArk. As for how it works, the help message should hopefully say enough to run it, but here is a quick description.

The script looks at the list of contaminants detected by OMArk and searches for stretches of chromosomes or contigs where the density of contaminant genes is higher than a given threshold (it can be given as input; default: 0.5). It then extracts all genes present in these parts of the genome and creates a FASTA file in which the corresponding proteins have been removed from the input proteome.
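To make the density idea concrete, here is a deliberately simplified Python sketch. It is not the implementation used in contamination_chromosome_filtering.py: it treats each whole contig as a single stretch, whereas the script looks for stretches within chromosomes or contigs. The function names and the `gene_to_contig` mapping are hypothetical, and "higher than the threshold" is read as a strict comparison.

```python
from collections import defaultdict


def flag_contaminated_contigs(gene_to_contig, contaminant_genes, threshold=0.5):
    """Return {contig: density} for contigs whose fraction of contaminant
    genes is strictly higher than the threshold.

    Simplification: the whole contig is treated as one stretch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gene, contig in gene_to_contig.items():
        totals[contig] += 1
        if gene in contaminant_genes:
            hits[contig] += 1
    flagged = {}
    for contig, total in totals.items():
        density = hits[contig] / total
        if density > threshold:
            flagged[contig] = density
    return flagged


def genes_on_flagged_contigs(gene_to_contig, flagged):
    """All genes located on a flagged contig, contaminant or not."""
    return {gene for gene, contig in gene_to_contig.items() if contig in flagged}
```

The second function reflects the key design point of the description above: once a region is flagged, every gene in it is removed, including genes that OMArk did not itself label as contaminants.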

To run the script, you can use: contamination_chromosome_filtering.py -i omark_folder -g gff -f input_fasta -o prefix_for_output_files

Where:

omark_folder is the folder containing the OMArk results for your target proteome
input_fasta is the FASTA file of the proteome that you used as input for OMAmer
gff is a GFF3 file corresponding to this proteome. The script has been tested with GFF files available from NCBI and Ensembl; it may need some adjustment for GFF files from other sources if the correspondence between protein IDs and GFF entries is not obvious
prefix_for_output_files is the prefix for the two output files: the filtered FASTA file and a report file indicating which contig positions, genes and proteins were removed.
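Before running the script on a GFF from another source, it may help to check how well the protein IDs in the FASTA file match the GFF entries. The sketch below is a standalone check using only the Python standard library, not something the script provides. It assumes that CDS lines carry a `protein_id` attribute in column 9 and that the first word of each FASTA header is the protein ID; the function names are hypothetical.

```python
def fasta_ids(fasta_text):
    """IDs taken as the first word of each FASTA header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def gff_protein_ids(gff_text, attribute="protein_id"):
    """Values of the given attribute on CDS lines of a GFF3 file.

    Assumption: the protein ID is stored as 'protein_id=...' in column 9.
    Pass another attribute name if your GFF uses a different key.
    """
    ids = set()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip() == attribute and value:
                ids.add(value.strip())
    return ids


def missing_from_gff(fasta_text, gff_text, attribute="protein_id"):
    """FASTA protein IDs that have no matching CDS attribute in the GFF."""
    return fasta_ids(fasta_text) - gff_protein_ids(gff_text, attribute)
```

If `missing_from_gff` returns a large share of your proteins, the correspondence between protein IDs and GFF entries is probably not the one assumed here, and some adjustment is likely needed.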

I hope this will prove useful to you.

PS: If your proteome has multiple contaminants, the previous behavior of OMArk was to report only the genes that could be attributed unambiguously to one of them and not to the others. Proteins placed at a taxonomic level shared by several of the contaminant species were not reported as contaminants. This may be an issue with this script, so if this is your case, I would advise you to install the latest OMArk version from this repo, where this behavior has changed. If you have only one contaminant, this is not needed.

DessimozLab / OMArk

remove contaminants #19